|
Database | Introduction |
SIR/XS database management is covered under a number of topics:
Virtually all of the features are available through the menus and dialogs and this is the normal way to manage a database on a regular basis.
SIR/XS provides a concurrent environment if multiple people need to update a database at the same time across a network.
A database is made up of one or more record types. Each record type contains variables. Each record type may have one or more variables defined as keys. Keys are used to identify each individual record.
A record may have secondary indexes allowing access by VisualPQL and PQLForms through values in non-key variables.
The definition of the database consists of overall information about the database plus definitions for each record type with definitions of each variable and any secondary indexes. This is collectively known as the Schema.
Records and variables are retrieved by name and applications do not need to know anything about the physical organisation of data. i.e. Applications are independent of the physical structure of data.
Operating system files used by SIR/XS applications are assigned an internal name known as an Attribute. This is used in place of the full filename. If specifying full operating system filenames, it is best to enclose these in quotes.
Every application that accesses data from a database uses the schema, which means that the information is always verified and conforms to the schema definition. An application can access any record type within a database and can access multiple record types for complex processing. The schema can be defined through interactive dialogs or by creating a set of commands and running them in a similar way to a running a program. The resulting database is exactly the same regardless of the way that it is defined.
The database definition can be modified even after the data has been loaded without, in many cases, having to
There is a second method to store data known as Tables. A table holds a single record type and is independent of any database. Tables are held on tabfiles. A tabfile can hold many tables. An application can operate on many tables and many tabfiles.
Some databases have a natural structure known as a Case structure. For example, a medical database with information about patients could have a number of record types such as patient demographics, visits, procedures, followups, etc., most with multiple occurrences but all of the information about a single individual makes up a case. i.e. A case is a set of records that all refer to one single entity.
A case structure can be easier to use. Queries are simpler to write and less error prone within a case. A case structure can be faster. Since all of the information for any given case is available with a minimum of searching, access is very fast.
A case structure can be found in many applications. The following guidelines may help determine whether a particular set of data has a case structure.
The main index that SIR/XS maintains to locate the records is built from the key and the key determines how the record is physically stored in the database.
Because the keys go into the index, they are relatively more "expensive" to maintain than non-key variables. Do not declare more keyfields than needed to make records unique or to define relationships. Whenever choosing variables to use for a key, use variables that are short and well defined. Use numeric codes wherever possible, particularly in place of long alphabetic strings. Using strings as keys leads to misspellings and possible confusions as well as being inefficient. Avoid floating point numbers as key fields.
Secondary indexes can be used to retrieve data in other sequences.
Specifying the complete key in a retrieval is the most efficient way to retrieve a single record. This can use the main index and directly retrieve the record.
A retrieval can locate records using just part of the key providing that it specifies the higher level key variables. Make higher level keys the ones more likely to be used to retrieve by.
Within a record type, variable names must be unique. Variables in different record types may have the same name. Beware of using the same name for different things in different record types in the same database. For example, it would be a mistake to use the name '
A categorical variable is a character string that has a limited number of values specified as an ordered list. When the variable is input as a string, it is compared to the list and the number that corresponds to the matching position in the list is stored instead of the value. This has the advantage that only valid entries are held and considerable space is saved. In programs and reports, the full string is displayed and retrieved.
For example, a categorical variable might be a list of the names of American states. If 'Alabama' were the first entry in the list, a variable entered as 'Alabama' results in '1' being stored.
The list is held in the data dictionary and is searched sequentially. It is a very simple and easy to use facility for short lists that are not updated very often. If there are hundreds or thousands of entries, or there is more information about each entry than just the name, or users need to modify the entries, use tables or record types with indexes to store this type of reference data.
The date format is a string, up to 32 characters, in quotes and describes both an input and an output format. The input format is used when data is read during batch data entry, or when a string value is assigned to the variable. The format is used to interpret the input data (see below). The output format is used as a default when the variable is written and the output date corresponds exactly to the format specified (this can be overridden by other date specifications at that point).
The date format consists of a combination of letters with special meanings and other characters used as separators. The letters M(month), D(day), Y(year), W(day of week/week number) and I (default separator) are special characters (upper/lower case can be used); all other characters in the format are treated as separators. The 'I' separator results in a blank separator on output. Like characters must be kept together, e.g. a format such as Date formats can be partial formats, without a day, month or year component. If the year is omitted, it is taken to be the current year. If the decade is omitted, it is taken to be the current decade. If the day or the month is omitted, they are taken to be 1. If there are more than two month characters, then English month names are used to the length specified. Names on output are in upper/lower case to match the case of the format. There are two special formats which do not have a month component. A date can be a year/day format (in either sequence) which must allow at least three characters for days (1-365). A date can be a year /week format (in either sequence) which should allow at least two characters for week numbers (1-53). (Week numbers follow the ISO standard; a week always begins on a Monday, and ends on a Sunday. The first week of a year is that week which contains the first Thursday of the year, or, equivalently, contains Jan-4.) The 'W' character, if not in a week number format, represents the English name for days of the week to the length specified. Names on output are in upper/lower case to match the case of the format. If month names or days of the week names are specified, on output all dates have the same length name. If the specific name is shorter than the format it is padded with blanks.
On input, if a date format has no separators and the input value has no separators, the input must correspond exactly to the format. e.g. format
Months can be input as names (or partial names) or as month numbers regardless as to the format. Days of the week have no relevance on input and any text is skipped.
If the full four digits of the year are not entered, the input is tested against the century split parameter
For example, various date maps allow a sample of possible inputs and how the date is displayed as follows:
The time format is a string, up to 32 characters, in quotes and describes both an input and an output format. The input format is used when data is read during batch data entry, or when a string value is assigned to the variable. The format is used to interpret the input data (see below). The output format is used as a default when the variable is written and the output time corresponds exactly to the format specified (this can be overridden by other time specifications at that point).
The time format consists of a combination of letters with special meanings and other characters used as separators. The letters H(hour), M(minute), S(second), I(default separator) and P or A (AM/PM) (upper/lower case can be used); all other characters in the format are treated as separators. The 'I' separator results in a colon : separator on output. Like characters must be kept together, e.g. a format such as
A time is normally taken to be a 24 hour time format. Specify 'PP' ('A' is a synonym for 'P') for a 12 hour format. On output 'AM' or 'PM' are written as appropriate. On input, a 'P' in the data indicates a 12 hour time at or after midday (a 24 hour time is still valid on input).
On input, if a time format has no separators and the input has no separators, the input must correspond exactly to the format. e.g. format
For example, various time maps allow a sample of possible inputs and how the time is displayed as follows:
The software handles any scaling issues involved. For example, if a scaled integer has a scaling factor of 2 (hundreds), and it is computed equal to 50 x 4, the database stores a 2. If the variable is printed, 200 is output. If used in another calculation, it would be 200. For all practical purposes it has the value 200, except it saves storage space.
The major limitation on scaled integers is that the maximum integer value is 2,147,483,643. If scaled to a negative power, this may not be large enough. For example, scaling to -2 for money, gives a maximum value of 21,474,836.43. (For larger money values, use whole cents in a double precision floating point variable:R*8)
For most data definition purposes, specify how the variable looks externally and the appropriate internal format is created automatically. However it is as well to understand these transformations.
To input a negative number (whether integer or floating point), simply precede the number with a minus sign (-).
For example, a variable might have two valid values such as 'Y', and 'N' meaning 'Yes' and 'No'. A further three missing values might be defined such as 'X' for 'Not relevant', 'R' for 'Refused to Answer' and 'Z' for 'Invalid Answer'. When producing statistics on that variable, only the Y and N answers are included as the others are defined as missing values.
Blanks may be declared as a missing value. If a numeric field is blank on input and blank has been defined as a missing value, the variable is stored as missing. If blanks are not declared as a missing value for the variable, zero is stored when a numeric field is blank on input.
Some existing computer files may not have explicit decimal points but may have an implied decimal point at a given position. For example, a format of F4.1 or a scale of (-1) means that the number in these four positions has one decimal place if an explicit decimal point is not quoted. If this field contained "0012", it is read as "1.2", however if it contained 1.234, then 1.234 is the value that would be stored.
The decimal positions describe an implicit input format and an implicit output format. They do not describe the maximum number of decimal positions that can be stored in a field. If a number is defined as D10.0, it can still be computed equal to "0.12345" and .12345 is stored.
On some schema definition commands (
Overview
There can be any number of different databases for different applications. A SIR/XS session can have any number of databases connected at any one time and one of these is the current or default database. Any database operations happen on the default database. SQL, Forms, PQLForms and VisualPQL can access multiple databases.Names
All of the various SIR/XS entities such as records and variables have names that must conform to the same set of rules. A name can be a standard name that is up to 32 characters long, does not begin with a number and contains only letters, numbers and four characters (dollar sign $, hash sign #, at sign @, underscore _). Letters in standard names are translated to uppercase. A non-standard name can be used that does not conform to these rules, for example it might contain lower case characters or have embedded spaces. Non-standard names can be up to 30 characters and must be enclosed in curly brackets {...}. Non-standard names can contain any character (except curly brackets) and, where entities are listed by name, non-standard names are in the appropriate sort sequence position in the list.UNLOAD
and
RELOAD
data.CASE Structured Databases
SIR/XS is a true relational database and databases can be created that are simply sets of record types. However SIR/XS also allows databases that can take advantage of structure in the data:The Common Information Record
In a case structured database some information is maintained about each case. This is held in the Common Information Record or CIR. The CIR is a summary of data about the case. It holds the case identifier, counts of records belonging to the case and other specified common variables. It typically holds values that only occur once in the case although it can hold a copy of the last value entered for variables that occur multiple times in the case.CASELESS Databases
Records in a caseless database have no single common element that relates them. If a single database has various records types that are each independent from the other (e.g. Parts, Customers, Employees), then it might be a caseless database or perhaps the various entities might be held in individual case structured databases.Keys
Each record in a SIR/XS database is unique and that uniqueness is defined in terms of the values of keys. A record may have one or more variables that make up the key. No two records in the database can have the same combination of record type and key values.Order of Variables in Keys
In keys made up of multiple variables, the order in which the variables are specified in the key is important.Joining Records with Keyfields
Data from two record types can be joined by using a keyfield that is common to both record types. If keyfields in different record types have the same name and are in the same order, these record types are implicitly joined. A record type with one extra keyfield defines a hierarchy. For example, one record type may have CUSTNO
as the key, a second record type CUSTNO
and ORDERNO
, a third record type CUSTNO
, ORDERNO
, LINENO
, etc. These can go down through many levels of record types if needed, and the implicit joins
make retrievals simpler and more efficient.Defining Variables
Every variable in a database has a name and a structured definition. Whenever a variable is entered into the database, it is checked to ensure that it conforms to the definition. Variables are always referred to by their name. Descriptive names are usually best. For example, a variable called 'PHONE
' is easier to remember than one called 'VAR112
'.DATE
' for variables when they are different dates, such as Birth Date, Visit Date, etc.Variable Data Types
Every variable is one of three basic data types:INTEGER
FLOATING POINT
STRING
Variable Extended Data Types
Although every variable is stored as one of the three basic types, there are several extended data types that translate information as they are entered or retrieved. These are:CATEGORICAL
DATE
'MM/DD/YYYY'
that is used to interpret the input and to format the output. Within PQL programs, dates can be used as numbers for calculations or as character strings for display or input purposes. As dates are stored as the number of days since a predetermined point in the past, it is very simple to perform date based calculations and there are no potential problems at the end of centuries or millennia. There is currently an upper limit of Feb 28 3000 (day 517686) on the conversion of numeric days to/from the calendar.'MYYYM'
is invalid.'DDMMYYYY'
input must have leading zeroes when necessary such as '05062006'. If the input has separators, each component is taken to be variable length up to the separator or to the end of the input field regardless as to the specification of separators in the format. e.g. for format 'DD/MM/YYYY'
or format 'DDMMYYYY'
input could be '5/6/6'. Separators do not have to match specific characters. e.g. A '/' can be specified in the format and the input could contain a blank.CENY
. This is 1930 by default and can be set by the application. An input year greater than the century split is set to the specified century; years less than this are set to the next century. e.g. 99 becomes 1999; 7 becomes 2007. This calculation is not dependent on the current date in the system and therefore does not alter at any particular point in time.
Format Possible Inputs Displayed Date
------------------ ---------------- ----------------
'mmddyyyy' 05312006 05312006
'mmddyyyy' 5 31 2006 05312006
'MMIDDIYY' 5/31/6 or 5 31 2006 05 31 06
'MM/DD/YYYY' 5-31-6 or 5 31 06 05/31/2006
'DD-MM-YY' 31/5/2006 or 31 May 6 31-05-06
'YYYY' 2006 or 6 2006
'MMM DD, YY' As per other M/D/Y formats MAY 31, 06
'WWW, DD MMM YYYY' 31/05/06 or xxx 31/5/2006 FRI, 31 MAY 2006
'Mmm/DD/YYYY' As per other M/D/Y formats May/31/2006
'Www, Mmm dddd' xxxx 05/31 Fri, May 31st
'yyyy/ww 6/13 2006/13 (week number)
'ddd/yy' 85/2006 085/06 (day number)
If a datemap starts with the letter "E" then this is an exact date map and the value input into the date variable must match the map exactly.
The E flag is not a part of the input format but indicates that the following format is mandatory: digits must be entered for each M,D and Y in the map and their positions map exactly to the column positions in the datemap. Leading zeros must be entered. Numbers cannot be entered in columns that are not mapped to M, D or Y.
Format Input Value
------------------ -------------- ----------------
'Emmddyyyy' 05312006 accepted
5312006 rejected (needs leading zero)
05 31 2006 rejected (misaligned columns)
'EMMIDDIYY' 05/31/06 accepted
05031/06 rejected (number out of place)
'EMMMIDDIYYYY' MAR 31 2006 rejected (numbers required)
05 31 2006 rejected (need all leading zeros)
005 31 2006 accepted
TIME
'HHMMSS'
that is used to interpret the input and format the output.'HMMH'
is invalid.'HHMM'
input must have leading zeroes when necessary such as '0805'. If the input has separators, each component is taken to be variable length up to the separator or to the end of the input field. e.g. for format 'HH:MM'
or 'HHMM'
input could be '8:5'. Separators do not have to match specific characters. e.g. A : can be specified in the format and the input could contain a blank.
Format Possible Inputs Displayed Time
------------------ ---------------- ----------------
'hhmm' 2330 23-30 2330
'hhImm' 23 30 23-30 23:30
'HH:MM:SS' 23 30 23:30:00
'HH MM PP' 23 30 11:30 P 11 30 PM
'HH MM PP' 23 30 AM 11 30 PM
'HH MM PP' 12 00 PM 12 00 PM (noon)
'HH MM PP' 12 00 AM 12 00 AM (midnight)
If a timemap starts with the letter "E" then this is an exact time map and the value input into the variable must match the map exactly.
The E flag is not a part of the input format but indicates that the following format is mandatory: digits must be entered for each H,M and S in the map and their positions map exactly to the column positions in the timemap. Leading zeros must be entered. Numbers cannot be entered in columns that are not mapped to H, M or S.
Format Input Value
------------------ -------------- ----------------
'Ehhmmss ' 083700 accepted
83705 rejected (needs leading zero)
08 37 20 rejected (misaligned columns)
'EHH:MM:SS' 08/37/46 accepted (separator types are not significant)
08037:46 rejected (number out of place)
SCALED
Variable Formats
Variables can be defined in terms of an external format. These formats are:Aw
Iw
Date 'format'
Time 'format'
Fw.n
Variable Size
The size of a variable is defined by either the external format or internal storage type:Variable Internal Formats
Variables are stored as:I* 1 | 2 | 4
I*1 have a range of -128 to +123;
I*2 have a range of -32,768 to +32,763;
I*4 have a range of -2,147,483,648 to 2,147,483,643.R* 4 | 8
Variable Quality Control
Whenever variables are entered, they are checked to make sure that the input conforms to the defined data type, size and format. Several checks may be specified:Valid Values
Variable Ranges
Missing Values
Batch Data Input
Variable Label
Variables on screen displays and printed output can be displayed with the variable name or with an optional, 78 character label.Variable Documentation
The schema can hold documentation about a variable. There can be as many lines of documentation about a variable as necessary. This is simply stored with the rest of the definition of the variable and is listed as required.Value Labels
Labels can be defined for specific values of variables. For instance, for a variable Gender, the value 1 may have the label 'Male' and the value 2 may have the label 'Female'. These labels may be used in reports, etc. but generally, internally within programs, the numeric values are used.Decimal Points
When reading numbers from external files or specifying numbers in definitions or programs, an actual decimal point "." can be present in the number. In definitions and programs, if the decimal point is omitted, it is assumed at the right of the number. For example:
COMPUTE A = 1.2
COMPUTE B = 100
VALUE LABELS, VAR RANGES, MISSING VALUES
etc.), numbers can be specified. If the number has a decimal component, specify an explicit decimal point and the actual value regardless of any input definition of implied decimals. For example:
MISSING VALUES RESULT (99.99)