Home	Starting	Environment	DBMS	VisualPQL	Procedures	SQL	Forms	Host/API	Index

Introduction

SIR/XS Databases

This section of the documentation describes how to manage databases using SIR/XS. Anyone intending to create or change the definitions of a database or act as a database administrator should be familiar with this material. If you are simply going to access SIR/XS databases using applications developed by someone else, then the Database Overview provides a good introduction.

SIR/XS database management is covered under a number of topics:

Introduction outlines the major database facilities in SIR/XS.
Database Definition Commands describes the commands to define databases and records. Use these same commands to modify existing definitions together with some additional commands that only apply to updating existing definitions.
File Input describes utilities that load data into a database from text files.
Writing Data and Backing Up describes utilities that produce copies of a database and the verification and recovery utilities, including utilities to upload changes from another database, to merge data from another database and to reload a database.
Tabfiles and Tables provides an overview of the various facilities available for the creation and maintenance of tabfiles and tables.
Tuning and Efficiency discusses the internals of the SIR/XS database physical organisation.

Virtually all of the features are available through the menus and dialogs and this is the normal way to manage a database on a regular basis.

SIR/XS provides a concurrent environment if multiple people need to update a database at the same time across a network.

Overview

There can be any number of different databases for different applications. A SIR/XS session can have any number of databases connected at any one time and one of these is the current or default database. Any database operations happen on the default database. SQL, Forms, PQLForms and VisualPQL can access multiple databases.

A database is made up of one or more record types. Each record type contains variables. Each record type may have one or more variables defined as keys. Keys are used to identify each individual record.

A record may have secondary indexes allowing access by VisualPQL and PQLForms through values in non-key variables.

The definition of the database consists of overall information about the database plus definitions for each record type with definitions of each variable and any secondary indexes. This is collectively known as the Schema.

Names

All of the various SIR/XS entities such as records and variables have names that must conform to the same set of rules. A name can be a standard name that is up to 32 characters long, does not begin with a number and contains only letters, numbers and four characters (dollar sign $, hash sign #, at sign @, underscore _). Letters in standard names are translated to uppercase. A non-standard name can be used that does not conform to these rules, for example it might contain lower case characters or have embedded spaces. Non-standard names can be up to 30 characters and must be enclosed in curly brackets {...}. Non-standard names can contain any character (except curly brackets) and, where entities are listed by name, non-standard names are in the appropriate sort sequence position in the list.

Records and variables are retrieved by name and applications do not need to know anything about the physical organisation of data. i.e. Applications are independent of the physical structure of data.

Operating system files used by SIR/XS applications are assigned an internal name known as an Attribute. This is used in place of the full filename. If specifying full operating system filenames, it is best to enclose these in quotes.

Every application that accesses data from a database uses the schema, which means that the information is always verified and conforms to the schema definition. An application can access any record type within a database and can access multiple record types for complex processing. The schema can be defined through interactive dialogs or by creating a set of commands and running them in a similar way to a running a program. The resulting database is exactly the same regardless of the way that it is defined.

The database definition can be modified even after the data has been loaded without, in many cases, having to UNLOAD and RELOAD data.

There is a second method to store data known as Tables. A table holds a single record type and is independent of any database. Tables are held on tabfiles. A tabfile can hold many tables. An application can operate on many tables and many tabfiles.

CASE Structured Databases

SIR/XS is a true relational database and databases can be created that are simply sets of record types. However SIR/XS also allows databases that can take advantage of structure in the data:

Some databases have a natural structure known as a Case structure. For example, a medical database with information about patients could have a number of record types such as patient demographics, visits, procedures, followups, etc., most with multiple occurrences but all of the information about a single individual makes up a case. i.e. A case is a set of records that all refer to one single entity.

A case structure can be easier to use. Queries are simpler to write and less error prone within a case. A case structure can be faster. Since all of the information for any given case is available with a minimum of searching, access is very fast.

A case structure can be found in many applications. The following guidelines may help determine whether a particular set of data has a case structure.

If most of the information belongs to some given single entity. For example; is the information all about Employees, or Prisoners, or Drill Holes, or Patients, or Samples ...?
Is the information entered from forms and stored in a number of record types that all apply to a single entity? For example, with questionnaire data, the questionnaire number may be a case identifier.
Is there some basic reporting unit in the application? If so, this unit might well be the case. Are results all about a given site or a particular experiment?
Do most or all of the records have a hierarchical structure branching down from a single record.
Is there a key variable in common across all record types that would qualify as a case identifier?

The Common Information Record

In a case structured database some information is maintained about each case. This is held in the Common Information Record or CIR. The CIR is a summary of data about the case. It holds the case identifier, counts of records belonging to the case and other specified common variables. It typically holds values that only occur once in the case although it can hold a copy of the last value entered for variables that occur multiple times in the case.

CASELESS Databases

Records in a caseless database have no single common element that relates them. If a single database has various records types that are each independent from the other (e.g. Parts, Customers, Employees), then it might be a caseless database or perhaps the various entities might be held in individual case structured databases.

Keys

Each record in a SIR/XS database is unique and that uniqueness is defined in terms of the values of keys. A record may have one or more variables that make up the key. No two records in the database can have the same combination of record type and key values.

The main index that SIR/XS maintains to locate the records is built from the key and the key determines how the record is physically stored in the database.

Because the keys go into the index, they are relatively more "expensive" to maintain than non-key variables. Do not declare more keyfields than needed to make records unique or to define relationships. Whenever choosing variables to use for a key, use variables that are short and well defined. Use numeric codes wherever possible, particularly in place of long alphabetic strings. Using strings as keys leads to misspellings and possible confusions as well as being inefficient. Avoid floating point numbers as key fields.

Secondary indexes can be used to retrieve data in other sequences.

Order of Variables in Keys

In keys made up of multiple variables, the order in which the variables are specified in the key is important.

Specifying the complete key in a retrieval is the most efficient way to retrieve a single record. This can use the main index and directly retrieve the record.

A retrieval can locate records using just part of the key providing that it specifies the higher level key variables. Make higher level keys the ones more likely to be used to retrieve by.

Joining Records with Keyfields

Data from two record types can be joined by using a keyfield that is common to both record types. If keyfields in different record types have the same name and are in the same order, these record types are implicitly joined. A record type with one extra keyfield defines a hierarchy. For example, one record type may have CUSTNO as the key, a second record type CUSTNO and ORDERNO, a third record type CUSTNO, ORDERNO, LINENO, etc. These can go down through many levels of record types if needed, and the implicit joins make retrievals simpler and more efficient.

Defining Variables

Every variable in a database has a name and a structured definition. Whenever a variable is entered into the database, it is checked to ensure that it conforms to the definition. Variables are always referred to by their name. Descriptive names are usually best. For example, a variable called 'PHONE' is easier to remember than one called 'VAR112'.

Within a record type, variable names must be unique. Variables in different record types may have the same name. Beware of using the same name for different things in different record types in the same database. For example, it would be a mistake to use the name 'DATE' for variables when they are different dates, such as Birth Date, Visit Date, etc.

Variable Data Types

Every variable is one of three basic data types:

INTEGER: Integers are the set of natural numbers both positive and negative including zero.

Variable Extended Data Types

Although every variable is stored as one of the three basic types, there are several extended data types that translate information as they are entered or retrieved. These are:

CATEGORICAL

Categorical variables offer an efficient way to store predefined strings.

A categorical variable is a character string that has a limited number of values specified as an ordered list. When the variable is input as a string, it is compared to the list and the number that corresponds to the matching position in the list is stored instead of the value. This has the advantage that only valid entries are held and considerable space is saved. In programs and reports, the full string is displayed and retrieved.

For example, a categorical variable might be a list of the names of American states. If 'Alabama' were the first entry in the list, a variable entered as 'Alabama' results in '1' being stored.

The list is held in the data dictionary and is searched sequentially. It is a very simple and easy to use facility for short lists that are not updated very often. If there are hundreds or thousands of entries, or there is more information about each entry than just the name, or users need to modify the entries, use tables or record types with indexes to store this type of reference data.

DATE

A date variable is typically entered as a character string which is converted to an integer that is the number of days since the start of the Gregorian calendar. Day 1 is October 15, 1582. A date has a format such as 'MM/DD/YYYY' that is used to interpret the input and to format the output. Within PQL programs, dates can be used as numbers for calculations or as character strings for display or input purposes. As dates are stored as the number of days since a predetermined point in the past, it is very simple to perform date based calculations and there are no potential problems at the end of centuries or millennia. There is currently an upper limit of Feb 28 3000 (day 517686) on the conversion of numeric days to/from the calendar.

The date format is a string, up to 32 characters, in quotes and describes both an input and an output format. The input format is used when data is read during batch data entry, or when a string value is assigned to the variable. The format is used to interpret the input data (see below). The output format is used as a default when the variable is written and the output date corresponds exactly to the format specified (this can be overridden by other date specifications at that point).

The date format consists of a combination of letters with special meanings and other characters used as separators. The letters M(month), D(day), Y(year), W(day of week/week number) and I (default separator) are special characters (upper/lower case can be used); all other characters in the format are treated as separators. The 'I' separator results in a blank separator on output. Like characters must be kept together, e.g. a format such as 'MYYYM' is invalid.

Date formats can be partial formats, without a day, month or year component. If the year is omitted, it is taken to be the current year. If the decade is omitted, it is taken to be the current decade. If the day or the month is omitted, they are taken to be 1. If there are more than two month characters, then English month names are used to the length specified. Names on output are in upper/lower case to match the case of the format. There are two special formats which do not have a month component. A date can be a year/day format (in either sequence) which must allow at least three characters for days (1-365). A date can be a year /week format (in either sequence) which should allow at least two characters for week numbers (1-53). (Week numbers follow the ISO standard; a week always begins on a Monday, and ends on a Sunday. The first week of a year is that week which contains the first Thursday of the year, or, equivalently, contains Jan-4.) The 'W' character, if not in a week number format, represents the English name for days of the week to the length specified. Names on output are in upper/lower case to match the case of the format. If month names or days of the week names are specified, on output all dates have the same length name. If the specific name is shorter than the format it is padded with blanks.

On input, if a date format has no separators and the input value has no separators, the input must correspond exactly to the format. e.g. format 'DDMMYYYY' input must have leading zeroes when necessary such as '05062006'. If the input has separators, each component is taken to be variable length up to the separator or to the end of the input field regardless as to the specification of separators in the format. e.g. for format 'DD/MM/YYYY' or format 'DDMMYYYY' input could be '5/6/6'. Separators do not have to match specific characters. e.g. A '/' can be specified in the format and the input could contain a blank.

Months can be input as names (or partial names) or as month numbers regardless as to the format. Days of the week have no relevance on input and any text is skipped.

If the full four digits of the year are not entered, the input is tested against the century split parameter CENY. This is 1930 by default and can be set by the application. An input year greater than the century split is set to the specified century; years less than this are set to the next century. e.g. 99 becomes 1999; 7 becomes 2007. This calculation is not dependent on the current date in the system and therefore does not alter at any particular point in time.

For example, various date maps allow a sample of possible inputs and how the date is displayed as follows:

Format                    Possible Inputs             Displayed Date
------------------        ----------------            ----------------
'mmddyyyy'                05312006                    05312006
'mmddyyyy'                5 31 2006                   05312006
'MMIDDIYY'                5/31/6 or 5 31 2006         05 31 06
'MM/DD/YYYY'              5-31-6 or 5 31 06           05/31/2006
'DD-MM-YY'                31/5/2006 or 31 May 6       31-05-06
'YYYY'                    2006   or 6                 2006
'MMM DD, YY'              As per other M/D/Y formats  MAY 31, 06
'WWW, DD MMM YYYY'        31/05/06  or xxx 31/5/2006  FRI, 31 MAY 2006
'Mmm/DD/YYYY'             As per other M/D/Y formats  May/31/2006
'Www, Mmm dddd'           xxxx 05/31                  Fri, May 31st
'yyyy/ww                  6/13                        2006/13  (week number)
'ddd/yy'                  85/2006                     085/06   (day number)

If a datemap starts with the letter "E" then this is an exact date map and the value input into the date variable must match the map exactly. The E flag is not a part of the input format but indicates that the following format is mandatory: digits must be entered for each M,D and Y in the map and their positions map exactly to the column positions in the datemap. Leading zeros must be entered. Numbers cannot be entered in columns that are not mapped to M, D or Y.

Format               Input           Value
------------------   --------------  ----------------
'Emmddyyyy'          05312006        accepted
                      5312006        rejected (needs leading zero)
                     05 31 2006      rejected (misaligned columns)
'EMMIDDIYY'          05/31/06        accepted
                     05031/06        rejected (number out of place)
'EMMMIDDIYYYY'       MAR 31 2006     rejected (numbers required)
                      05 31 2006     rejected (need all leading zeros)
                     005 31 2006     accepted

TIME

A time variable is typically entered as a character string and stored as an integer that is the number of seconds since midnight. A time has a time format such as 'HHMMSS' that is used to interpret the input and format the output.

The time format is a string, up to 32 characters, in quotes and describes both an input and an output format. The input format is used when data is read during batch data entry, or when a string value is assigned to the variable. The format is used to interpret the input data (see below). The output format is used as a default when the variable is written and the output time corresponds exactly to the format specified (this can be overridden by other time specifications at that point).

The time format consists of a combination of letters with special meanings and other characters used as separators. The letters H(hour), M(minute), S(second), I(default separator) and P or A (AM/PM) (upper/lower case can be used); all other characters in the format are treated as separators. The 'I' separator results in a colon : separator on output. Like characters must be kept together, e.g. a format such as 'HMMH' is invalid.

A time is normally taken to be a 24 hour time format. Specify 'PP' ('A' is a synonym for 'P') for a 12 hour format. On output 'AM' or 'PM' are written as appropriate. On input, a 'P' in the data indicates a 12 hour time at or after midday (a 24 hour time is still valid on input).

On input, if a time format has no separators and the input has no separators, the input must correspond exactly to the format. e.g. format 'HHMM' input must have leading zeroes when necessary such as '0805'. If the input has separators, each component is taken to be variable length up to the separator or to the end of the input field. e.g. for format 'HH:MM' or 'HHMM' input could be '8:5'. Separators do not have to match specific characters. e.g. A : can be specified in the format and the input could contain a blank.

For example, various time maps allow a sample of possible inputs and how the time is displayed as follows:

Format                    Possible Inputs             Displayed Time
------------------        ----------------            ----------------
'hhmm'                    2330   23-30                2330
'hhImm'                   23 30  23-30                23:30
'HH:MM:SS'                23 30                       23:30:00
'HH MM PP'                23 30  11:30 P              11 30 PM
'HH MM PP'                23 30 AM                    11 30 PM
'HH MM PP'                12 00 PM                    12 00 PM (noon)
'HH MM PP'                12 00 AM                    12 00 AM (midnight)

If a timemap starts with the letter "E" then this is an exact time map and the value input into the variable must match the map exactly. The E flag is not a part of the input format but indicates that the following format is mandatory: digits must be entered for each H,M and S in the map and their positions map exactly to the column positions in the timemap. Leading zeros must be entered. Numbers cannot be entered in columns that are not mapped to H, M or S.

Format               Input           Value
------------------   --------------  ----------------
'Ehhmmss  '          083700          accepted
                     83705           rejected (needs leading zero)
                     08 37 20        rejected (misaligned columns)
'EHH:MM:SS'          08/37/46        accepted (separator types are not significant)
                     08037:46        rejected (number out of place)

SCALED

A scaled integer represent numbers multiplied by a power of ten (the power is the scale factor). These can include a decimal component (a negative scale factor) or can be a number with a set number of trailing zeros (a positive scale factor). This is efficient for data that has this characteristic of a fixed scale. For example, Money in dollars and cents would be scaled at -2.

The software handles any scaling issues involved. For example, if a scaled integer has a scaling factor of 2 (hundreds), and it is computed equal to 50 x 4, the database stores a 2. If the variable is printed, 200 is output. If used in another calculation, it would be 200. For all practical purposes it has the value 200, except it saves storage space.

The major limitation on scaled integers is that the maximum integer value is 2,147,483,643. If scaled to a negative power, this may not be large enough. For example, scaling to -2 for money, gives a maximum value of 21,474,836.43. (For larger money values, use whole cents in a double precision floating point variable:R*8)

Variable Formats

Variables can be defined in terms of an external format. These formats are:

Aw: A string w long. Strings are stored up to this length with trailing blanks compressed so very long fields (up to 4094) can be defined with little overhead. If using strings as key fields or as fields used in secondary indexes, make them as short as possible. If the variable is a categorical variable, the variable is stored as a number that varies in internal size depending on how many entries are in the list of allowed values.

Variable Size

The size of a variable is defined by either the external format or internal storage type:

A variable can be defined in terms of the external characters or format it is input in or displayed as. For example, a date variable with a date format of 'dd/mm/yyyy' takes ten external characters.
A variable can be defined in terms of the internal storage it takes. A date variable, regardless of date format, is stored as a four byte integer.

For most data definition purposes, specify how the variable looks externally and the appropriate internal format is created automatically. However it is as well to understand these transformations.

Variable Internal Formats

Variables are stored as:

I* 1 | 2 | 4: Integers of 1, 2 or 4 bytes.
I*1 have a range of -128 to +123;
I*2 have a range of -32,768 to +32,763;
I*4 have a range of -2,147,483,648 to 2,147,483,643.
R* 4 | 8

Variable Quality Control

Whenever variables are entered, they are checked to make sure that the input conforms to the defined data type, size and format. Several checks may be specified:

Valid Values: Valid Values are a list of values that are allowable for a given variable. Input that does not match the valid value list for a variable is rejected.

Some existing computer files may not have explicit decimal points but may have an implied decimal point at a given position. For example, a format of F4.1 or a scale of (-1) means that the number in these four positions has one decimal place if an explicit decimal point is not quoted. If this field contained "0012", it is read as "1.2", however if it contained 1.234, then 1.234 is the value that would be stored.

The decimal positions describe an implicit input format and an implicit output format. They do not describe the maximum number of decimal positions that can be stored in a field. If a number is defined as D10.0, it can still be computed equal to "0.12345" and .12345 is stored.

On some schema definition commands (VALUE LABELS, VAR RANGES, MISSING VALUES etc.), numbers can be specified. If the number has a decimal component, specify an explicit decimal point and the actual value regardless of any input definition of implied decimals. For example:
MISSING VALUES RESULT (99.99)

SIR/XS Databases

CASE Structured Databases

CASELESS Databases

Variable Documentation