Database | Tuning and Efficiency
Efficiency in an application is difficult to achieve by tuning after the system is developed. If efficiency is a concern, the best time to consider these issues is at the design stage. The first thing to determine is whether efficiency is a major concern and to identify possible areas where these concerns may arise. For example:
Do not define a large number of record types (MAX REC TYPES) with the intention of using very few, e.g. do not define a MAX REC TYPES of 1,000 just to use a few record types in various ranges (100+, 200+, 300+) to mean something. For maximum efficiency, start record types at 1 and assign numbers sequentially.
The MAX REC COUNT affects the size of each record counter held on the CIR (1, 2 or 4 bytes). So a MAX REC TYPES of 1,000 and a MAX REC COUNT of 1,000,000 would mean 4K of record counts per CIR.
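To make that arithmetic concrete, here is a minimal Python sketch; the 1/2/4-byte thresholds are assumed to follow the integer storage ranges quoted later on this page, and the helper is illustrative rather than SIR/XS code:

    def counter_bytes(max_rec_count):
        # Bytes per record counter, assuming the 1/2/4-byte integer
        # ranges described under Variable Sizes below.
        if max_rec_count <= 123:
            return 1
        if max_rec_count <= 127 * 256:      # roughly 32,000
            return 2
        return 4

    max_rec_types = 1_000
    max_rec_count = 1_000_000
    print(max_rec_types * counter_bytes(max_rec_count))   # 4000 bytes, i.e. ~4K per CIR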
Common vars are retrieved very efficiently regardless of the particular record type being processed. (Note that common vars cannot be used as keys in secondary indexes as they are not physically part of the record.)
Using the default, when a data block fills up it is split, with half of the records staying in the original block and half going to the new block. Fifty percent is a good figure for active databases. It means, however, that as much as fifty percent of the data file may be empty. This may be unacceptable for large databases and for relatively static databases.
The amount of free space is controlled with the loading factor clause on database updating and database creating commands and utilities.
When a database is reloaded or imported, the data is in sequence and the default loading factor is set to .99 to make maximum use of disk space.
The larger the size of the keys, the larger the index. The maximum key size in any record in a database affects the size of the index. If an application has one record type with a much longer key than all the others, try to reduce it if possible. For example, do not have one record type indexed on a 60 character name if all other record types in the database have unique numbers.
The maximum possible size of the database key or any secondary index is 320 characters.
In a series of record types that share higher level keyfields, each of these records stores much of the same key information. It is therefore sensible to minimise the size of these keys.
For example, a text retrieval system might use words to index documents. However, a word can be very long, and storing words as keys for documents is wasteful of space. Assign each word a number, so that the text of the word is stored only once and all other key indexing is through the word number.
SIR/XS compresses string variables by stripping trailing blanks to hold only the data. Specifying a long maximum length for a string variable incurs little overhead provided it is not used in any keys or secondary indexes. Note that the maximum record size is limited to 32K bytes and is tested assuming all strings are at their maximum defined length.
The size of integers depends on the maximum value. One byte holds integers in the range -127 through +123; two bytes hold integers in the range -127*256 through +127*256 (approximately 32,000); four bytes hold integers in the range -127*256*256*256 through +127*256*256*256 (approximately 2,100,000,000).
SIR/XS stores an actual value in a data field to indicate missing values. A variable can have four possible missing values. SIR/XS uses the upper four values on integers for the three missing values that can be specified and the system missing value UNDEFINED.
Real numbers are stored in 4 or 8 bytes.
For example, a variable with an input format of "I3" requires 2 bytes of storage because any value between -99 and 999 in the input field can be input. If this field contains a 2-digit variable with a leading blank or plus sign (+), specify the format as "1X,I2". This saves one byte of storage space in each data record containing this variable since 2-digit variables are stored in a single byte.
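The following Python sketch shows how the possible range of an input format determines the bytes stored, using the integer ranges quoted above (the helper name is hypothetical):

    def bytes_for_range(lo, hi):
        # Bytes needed to store any integer in lo..hi, per the
        # 1/2/4-byte ranges described above.
        if -127 <= lo and hi <= 123:
            return 1
        if -127 * 256 <= lo and hi <= 127 * 256:
            return 2
        return 4

    print(bytes_for_range(-99, 999))   # "I3" can hold -99..999 -> 2 bytes
    print(bytes_for_range(-9, 99))     # "1X,I2" can hold -9..99 -> 1 byte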
For example, a categorical variable might be a list of the names of American
states. If 'Alabama' were the first entry in the list, when 'Alabama' is input,
'1' is stored.
The list is held in the data dictionary and is searched sequentially. It is a very simple and easy to use facility for short lists that are not updated very often. A categorical variable takes one byte (for up to 123 values) or two bytes for longer lists.
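As a toy illustration of the mechanism (a model of the idea in Python, not SIR/XS internals):

    STATES = ["Alabama", "Alaska", "Arizona"]    # the list held in the dictionary

    def encode(value):
        return STATES.index(value) + 1           # sequential search of the list

    def decode(code):
        return STATES[code - 1]

    print(encode("Alabama"))    # 1 is what the record actually stores
    print(decode(1))            # 'Alabama' is what programs and reports see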
Do not use categorical variables if there are hundreds or thousands of entries, if there is more information about each entry than just the name, or if users have to modify the entries; instead, use tables with indexes to store this type of reference data.
Whenever possible use the keys to retrieve records. To retrieve a single record,
specify the whole key. To retrieve a set of records, specify the high level keys
that define the set. Whenever possible, specify the keys as part of the
retrieval statement, rather than retrieving all the records and testing values
in the program.
Define secondary indexes to provide access to subsets of records. Both databases and tables provide automatic secondary indexes.
Consider sorting the input transactions to ensure that any serial processing happens only once.
Consider adding additional indexes or keys to avoid an application having to do serial searches of records to find those of interest.
One common design issue involves processing records after a certain amount of time has elapsed. For example, sending a letter to all patients who have not attended for six months. Consider a secondary index by date for planned future attendances. Update this at the time the visit data is updated and then the system can process by date rather than serially searching.
For the most efficient processing, sort the data for a batch data input run into the same sequence as the database key: sort on case id, then record type, then the key fields within record type. This way, the batch data input can be accomplished by an almost sequential processing of the database.
It is efficient to process the records by record type. Each time a new record type is processed, the description of that record type must be loaded from the dictionary. If different record types are processed together and there are multiple records for each case, this saves accesses to the case block but requires multiple accesses to the dictionary. The most efficient processing depends on the exact mix of input.
When a record is added, it is stored on the correct block in sequence. This means that a block can get full. If this happens SIR/XS creates a second block to store the additional data, and creates another entry in the index. New blocks are created as necessary. A new block is either allocated from existing available blocks or from new blocks at the end of the file. Blocks become available if the data on them is deleted.
SIR/XS holds the case id, record number and key fields as the key. All keys are the same length, which is either the maximum length of a defined key or the MAX KEY SIZE specified.
At the lowest level, an index consists of a key and a pointer to the data block that has that key as its lowest value. At the highest level there is a single Master Index block; this contains a key and a pointer to the index block that has that key as its lowest value. If necessary, because of the size of a database, there may be further index levels between the Master Index and the lowest level index. When a key is specified, SIR/XS follows the master index (and any intermediate index levels) down to the lowest level index block corresponding to the value given, and retrieves the data block it points to.
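A minimal Python sketch of this descent (a generic model of a multi-level index, not the SIR/XS implementation):

    from bisect import bisect_right

    def lookup(levels, data_blocks, key):
        # levels: master level first; each level is a list of index blocks,
        # each block a sorted list of (lowest_key, child_block_number).
        block_no = 0                          # start at the single master block
        for level in levels:
            block = level[block_no]
            keys = [entry[0] for entry in block]
            i = max(bisect_right(keys, key) - 1, 0)
            block_no = block[i][1]            # descend to the child block
        return data_blocks[block_no]          # data block that should hold key

    data = [[1, 3], [5, 7], [9, 11], [13, 15]]
    low = [[(1, 0), (5, 1)], [(9, 2), (13, 3)]]      # two low level index blocks
    master = [[(1, 0), (9, 1)]]                      # one master index block
    print(lookup([master, low], data, 10))           # -> [9, 11]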
When a block is created on disk, it is assigned a number known as the PRU or physical record unit that can then be used to retrieve the block directly. In operating system terms, a SIR block consists of multiple physical disk blocks since most operating systems write in fixed blocks.
Data block size depends on the size of the records defined. If there are no very long records, SIR/XS uses one block (i.e. 2K). SIR/XS tries to allocate a block size that is big enough to hold 4 of the largest records; the largest block size is 32K. If the maximum record length is between 512 bytes and 8K bytes, SIR/XS allocates a block size between 2K and 32K. A data record is held in one block; that is, a record is not split across blocks, so the maximum size for a single record is 32K.
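A sketch of that rule in Python; the clamping follows the figures above, while the rounding to a multiple of the 2K minimum is an assumption (the text states this explicitly only for index blocks):

    MIN_BLOCK, MAX_BLOCK = 2 * 1024, 32 * 1024

    def data_block_size(max_record_bytes):
        # Big enough for 4 of the largest records, clamped to 2K..32K.
        needed = 4 * max_record_bytes
        clamped = max(MIN_BLOCK, min(MAX_BLOCK, needed))
        blocks = -(-clamped // MIN_BLOCK)          # ceiling division
        return blocks * MIN_BLOCK

    print(data_block_size(300))     # short records -> the 2K minimum
    print(data_block_size(3000))    # 12288: between 2K and 32K
    print(data_block_size(8192))    # the 32K ceiling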
A loading factor can be specified on a retrieval update, a batch data input or on a utility update run such as RELOAD. This affects the way a full block is split. The factor is a number between 0 and 1 and the default is .5 on updates and .99 on imports and reloads. The most efficient database is one where each block is loaded to the maximum since this minimises the amount of disk space used and makes retrievals more efficient by
reducing the number of disk I/Os. However, a high loading factor for existing blocks can affect the ways that updates work. To take some examples:
Suppose that a Retrieval Update adds records randomly using a loading factor of .8. At some point a block becomes full. The record being added at that time is inserted in the correct place, and 20% of the space on that block is made available by copying those records to a new block and entering that block into the index. If further data belonging in the original block is added (i.e. with a key lower than an existing record in that block, or lower than the lowest key in the next block), it again becomes full and is again split, with the next new block again taking 20%. Thus it is possible under some sequences of updates that many new blocks are only 20% full. If the loading factor were higher, the result would be even worse: adding data in reverse key sequence with a high loading factor would produce very poor block usage.
The best loading factor depends on the nature of the activity at the time. In general, adding in sequence at the end of the database is best served by factors nearer to 1. Randomly adding data throughout the database is best served by having enough space available for the inserts to work without splitting blocks and, without specific knowledge as to the sequence of updates, a loading factor of .5 should be used.
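The trade-off can be demonstrated with a toy simulation in Python (an idealised model of block splitting, not SIR/XS's actual algorithm; the capacity figure is arbitrary):

    import random

    CAP = 50                                  # records per block (toy figure)

    def simulate(keys, lf):
        blocks = [[]]                         # blocks kept in key order
        for k in keys:
            i = 0                             # find the block covering this key
            while i + 1 < len(blocks) and blocks[i + 1][0] <= k:
                i += 1
            blocks[i].append(k)
            blocks[i].sort()
            if len(blocks[i]) > CAP:          # full: split, keeping `lf` of the
                keep = max(1, int(CAP * lf))  # records in the original block
                blocks.insert(i + 1, blocks[i][keep:])
                blocks[i] = blocks[i][:keep]
        return sum(map(len, blocks)) / (len(blocks) * CAP)

    keys = list(range(5000))
    rand = random.sample(keys, len(keys))
    print(f"sequential adds, lf .99: {simulate(keys, .99):.0%} full")  # near full
    print(f"random adds,     lf .99: {simulate(rand, .99):.0%} full")  # poor usage
    print(f"random adds,     lf .50: {simulate(rand, .50):.0%} full")  # better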
SIR/XS uses .5 as a default for updates and .99 as a default for reloads and imports.
The actual, exact loading is reported by the VERIFY FILE command.
Each key has the same length - its length is either defined implicitly in the schema or by the MAX KEY SIZE command.
The index holds the key of the lowest record in each data block. An index block holds 'n' entries depending on the size of the key. If normal size keys are specified, say up to about 80 bytes, SIR/XS uses the minimum 2K block size; after that SIR/XS increases the index block size. The index block size is always a multiple of the minimum size. If there are very large keys or a very high number of data blocks, SIR/XS increases the size of the index block to cope with this.
To illustrate the way index levels work, assume there are 80 keys per index block. One index block can point to 80 other blocks. With up to 80 blocks of data there are only two index blocks: the master index and one low level index block, and the master index has only one entry. With 81 to 160 data blocks there are three index blocks: the master index with two entries, one index block for the first 80 data blocks and a second for the next 80. This continues until there are 80*80 data blocks, 80 index blocks and one master index block with 80 entries. When the next data block is added, one of the low level index blocks is used to create two new low level blocks; the original low level index block is now a third level index that contains just two entries pointing to the two new low level indexes. As records are added, indexes split as necessary. The third level takes the index capacity up to 80*80*80 data blocks. This process continues as necessary.
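The level count implied by this growth pattern can be computed directly; a small Python sketch (illustrative only):

    def index_levels(data_blocks, keys_per_block=80):
        # Total index levels, counting the master index: n levels of
        # index address keys_per_block**n data blocks.
        levels = 2                    # always a master plus one low level
        while keys_per_block ** levels < data_blocks:
            levels += 1
        return levels

    print(index_levels(80))           # 2
    print(index_levels(6_400))        # 80*80 data blocks still fit 2 levels
    print(index_levels(6_401))        # 3: the third level appears
    print(index_levels(36 ** 6, 36))  # 6: the six-level example given below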
At no point in time is there any major overhead or any need to reorganise the database, assuming that none of the limits specified in the schema definition are reached.
Each secondary index is physically very similar to a standard database. It contains index blocks and data blocks. The sizes of these blocks are calculated in a similar way to the block size calculations for standard database blocks to ensure reasonably efficient processing given the size of the secondary index key and the maximum number of records of that type. Each index potentially has different block sizes.
Each record in the data block in a secondary index has the secondary key as the key and contains the standard database key as the data. Thus the size of these data blocks is affected by the size of both keys.
The following worked example refers to the LIST STATS output listed at the bottom of this page.
First, the data blocks. CIR size = 5 words plus key 2 = 7; 254 divided by 7 = 36 per block; 1,000 cases in the database means 1,000 CIRs; 1,000 divided by 36 means 28 blocks. Repeating the calculation for each record type:

    CIR   = 28 blocks
    Rec1  = 40 blocks
    Rec2  = 32 blocks
    Total = 100 blocks

This shows how many data blocks the database requires and the record types that use the space.
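The same arithmetic in Python (the figures are the example's; the helper name is illustrative):

    import math

    BLOCK_WORDS, KEY_WORDS = 254, 2           # from LIST STATS and step 2

    def blocks_for(record_words, record_count):
        per_block = BLOCK_WORDS // (record_words + KEY_WORDS)
        return math.ceil(record_count / per_block)

    print(blocks_for(5, 1000))                # CIR: 254 // 7 = 36 per block -> 28 blocks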
Next, the index blocks. For example, if Max Index Entries is 40 and there are 100,000 data blocks: bottom level = 100,000/40 = 2,500; second level = 2,500/40 = 63 (rounding up); third level = 63/40 = 2. Total required: 2,500 + 63 + 2 + 1 = 2,566.
This gives how many index blocks are needed. To translate these two figures into
physical disk blocks or megabytes on a particular operating system, multiply by
the appropriate factors:
Take the data block size and index block size from LIST STATS, in double words, and convert to physical blocks or bytes.
This is how much space the data and indexes are going to take for a database.
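The index arithmetic above generalises to the following Python sketch (illustrative only):

    import math

    def index_blocks(data_blocks, entries_per_block):
        # Sum each index level, dividing by the fan-out and rounding up,
        # until a single master index block remains.
        total, level = 0, data_blocks
        while level > 1:
            level = math.ceil(level / entries_per_block)
            total += level
        return total

    print(index_blocks(100_000, 40))    # 2500 + 63 + 2 + 1 = 2566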
The Database Index
The index to the data records in a database is built from the key variables. The key of the first record in each data block is in the index. This means that there is some redundancy between data in the records and data in the index.

Variable Sizes
It takes more space to store strings than integers. Whenever there is a choice, storing a number is more space efficient. If a string has a defined set of values, either define the variable as an integer and assign value labels, or define it as a categorical variable.
Schema Specifications
SIR/XS assigns internal formats according to the external format of the data defined in the schema. Disk space can be saved by a careful choice of schema specifications.

VAR RANGES
Specify VAR RANGES if the variable has a narrower range of values than given by the number of digits. The value is used to calculate the minimum number of bytes needed to store the data on disk. For example, specifying a VAR RANGE of -99 to +99 on a variable where 3 input columns are allowed saves space. Consider a potential saving of disk space by defining a proper VAR RANGE.

CATEGORICAL
Categorical variables offer an efficient way to store strings that are predefined. A categorical variable is a character string that has a limited number of values specified as an ordered list. When the data is input as a string, it is compared to the list and the number that corresponds to the matching position in the list is stored instead of the value. This has the advantage that only valid entries are held and considerable space is saved. In programs and reports, the full string is displayed and retrieved.

SCALED VARS
SCALED VARS stores numbers as integers when they have a predefined number of decimal places. This is more efficient than using floating point R*8 and can be more accurate than R*4.

For example, suppose a variable XPCT that holds a percentage and can have a range of 0 through 100 and a precision of one decimal point: define XPCT as an integer with an "I4" input format and specify SCALED VARS XPCT (-1). On input, supply the data as a number that includes the physical decimal point, i.e. 10.3, 40.0. The XPCT scaled integer only requires two bytes of storage (since the largest number physically stored is 1000, i.e. 100.0).

If the precision for the percentage example were 2 digits after the decimal point, specify an input format of I6 (nnn.nn), allowing for the decimal point and 5 numbers, and specify VAR RANGES (0.00,100.00), which tells SIR/XS that 2 bytes are sufficient.

Even more storage may be saved with SCALED VARS on numbers that are very small but have only a small number of significant digits. For example, the specific gravity of fluids in the human body (blood, urine, etc.) is often measured with a 3 digit precision. To maintain precision in floating point, specify an INPUT FORMAT of D4.3; SIR/XS would use 8 bytes of storage because of the precision. If this variable is instead read as an "I2" integer with SCALED VARS (-3), 6 bytes are saved per value and accuracy is preserved. (The variable can hold values up to 32.763, which is ample for an S.G. measurement.)
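A short Python sketch of the scaled-integer arithmetic above (illustrative only; the byte thresholds follow the integer ranges quoted earlier):

    def stored_integer(value, scale):
        # SCALED VARS (scale): scale -1 stores tenths, -3 thousandths.
        return round(value * 10 ** -scale)

    def bytes_needed(n):
        return 1 if n <= 123 else 2 if n <= 127 * 256 else 4

    print(stored_integer(100.0, -1))                  # XPCT 100.0 -> 1000
    print(bytes_needed(stored_integer(100.0, -1)))    # fits in 2 bytes
    print(stored_integer(1.025, -3))                  # S.G. 1.025 -> 1025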
Processing Efficiency
Disk Input/Output (I/O) is the most time consuming operation on a computer and
retrievals should be designed to minimise I/O.

Using Keys
The index is used to retrieve records whenever keys are specified in a retrieval
statement. In case structured databases, the index is ordered by case, record
type and by the key variables. In caseless databases, the index is ordered by
record type and then key variables.

Efficient On-Line Access
If the key values are known, then data can be retrieved efficiently. Without
keys, an alternative access route is needed. Doing a serial search for
particular records on-line, without knowing the high level keys is a slow
process.

Efficient Batch Processing
Batch processing (the running of jobs in a non-interactive way) typically means that a user is not at a terminal waiting for the job to finish. Processing speed tends not to be of the same concern as it is for on-line access: a process that takes 2 minutes as compared to 1 minute is unlikely to worry anyone. However, there may be some concern when processing thousands of transactions in runs that last for hours.

Efficiency in Batch Data Input
Batch data input is the loading of data from files into the database through the
batch data input utilities. This can be done interactively or in batch mode.

Database Internal Structure
The records in a SIR/XS database are stored in a direct access file with an internal B-tree (indexed sequential) index. The database contains two types of blocks: data blocks and index blocks. Data blocks contain the data records; index blocks contain the information needed to access any record in the database. Blocks can be in any sequence on the disk. Within one block, records are held in sequence. The first record in each block is indexed.

Pay attention to the size of the key. A key is held for each data block in the index and a key is held for each record in each data block. Defining a very large key for one record type impacts the overall database size, regardless of the number of occurrences of that record type.

Block Organisation
SIR/XS calculates the size of data blocks and index blocks for a particular database, based on key size, maximum record size and maximum numbers of records, when it first puts any data into the database. The block size is between a minimum and a maximum (from 2K bytes to 32K bytes on all current systems). The data blocks and index blocks in a database may be different sizes, though in a particular database all data blocks are the same size and all index blocks are the same size.

The LIST STATS command gives information about the database, including the 'INDEX/DATA BLOCK SIZE'. This gives the sizes of the SIR/XS index and data blocks. Sizes are given in double words - eight bytes on current systems. Sizes do not include the control information SIR/XS holds on each block. A logical block of 2K is 256 double words; a typical size for logical blocks with small keys and small data records would be 253/254.

Data Blocks
Records are stored in blocks in the order of the keys:

    case 1    CIR of case 1
              records of type 1 within case 1
              records of type 2 within case 1
              ...
    case 2    CIR of case 2
              records of type 1 within case 2
              records of type 2 within case 2
              ...
    case 3    CIR of case 3
    ...
SIR/XS holds all of the records in a data block in sequence. To do this it constructs an extra key area at the beginning of each record and holds the keys there, separately from the data. All record key areas are the same length, which is the same as the keys held for the index.

Loading Factor
When SIR/XS needs to insert data that does not fit in the original data blocks, it creates a new block and splits the original data, leaving some space on each block. The amount of space left on a block when it is split is determined by the "Loading Factor".

Example Loading Factor Effects
If a loading factor of .99 is specified on a RELOAD, then all the blocks are approximately full. Suppose that Batch Data Input is then used to add a large amount of data at the end of the database, again with a .99 loading factor. Again all blocks are approximately full. (Blocks have to hold whole records and each record is a different length, so when a record does not fit into a block a certain amount of space is left free. This space varies from block to block.)

The loading is reported by the VERIFY FILE command; the number reported gives an average over all blocks in the database. SIR/XS does not split records across data blocks and each block contains complete records only. A block contains a mix of records. For example, an 80% full block in the EMPLOYEE database might contain data for 3 or 4 employees and as such might have, say, 4 CIRs, 4 Employee records, 7 Position records and 13 Review records. Thus the exact loading of the block depends on the exact mix of records.

Index Blocks
The key is comprised of the case id, the record type number and the key field values.

Index Levels
There are always at least two levels of index: a Master Index, which is a single index block, and a low level index. There may be up to 6 levels of index. A six level index can point to the number of keys in one index block raised to the power of 6; for example, with 36 keys in a block, a six level index copes with over 2,000,000,000 data blocks.

Secondary Indexes
All secondary indexes are held on a separate database file (.sr6). This is created when the first index is created and deleted if the last index is deleted.

Size Estimating
Once records have been added to the database, each physical data block contains
a number of different record types. For size estimation, calculating the number
of data blocks each record type would take gives a reasonable estimate of disk
space requirements.
In addition, space is required
for the dictionary and procedures but typically these are relatively small
requirements. See Procedure File
for further details on managing a procedure file.

The following steps refer to the LIST STATS output listed at the bottom of this page. First calculate the data size:

1. From LIST STATS, find the data block size in double words. For example, 254.

2. From LIST STATS, find the keysize in bytes. Convert this to double words by dividing by eight and rounding up. For example, keysize 12 is 2 double words.

3. Take the size of the record in words from LIST STATS and add the keysize from step 2 to find the space this record type takes. For example, "Size in Words" 8 plus key size 2 equals 10.

4. Divide the data block size by this figure to get records per block, then divide the number of records of each type by the records per block to get the data blocks needed for that record type.
Next calculate the index size:

1. Take Max Entries Per Index Block from LIST STATS. For example, 126.

2. Divide the number of data blocks by this figure, rounding up, to get the number of bottom level index blocks; repeat for each higher level until a single master index block remains. For example, if Max Index Entries is 40 and there are 100,000 data blocks, bottom level = 100,000/40 = 2,500.

To translate these figures into physical disk blocks or megabytes on a particular operating system, take the data block size and index block size from LIST STATS, in double words, and convert to physical blocks or bytes. There is a small overhead on each physical block such that the reported size is smaller than the real physical size. For example, a data block of 254 double words is 2K (which is four 512 byte blocks on some Windows file systems); 1,000 data blocks of 254 double words would take approximately 2 megabytes of disk space.
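The conversion, in Python (illustrative arithmetic only):

    data_blocks = 1_000
    physical_bytes = 2 * 1024                  # a 254 double word block occupies 2K
    print(data_blocks * physical_bytes)        # 2,048,000 bytes, roughly 2 megabytes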
Index PTBYNAME Verified - Entries 1852
Verify database statistics
---------------------------
Data records on database 60408
Cases on database 1852
Index blocks read 16
Data blocks read 1467
Average index block loading 0.91
Average data block loading 0.97
Warning messages 0
Correctable errors 0
Non-correctable errors 0
Verification complete with no errors
Statistics for HEART
Database name HEART
Creation Date/Time Dec 13, 2005 11:55:40
Last update Date/Time Dec 13, 2005 11:57:15
Update level 1
Average Records per Case 25000
Max/Current Number of Cases 75000/1852
Max/Current Number of Records 1875000000/60408
Max/Current Number of Record Types 117/85
Maximum Input Columns/Lines 136/30
Rectype Columns 1-3
Journaling ON
Encryption ON
Case Id Variable SSNUM (A)
Number of Index Levels 2
Max Entries Per Index Block 101
Index/Data Block Size 1011/1529
Active/Inactive Data Blocks 1467/0
Active/Inactive Index Blocks 16/0
Keysize In Bytes 72
Min/Max Record Size 1/346
Number of Temporary Variables 0
Maximum Number of Data Variables 958
Record  Record Name       Number of  Maximum   Total In  Size In  Entry Use
No.                       Variables  Per Case  Database  Words    Count
------  ----------------  ---------  --------  --------  -------  ---------
     0  CIR                       3         1      1852       36          1
     1  DEMO                     89         1      1852       96          1
     2  HOSP                    292       100      2124      140          1
     3  CLINPRES                958       100      1266      346          1
     4  CATH                    530       100       399      213          1
......