Database | Tuning and Efficiency
Efficiency in an application is difficult to achieve by tuning after the system is developed. If efficiency is a concern, the best time to consider these issues is at the design stage. The first thing to determine is whether efficiency is a major concern and to identify possible areas where these concerns may arise. For example:
Do not define a large number of record types (MAX REC TYPES) with the intention of using very few, e.g. do not define a MAX REC TYPES of 1,000 just to use a few record types in various ranges (100+, 200+, 300+) to mean something. For maximum efficiency, start record types at 1 and assign numbers sequentially.
The MAX REC COUNT affects the size of each record counter held on the CIR (1, 2 or 4 bytes). So a MAX REC TYPES of 1,000 and a MAX REC COUNT of 1,000,000 would mean 4K of record counts per CIR.
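To make that arithmetic concrete, here is a minimal Python sketch; the 1/2/4-byte thresholds are assumed to follow the integer storage ranges quoted later on this page, and the helper is illustrative rather than SIR/XS code:

    def counter_bytes(max_rec_count):
        # Bytes per record counter, assuming the 1/2/4-byte integer
        # ranges described under Variable Sizes below.
        if max_rec_count <= 123:
            return 1
        if max_rec_count <= 127 * 256:      # roughly 32,000
            return 2
        return 4

    max_rec_types = 1_000
    max_rec_count = 1_000_000
    print(max_rec_types * counter_bytes(max_rec_count))   # 4000 bytes, i.e. ~4K per CIR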
Common vars are retrieved very efficiently regardless of the particular record type being processed. (Note that common vars cannot be used as keys in secondary indexes as they are not physically part of the record.)
Using the default, when a data block fills up it is split, with half of the records staying in the original block and half going to the new block. Fifty percent is a good figure for active databases. It means, however, that as much as fifty percent of the data file may be empty. This may be unacceptable for large databases and for relatively static databases.
The amount of free space is controlled with the loading factor clause on database updating and database creating commands and utilities.
When a database is reloaded or imported, the data is in sequence and the default loading factor is set to .99 to make maximum use of disk space.
The larger the size of the keys, the larger the index. The maximum key size in any record in a database affects the size of the index. If an application has one record type with a much longer key than all the others, try to reduce it if possible. For example, do not have one record type indexed on a 60 character name if all other record types in the database have unique numbers.
The maximum possible size of the database key or any secondary index is 320 characters.
In a series of record types that share higher level keyfields, each of these records stores much of the same key information. It is therefore sensible to minimise the size of these keys.
For example, a text retrieval system might use words to index documents. However, a word can be very long, and storing words as keys for documents is wasteful of space. Assign each word a number, so that the text of the word is stored only once and all other key indexing is through the word number.
SIR/XS compresses string variables by stripping trailing blanks to hold only the data. Specifying a long maximum length for a string variable incurs little overhead provided it is not used in any keys or secondary indexes. Note that the maximum record size is limited to 32K bytes and is tested assuming all strings are at their maximum defined length.
The size of integers depends on the maximum value. One byte holds integers in the range -127 through +123; two bytes hold integers in the range -127*256 through +127*256 (approximately 32,000); four bytes hold integers in the range -127*256*256*256 through +127*256*256*256 (approximately 2,100,000,000).
SIR/XS stores an actual value in a data field to indicate missing values. A variable can have four possible missing values. SIR/XS uses the upper four values on integers for the three missing values that can be specified and the system missing value UNDEFINED.
Real numbers are stored in 4 or 8 bytes.
For example, a variable with an input format of "I3" requires 2 bytes of storage because any value between -99 and 999 in the input field can be input. If this field contains a 2-digit variable with a leading blank or plus sign (+), specify the format as "1X,I2". This saves one byte of storage space in each data record containing this variable since 2-digit variables are stored in a single byte.
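The following Python sketch shows how the possible range of an input format determines the bytes stored, using the integer ranges quoted above (the helper name is hypothetical):

    def bytes_for_range(lo, hi):
        # Bytes needed to store any integer in lo..hi, per the
        # 1/2/4-byte ranges described above.
        if -127 <= lo and hi <= 123:
            return 1
        if -127 * 256 <= lo and hi <= 127 * 256:
            return 2
        return 4

    print(bytes_for_range(-99, 999))   # "I3" can hold -99..999 -> 2 bytes
    print(bytes_for_range(-9, 99))     # "1X,I2" can hold -9..99 -> 1 byte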
For example, a categorical variable might be a list of the names of American
states. If 'Alabama' were the first entry in the list, when 'Alabama' is input,
'1' is stored.
The list is held in the data dictionary and is searched sequentially. It is a very simple and easy to use facility for short lists that are not updated very often. A categorical variable takes one byte (for up to 123 values) or two bytes for longer lists.
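As a toy illustration of the mechanism (a model of the idea in Python, not SIR/XS internals):

    STATES = ["Alabama", "Alaska", "Arizona"]    # the list held in the dictionary

    def encode(value):
        return STATES.index(value) + 1           # sequential search of the list

    def decode(code):
        return STATES[code - 1]

    print(encode("Alabama"))    # 1 is what the record actually stores
    print(decode(1))            # 'Alabama' is what programs and reports see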
Do not use categorical variables if there are hundreds or thousands of entries, if there is more information about each entry than just the name, or if users have to modify the entries; instead, use tables with indexes to store this type of reference data.
Whenever possible use the keys to retrieve records. To retrieve a single record,
specify the whole key. To retrieve a set of records, specify the high level keys
that define the set. Whenever possible, specify the keys as part of the
retrieval statement, rather than retrieving all the records and testing values
in the program.
Define secondary indexes to provide access to subsets of records. Both databases and tables provide automatic secondary indexes.
Consider sorting the input transactions to ensure that any serial processing happens only once.
Consider adding additional indexes or keys to avoid an application having to do serial searches of records to find those of interest.
One common design issue involves processing records after a certain amount of time has elapsed. For example, sending a letter to all patients who have not attended for six months. Consider a secondary index by date for planned future attendances. Update this at the time the visit data is updated and then the system can process by date rather than serially searching.
For the most efficient processing, sort the data for a batch data input run into the same sequence as the database key: sort on case id, then record type, then the key fields within record type. This way, the batch data input can be accomplished by an almost sequential processing of the database.
It is efficient to process the records by record type. Each time a new record type is processed, the description of that record type must be loaded from the dictionary. If different record types are processed together and there are multiple records for each case, this saves accesses to the case block but requires multiple accesses to the dictionary. The most efficient processing depends on the exact mix of input.
When a record is added, it is stored on the correct block in sequence. This means that a block can get full. If this happens SIR/XS creates a second block to store the additional data, and creates another entry in the index. New blocks are created as necessary. A new block is either allocated from existing available blocks or from new blocks at the end of the file. Blocks become available if the data on them is deleted.
SIR/XS holds the case id, record number and key fields as the key. All keys are the same length, which is either the maximum length of a defined key or the MAX KEY SIZE specified.
At the lowest level, an index consists of a key and a pointer to the data block that has that key as its lowest value. At the highest level there is a single Master Index block; this contains a key and a pointer to the index block that has that key as its lowest value. If necessary, because of the size of a database, there may be further index levels between the Master Index and the lowest level index. When a key is specified, SIR/XS follows the master index (and any intermediate index levels) down to the lowest level index block corresponding to the value given, and retrieves the data block it points to.
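A minimal Python sketch of this descent (a generic model of a multi-level index, not the SIR/XS implementation):

    from bisect import bisect_right

    def lookup(levels, data_blocks, key):
        # levels: master level first; each level is a list of index blocks,
        # each block a sorted list of (lowest_key, child_block_number).
        block_no = 0                          # start at the single master block
        for level in levels:
            block = level[block_no]
            keys = [entry[0] for entry in block]
            i = max(bisect_right(keys, key) - 1, 0)
            block_no = block[i][1]            # descend to the child block
        return data_blocks[block_no]          # data block that should hold key

    data = [[1, 3], [5, 7], [9, 11], [13, 15]]
    low = [[(1, 0), (5, 1)], [(9, 2), (13, 3)]]      # two low level index blocks
    master = [[(1, 0), (9, 1)]]                      # one master index block
    print(lookup([master, low], data, 10))           # -> [9, 11]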
When a block is created on disk, it is assigned a number known as the PRU or physical record unit that can then be used to retrieve the block directly. In operating system terms, a SIR block consists of multiple physical disk blocks since most operating systems write in fixed blocks.
Data block size depends on the size of the records defined. If there are no very long records, SIR/XS uses one block (i.e. 2K). SIR/XS tries to allocate a block size that is big enough to hold 4 of the largest records; the largest block size is 32K. If the maximum record length is between 512 bytes and 8K bytes, SIR/XS allocates a block size between 2K and 32K. A data record is held in one block; that is, a record is not split across blocks, so the maximum size for a single record is 32K.
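A sketch of that rule in Python; the clamping follows the figures above, while the rounding to a multiple of the 2K minimum is an assumption (the text states this explicitly only for index blocks):

    MIN_BLOCK, MAX_BLOCK = 2 * 1024, 32 * 1024

    def data_block_size(max_record_bytes):
        # Big enough for 4 of the largest records, clamped to 2K..32K.
        needed = 4 * max_record_bytes
        clamped = max(MIN_BLOCK, min(MAX_BLOCK, needed))
        blocks = -(-clamped // MIN_BLOCK)          # ceiling division
        return blocks * MIN_BLOCK

    print(data_block_size(300))     # short records -> the 2K minimum
    print(data_block_size(3000))    # 12288: between 2K and 32K
    print(data_block_size(8192))    # the 32K ceiling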
A loading factor can be specified on a retrieval update, a batch data input or on a utility update run such as RELOAD. This affects the way a full block is split. The factor is a number between 0 and 1 and the default is .5 on updates and .99 on imports and reloads. The most efficient database is one where each block is loaded to the maximum since this minimises the amount of disk space used and makes retrievals more efficient by
reducing the number of disk I/Os. However, a high loading factor for existing blocks can affect the ways that updates work. To take some examples:
Suppose that a Retrieval Update adds records randomly using a loading factor of .8. At some point a block becomes full. The record being added at that time is inserted in the correct place, and 20% of the space on that block is made available by copying those records to a new block and entering that block into the index. If further data belonging in the original block is added (i.e. with a key lower than an existing record in that block, or lower than the lowest key in the next block), it again becomes full and is again split, with the next new block again taking 20%. Thus it is possible under some sequences of updates that many new blocks are only 20% full. If the loading factor were higher, the result would be even worse: adding data in reverse key sequence with a high loading factor would produce very poor block usage.
The best loading factor depends on the nature of the activity at the time. In general, adding in sequence at the end of the database is best served by factors nearer to 1. Randomly adding data throughout the database is best served by having enough space available for the inserts to work without splitting blocks and, without specific knowledge as to the sequence of updates, a loading factor of .5 should be used.
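The trade-off can be demonstrated with a toy simulation in Python (an idealised model of block splitting, not SIR/XS's actual algorithm; the capacity figure is arbitrary):

    import random

    CAP = 50                                  # records per block (toy figure)

    def simulate(keys, lf):
        blocks = [[]]                         # blocks kept in key order
        for k in keys:
            i = 0                             # find the block covering this key
            while i + 1 < len(blocks) and blocks[i + 1][0] <= k:
                i += 1
            blocks[i].append(k)
            blocks[i].sort()
            if len(blocks[i]) > CAP:          # full: split, keeping `lf` of the
                keep = max(1, int(CAP * lf))  # records in the original block
                blocks.insert(i + 1, blocks[i][keep:])
                blocks[i] = blocks[i][:keep]
        return sum(map(len, blocks)) / (len(blocks) * CAP)

    keys = list(range(5000))
    rand = random.sample(keys, len(keys))
    print(f"sequential adds, lf .99: {simulate(keys, .99):.0%} full")  # near full
    print(f"random adds,     lf .99: {simulate(rand, .99):.0%} full")  # poor usage
    print(f"random adds,     lf .50: {simulate(rand, .50):.0%} full")  # better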
SIR/XS uses .5 as a default for updates and .99 as a default for reloads and imports.
The actual, exact loading is reported by the VERIFY FILE command.
Each key has the same length - its length is either defined implicitly in the schema or by the MAX KEY SIZE command.
The index holds the key of the lowest record in each data block. An index block holds 'n' entries depending on the size of the key. If normal size keys are specified, say up to about 80 bytes, SIR/XS uses the minimum 2K block size; after that SIR/XS increases the index block size. The index block size is always a multiple of the minimum size. If there are very large keys or a very high number of data blocks, SIR/XS increases the size of the index block to cope with this.
To illustrate the way index levels work, assume there are 80 keys per index block. One index block can point to 80 other blocks. With up to 80 blocks of data there are only two index blocks: the master index and one low level index block, and the master index has only one entry. With 81 to 160 data blocks there are three index blocks: the master index with two entries, one index block for the first 80 data blocks and a second for the next 80. This continues until there are 80*80 data blocks, 80 index blocks and one master index block with 80 entries. When the next data block is added, one of the low level index blocks is used to create two new low level blocks; the original low level index block is now a third level index that contains just two entries pointing to the two new low level indexes. As records are added, indexes split as necessary. The third level takes the index capacity up to 80*80*80 data blocks. This process continues as necessary.
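The level count implied by this growth pattern can be computed directly; a small Python sketch (illustrative only):

    def index_levels(data_blocks, keys_per_block=80):
        # Total index levels, counting the master index: n levels of
        # index address keys_per_block**n data blocks.
        levels = 2                    # always a master plus one low level
        while keys_per_block ** levels < data_blocks:
            levels += 1
        return levels

    print(index_levels(80))           # 2
    print(index_levels(6_400))        # 80*80 data blocks still fit 2 levels
    print(index_levels(6_401))        # 3: the third level appears
    print(index_levels(36 ** 6, 36))  # 6: the six-level example given below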
At no point in time is there any major overhead or any need to reorganise the database, assuming that none of the limits specified in the schema definition are reached.
Each secondary index is physically very similar to a standard database. It contains index blocks and data blocks. The sizes of these blocks are calculated in a similar way to the block size calculations for standard database blocks to ensure reasonably efficient processing given the size of the secondary index key and the maximum number of records of that type. Each index potentially has different block sizes.
Each record in the data block in a secondary index has the secondary key as the key and contains the standard database key as the data. Thus the size of these data blocks is affected by the size of both keys.
The following worked example refers to the LIST STATS output listed at the bottom of this page.
First, the data blocks. CIR size = 5 words plus key 2 = 7; 254 divided by 7 = 36 per block; 1,000 cases in the database means 1,000 CIRs; 1,000 divided by 36 means 28 blocks. Repeating the calculation for each record type:

    CIR   = 28 blocks
    Rec1  = 40 blocks
    Rec2  = 32 blocks
    Total = 100 blocks

This shows how many data blocks the database requires and the record types that use the space.
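The same arithmetic in Python (the figures are the example's; the helper name is illustrative):

    import math

    BLOCK_WORDS, KEY_WORDS = 254, 2           # from LIST STATS and step 2

    def blocks_for(record_words, record_count):
        per_block = BLOCK_WORDS // (record_words + KEY_WORDS)
        return math.ceil(record_count / per_block)

    print(blocks_for(5, 1000))                # CIR: 254 // 7 = 36 per block -> 28 blocks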
Next, the index blocks. For example, if Max Index Entries is 40 and there are 100,000 data blocks: bottom level = 100,000/40 = 2,500; second level = 2,500/40 = 63 (rounding up); third level = 63/40 = 2. Total required: 2,500 + 63 + 2 + 1 = 2,566.
This gives how many index blocks are needed. To translate these two figures into
physical disk blocks or megabytes on a particular operating system, multiply by
the appropriate factors:
Take the data block size and index block size from LIST STATS, in double words, and convert to physical blocks or bytes.
This is how much space the data and indexes are going to take for a database.
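The index arithmetic above generalises to the following Python sketch (illustrative only):

    import math

    def index_blocks(data_blocks, entries_per_block):
        # Sum each index level, dividing by the fan-out and rounding up,
        # until a single master index block remains.
        total, level = 0, data_blocks
        while level > 1:
            level = math.ceil(level / entries_per_block)
            total += level
        return total

    print(index_blocks(100_000, 40))    # 2500 + 63 + 2 + 1 = 2566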
The Database Index
The index to the data records in a database is built from the key variables. The key of the first record in each data block is in the index. This means that there is some redundancy between data in the records and data in the index.

Variable Sizes
It takes more space to store strings than integers. Whenever there is a choice, storing a number is more space efficient. If a string has a defined set of values, either define the variable as an integer and assign value labels, or define it as a categorical variable.
Schema Specifications
SIR/XS assigns internal formats according to the external format of the data defined in the schema. Disk space can be saved by a careful choice of schema specifications.

VAR RANGES
Specify VAR RANGES if the variable has a narrower range of values than given by the number of digits. The value is used to calculate the minimum number of bytes needed to store the data on disk. For example, specifying a VAR RANGE of -99 to +99 on a variable where 3 input columns are allowed saves space. Consider a potential saving of disk space by defining a proper VAR RANGE.

CATEGORICAL
Categorical variables offer an efficient way to store strings that are predefined. A categorical variable is a character string that has a limited number of values specified as an ordered list. When the data is input as a string, it is compared to the list and the number that corresponds to the matching position in the list is stored instead of the value. This has the advantage that only valid entries are held and considerable space is saved. In programs and reports, the full string is displayed and retrieved.

SCALED VARS
SCALED VARS stores numbers as integers when they have a predefined number of decimal places. This is more efficient than using floating point R*8 and can be more accurate than R*4.

For example, suppose a variable XPCT that holds a percentage and can have a range of 0 through 100 and a precision of one decimal point: define XPCT as an integer with an "I4" input format and specify SCALED VARS XPCT (-1). On input, supply the data as a number that includes the physical decimal point, i.e. 10.3, 40.0. The XPCT scaled integer only requires two bytes of storage (since the largest number physically stored is 1000, i.e. 100.0).

If the precision for the percentage example were 2 digits after the decimal point, specify an input format of I6 (nnn.nn), allowing for the decimal point and 5 numbers, and specify VAR RANGES (0.00,100.00), which tells SIR/XS that 2 bytes are sufficient.

Even more storage may be saved with SCALED VARS on numbers that are very small but have only a small number of significant digits. For example, the specific gravity of fluids in the human body (blood, urine, etc.) is often measured with a 3 digit precision. To maintain precision in floating point, specify an INPUT FORMAT of D4.3; SIR/XS would use 8 bytes of storage because of the precision. If this variable is instead read as an "I2" integer with SCALED VARS (-3), 6 bytes are saved per value and accuracy is preserved. (The variable can hold values up to 32.763, which is ample for an S.G. measurement.)
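A short Python sketch of the scaled-integer arithmetic above (illustrative only; the byte thresholds follow the integer ranges quoted earlier):

    def stored_integer(value, scale):
        # SCALED VARS (scale): scale -1 stores tenths, -3 thousandths.
        return round(value * 10 ** -scale)

    def bytes_needed(n):
        return 1 if n <= 123 else 2 if n <= 127 * 256 else 4

    print(stored_integer(100.0, -1))                  # XPCT 100.0 -> 1000
    print(bytes_needed(stored_integer(100.0, -1)))    # fits in 2 bytes
    print(stored_integer(1.025, -3))                  # S.G. 1.025 -> 1025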
Processing Efficiency
Disk Input/Output (I/O) is the most time consuming operation on a computer and
retrievals should be designed to minimise I/O.

Using Keys
The index is used to retrieve records whenever keys are specified in a retrieval
statement. In case structured databases, the index is ordered by case, record
type and by the key variables. In caseless databases, the index is ordered by
record type and then key variables.

Efficient On-Line Access
If the key values are known, then data can be retrieved efficiently. Without
keys, an alternative access route is needed. Doing a serial search for
particular records on-line, without knowing the high level keys is a slow
process.

Efficient Batch Processing
Batch processing (the running of jobs in a non-interactive way) typically means that a user is not at a terminal waiting for the job to finish. Processing speed tends not to be of the same concern as it is for on-line access: a process that takes 2 minutes as compared to 1 minute is unlikely to worry anyone. However, there may be some concern when processing thousands of transactions in runs that last for hours.

Efficiency in Batch Data Input
Batch data input is the loading of data from files into the database through the
batch data input utilities. This can be done interactively or in batch mode.

Database Internal Structure
The records in a SIR/XS database are stored in a direct access file with an internal B-tree (indexed sequential) index. The database contains two types of blocks: data blocks and index blocks. Data blocks contain the data records; index blocks contain the information needed to access any record in the database. Blocks can be in any sequence on the disk. Within one block, records are held in sequence. The first record in each block is indexed.

Pay attention to the size of the key. A key is held for each data block in the index and a key is held for each record in each data block. Defining a very large key for one record type impacts the overall database size, regardless of the number of occurrences of that record type.

Block Organisation
SIR/XS calculates the size of data blocks and index blocks for a particular database, based on key size, maximum record size and maximum numbers of records, when it first puts any data into the database. The block size is between a minimum and a maximum (from 2K bytes to 32K bytes on all current systems). The data blocks and index blocks in a database may be different sizes, though in a particular database all data blocks are the same size and all index blocks are the same size.

The LIST STATS command gives information about the database, including the 'INDEX/DATA BLOCK SIZE'. This gives the sizes of the SIR/XS index and data blocks. Sizes are given in double words - eight bytes on current systems. Sizes do not include the control information SIR/XS holds on each block. A logical block of 2K is 256 double words; a typical size for logical blocks with small keys and small data records would be 253/254.

Data Blocks
Records are stored in blocks in the order of the keys:

    case 1    CIR of case 1
              records of type 1 within case 1
              records of type 2 within case 1
              ...
    case 2    CIR of case 2
              records of type 1 within case 2
              records of type 2 within case 2
              ...
    case 3    CIR of case 3
    ...
SIR/XS holds all of the records in a data block in sequence. To do this it constructs an extra key area at the beginning of each record and holds the keys there, separately from the data. All record key areas are the same length, which is the same as the keys held for the index.

Loading Factor
When SIR/XS needs to insert data that does not fit in the original data blocks, it creates a new block and splits the original data, leaving some space on each block. The amount of space left on a block when it is split is determined by the "Loading Factor".

Example Loading Factor Effects
If a loading factor of .99 is specified on a RELOAD, then all the blocks are approximately full. Suppose that Batch Data Input is then used to add a large amount of data at the end of the database, again with a .99 loading factor. Again all blocks are approximately full. (Blocks have to hold whole records and each record is a different length, so when a record does not fit into a block a certain amount of space is left free. This space varies from block to block.)

The loading is reported by the VERIFY FILE command; the number reported gives an average over all blocks in the database. SIR/XS does not split records across data blocks and each block contains complete records only. A block contains a mix of records. For example, an 80% full block in the EMPLOYEE database might contain data for 3 or 4 employees and as such might have, say, 4 CIRs, 4 Employee records, 7 Position records and 13 Review records. Thus the exact loading of the block depends on the exact mix of records.

Index Blocks
The key is comprised of the case id, the record type number and the key field values.

Index Levels
There are always at least two levels of index: a Master Index, which is a single index block, and a low level index. There may be up to 6 levels of index. A six level index can point to the number of keys in one index block raised to the power of 6; for example, with 36 keys in a block, a six level index copes with over 2,000,000,000 data blocks.

Secondary Indexes
All secondary indexes are held on a separate database file (.sr6). This is created when the first index is created and deleted if the last index is deleted.

Size Estimating
Once records have been added to the database, each physical data block contains
a number of different record types. For size estimation, calculating the number
of data blocks each record type would take gives a reasonable estimate of disk
space requirements.
In addition, space is required
for the dictionary and procedures but typically these are relatively small
requirements. See Procedure File
for further details on managing a procedure file.

The following steps refer to the LIST STATS output listed at the bottom of this page. First calculate the data size:

1. From LIST STATS, find the data block size in double words. For example, 254.

2. From LIST STATS, find the keysize in bytes. Convert this to double words by dividing by eight and rounding up. For example, keysize 12 is 2 double words.

3. Take the size of the record in words from LIST STATS and add the keysize from step 2 to find the space this record type takes. For example, "Size in Words" 8 plus key size 2 equals 10.

4. Divide the data block size by this figure to get records per block, then divide the number of records of each type by the records per block to get the data blocks needed for that record type.
Next calculate the index size:

1. Take Max Entries Per Index Block from LIST STATS. For example, 126.

2. Divide the number of data blocks by this figure, rounding up, to get the number of bottom level index blocks; repeat for each higher level until a single master index block remains. For example, if Max Index Entries is 40 and there are 100,000 data blocks, bottom level = 100,000/40 = 2,500.

To translate these figures into physical disk blocks or megabytes on a particular operating system, take the data block size and index block size from LIST STATS, in double words, and convert to physical blocks or bytes. There is a small overhead on each physical block such that the reported size is smaller than the real physical size. For example, a data block of 254 double words is 2K (which is four 512 byte blocks on some Windows file systems); 1,000 data blocks of 254 double words would take approximately 2 megabytes of disk space.
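The conversion, in Python (illustrative arithmetic only):

    data_blocks = 1_000
    physical_bytes = 2 * 1024                  # a 254 double word block occupies 2K
    print(data_blocks * physical_bytes)        # 2,048,000 bytes, roughly 2 megabytes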
Index PTBYNAME Verified - Entries 1852
Verify database statistics
---------------------------
Data records on database 60408
Cases on database 1852
Index blocks read 16
Data blocks read 1467
Average index block loading 0.91
Average data block loading 0.97
Warning messages 0
Correctable errors 0
Non-correctable errors 0
Verification complete with no errors
Statistics for HEART
Database name HEART
Creation Date/Time Dec 13, 2005 11:55:40
Last update Date/Time Dec 13, 2005 11:57:15
Update level 1
Average Records per Case 25000
Max/Current Number of Cases 75000/1852
Max/Current Number of Records 1875000000/60408
Max/Current Number of Record Types 117/85
Maximum Input Columns/Lines 136/30
Rectype Columns 1-3
Journaling ON
Encryption ON
Case Id Variable SSNUM (A)
Number of Index Levels 2
Max Entries Per Index Block 101
Index/Data Block Size 1011/1529
Active/Inactive Data Blocks 1467/0
Active/Inactive Index Blocks 16/0
Keysize In Bytes 72
Min/Max Record Size 1/346
Number of Temporary Variables 0
Maximum Number of Data Variables 958
Record  Record Name       Number of  Maximum   Total In  Size In  Entry Use
No.                       Variables  Per Case  Database  Words    Count
------  ----------------  ---------  --------  --------  -------  ---------
     0  CIR                       3         1      1852       36          1
     1  DEMO                     89         1      1852       96          1
     2  HOSP                    292       100      2124      140          1
     3  CLINPRES                958       100      1266      346          1
     4  CATH                    530       100       399      213          1
......