Format of files for indexing/searching

This document describes the binary file formats introduced with Opera version 9 for various files used for indexing/searching.

The new format of these files is based on dividing files into blocks of same size. The blocks for one record might be chained if the size of one block isn't enough for the record. This way it's easy to perform transactions on the files.

The format is NOT backwards compatible with any previous Opera versions. Conversions to the new format are performed after an update from an old version.

Format in General

Files consist of chained blocks of the same size. The first block contains headers. Structure of the first block is:

8B pointer to the first free block or 0 if there isn't any free block
4B 3154051328, magical number used for identification of the file
4B size of the blocks
optional headers may follow

The other blocks may either contain valid data or they may be empty. The last block is never empty. Their structure is:

8B pointer to the next block or 0 if this is the last block in the chain
4B/0B size of the current record, this field is present only in the first block of the chain
data follow

.act files

These files contain words and with pointers to to .bt files. The records in them are of fixed size, they always fit to one block. The record is an array of fields. The structure of the field is in table bellow.

4B/8B pointer to the next record (one of DiskChild or MemoryChild is set) or offset (both flags DiskChild and MemoryChild are 0) in the current record
2B flags and parent of the current node; value parent is 0 if the parent is in another branch; bits for flags are:
15 Final this node doesn't have any children
14 IsWord characters up to this one are an indexed word
13 MemoryChild not used on disk
12 DiskChild pointer is a branch number
11 Free this node is unsused
4B word ID, valid if flag IsWord is set

The first field of the array contains number of not free fields in its word ID part and the rest of it is unused.

There is an optional header at offset 16. Its size is 19*4B. It is used to save a status of a generator of pseudo-random numbers.

.bt files

These files contain BTrees of file IDs containing the appropriate word. The word ID from ACT is a number of the first block in a chain of file IDs for the word. File IDs are 32 bit integers. The block structure contains 32 bit number of the rightmost child followed by pairs of file ID and pointer to its left child. The pointers to children are positions of the children in the file divided by size of the blocks.

There are several BTrees in the file. There is no indication of roots, but the roots are pointed from the .act.