Search Engine

The purpose of Search Engine is to perform fulltext and other indexing and searching. Indexes are disk-based with focus on low memory usage, speed and crash-safety. Use cases inlude m2 indexing, cache management or a fulltext history search.

Some of the classes can be used separately for other purposes, for example TVector, BlockStorage or WordSegmenter.

API

The diagram bellow shows the general structure of this module. See doxygen generated documentation for details or examples in chapter 4. class diagram

Every class has its appropriate .ot file. Unlike some other code, OP_ASSERTs aren't placed on unnecessary places, so whenever you see an assertion, it's a serious problem. Please report it to the module owner.

class BlockStorage

BlockStorage is a crash-safe file encapsulation with transaction support. File is divided into blocks of same size. The details of the block handling are transparent to user, but it's suitable to choose block size close to the size of the stored records. If the size of the block is too big, you loose disk space, if the size of the block is too small, you loose performance. If the record is too big to fit into one block, the blocks are chained. The first block with offset 0 is reserved for internal usage, so the first valid block is the second one. Each block contains a 64 bit pointer to the next block, the first block in this chain also contains a 32 bit length of the data. The reserved block at offset 0 contains a 64 bit pointer to the first free block, 32 bit header and 32 bit size of a block. It may also contain user header data.

template class TVector

TVector is to a big extent compatible with OpVector. In addition, it can also hold non-pointer types like integers or structures an provides methods for sorting and searching in the sorted vector.

class ACT

ACT is a shortcut of Array Compacted Trie. Trie is an array of pointers indexed by appropriate character from a word. Each pointer points to array for next character. See http://www.n3labs.com/pdf/fast-and-space-efficient.pdf for more information about ACTs. The main advantage is that the search time doesn't depend on the size of indexed data.

class ACT is used for storing words and IDs of the words. It's possible to delete words from ACT, but you have to know the word you are going to delete. ACT is built on the top of BlockStorage.

class SingleBTree

SingleBTree is a multiway search tree, B-tree, stored on disk.

class TBtree

TBtree is obtained via class TPool, which serves as a storage for multiple B-trees in one file.

class StringTable

StringTable looks up files contating a given word. It consists of two parts, ACT and DiskBTree, stored in files .act and .bt. ACT contains the words with pointers to DiskBTree. Each word is stored once for all the files together. Every DiskBTree contains file IDs of the files containing the word.

class BSCursor

BSCursor uses BlockStorage to store structured records. Items in the record can have both fixed an varibale length.

class WordSegmenter

WordSegmenter divides plain text into separate words. It contains a recognition of email addresses and URLs as well. Bigrams and trigrams are returned for scripts without spaces between words, such as Chinese.

class VisitedSearch

VisitedSearch is an implementation of Google-like fulltext searching and indexing, returning URLs with other metadata as a result of a search query.

struct RankIndex

RankIndexes manages index files holding part of the fulltext index. VisitedSearch index consists of several such subindexes.

Memory usage

OOM policy

The vast majority of methods returns OP_STATUS to signal OOM. An exception is SearchIterator, which uses its method Error to signal OOM.

OOM handling

Generally all errors, including OOM, are propagated to calling layer. When OOM happens in ACT or DiskBTree, the class tries firstly to solve the problem by cleaning its internal cache.

Heap memory usage

The indexes are disk-based, so they don't require a lot of memory except caches. The caches can be reduced by tweaks on a limited memory device.

Stack memory usage

The stack is used moderately, but there are recursive functions in ACT and BTree. The recursion in BTree isn't so deep, but it might make problems in ACT for long strings (e.g. URLs) during prefix searching. The limit is 100 recursive calls.

Caches

There are caches in VisitedSearch, StringTable, ACT and BTreePool. The caches in VisitedSearch and StringTable are freed explicitly by calling Commit. The cache in ACT and BTreePool, descendants of BSCache, is handled internally. If these classes were used directly, their cache can be freed by calling Flush.

Memory tuning

It's possible to change the maximum size of ACT and BTreePool caches. However, it is not recommended, since the current sizes come from a numeber experiments.

ACT implementation

When indexing or searching for a word, the root TrieBranch is loaded at first. Branch is an array of TrieNodes. The TrieNodes are indexed by characters of the word. So you take the first character and go to characterth position in the root TrieBranch. The TrieNode on this position contains information where to go with the next character. It might point to another branch. In this case you load the branch and repeat the process. The TrieNode for next character might also lay in the same branch. In this case, you read on offset from the current TrieNode and add this offset to the next character to find out the position. Some of the TrieNodes might be free. It means that no word with character appropriate for the position have been indexed. The number of the free nodes is about 60% in average.

See also description of the file formats used by Search Engine.

Code samples

BlockStorage


	BlockStorage bs;
	OpFileLength pos1, pos2;
	int len;

	bs.Open(UNI_L("filename"), BlockStorage::OpenReadWrite, 256);  //size of one block is 256

	pos1 = bs.Write(buf1, 244);  //this will fit into one block
	pos2 = bs.Write(buf2, 256);  //this will go to two blocks
	bs.Delete(pos1);             //delete the first record
	bs.Update(buf3, 300, pos2);  //update the second record

	len = bs.DataLength(pos2);  //length of the record
	
	bs.Read(buf, sizeof(buf), pos2);  //read the data, but don't exceed size of the buffer

	bs.Close();

SingleBTree


	SingleBTree<128, int> b;
	SearchIterator<int> *search_result;
	
	b.Open(UNI_L("filename"), BlockStorage::OpenReadWrite);
	
	b.Insert(5);
	b.Insert(2);
	b.Insert(4);
	
	b.Commit();
	
	search_result = b.Search(2, operatorGT);
	
	if (search_result != NULL && !search_result->Empty())
	do {
		printf("%i", search_result->Get());
	} while (search_result->Next());  // output is 45
	
	delete search_result;

Cursor


	BlockStorage bs;
	BSCursor c(&bs);
	BSCursor::ID id;

	c.AddField("int16", 2);  //2B field
	c.AddField("string", 0);  //variable-length field
	c.AddField("int32", 4);  //4B field

	//block header = 12B, int16+int32 = 6B and the size of the string will be usually less than 41B,
	//hence the block size will be 64B
	bs.Open(UNI_L("filename"), BlockStorage::OpenReadWrite, 64);

	c.Create();                                             //initialize a new record
	c["int16"].SetValue(4);                                 //set the first field
	c[2].SetValue(7);                                       //set the third field
	c["string"].SetStringValue("string of 23 characters");  //set the second field
	c.Flush();                                              //write the record to disk
	id = c.GetID();                                         //you can call GetID after a successfull Flush

	bs.Close();  //don't use c from here

See the .ot files for more examples of usage.