Zen: Tsearch V2 internals

Tsearch2 internals

What is a GiST-based FTS index ?

It's realized as a RD-tree with bit signatures as a key.

GiST based FTS index is lossy, i.e., tuples returned by search should be checked for false drops.

For each document we obtain signature in following way - we hash (crc32) each word into random bit in a n-bit length strings and their superimposition produces n-bit document signature. Because of hashing there is a chance, that some words hashed to the same position and it could be resulted in false hit.

Signatures, calculated for each document in collection, are stored in RD-tree (Russian Doll tree), invented by Hellerstein, which is an adaptation of the R-tree to sets. In our case transitive containment relation realized with superimposed coding (Knuth,1973) of signatures - parent is 'OR'-ed bit-strings of all children. It's clear, that parents tend to be full of '1' (degenerates) and become quite useless because of it's little selectivity. Next technique we use for search is a variation of Bloom filter. Searching performs as a bit comparison of a bit strings represented query and RD-tree entry. If all '1' of both bit strings are in the same position we say that this branch probably contains query, but if there is even one discrepancy we could definitely reject this branch. That's where we get saving and where we again could get false drops.

So, we have two reasons , why we need to mark index as lossy and consequently, should check search results for false drops. The nature of these reasons are different:

hashing and limited length of document signature (the number of unique words influence on the 'quality' of bit string)
the way we store bit strings (the number of documents influence on the 'quality' of index).

Of course, implicitly, index 'quality' depends on 'quality' of each bit string, so both reasons have influence on the performance of search. Checking for false drops could be very costly operations, because we should read tuple from disk (slow) and check if it satisfies query, which performs as a direct comparison of query and tsvector (this operation is fast because tsvector is a sort of ordered array of lexemes).

Conclusion: keep the number of unique words small. Clustering table on tsvector index might help, since we use Hamming distance to gather similar signatures together.

Update: We implemented Gin - Generalized Inverted Index and tsearch2 in PostgreSQL 8.2+ release supports inverted index !

We use Hamming distance to gather similar signatures.

A Bloom filter is a simple space-efficient data structure for representing a set in order to support membership queries. Though the filter will occasionally return a false positive, it will never return a false negative ! Performance of a Bloom filter is a trade-off between the false positive rate and the size.

B.H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM 13(7), 422-426 (1970).
Joseph M. Hellerstein, Avi Pfeffer, THE RD-TREE: An Index Structure for Sets, http://db.cs.berkeley.edu/papers/UW-CS-TR-1252.pdf

Tsearch V2 internals

Projects

Links

Tsearch2 internals