Tsearch V2 Readme

Tsearch2 - full text search extension for PostgreSQL

An online version of this document is available.

Tsearch2 is a full text search engine, fully integrated into the PostgreSQL RDBMS.

Main features

Full online update
Support for multiple table-driven configurations
Flexible and rich linguistic support (dictionaries, stop words, thesaurus)
Full multibyte (UTF-8) support
Sophisticated ranking functions with support for proximity and structure information (rank, rank_cd)
Index support (GiST and GIN) with concurrency and recovery support
Rich query language with query rewriting support
Headline support (text fragments with highlighted search terms)
Ability to plug in custom dictionaries and parsers
Template generator for tsearch2 dictionaries with snowball stemmer support
Mature (5 years of development)

In a nutshell, Tsearch2 provides a full text search (FTS) "contains" operator for two new data types, one representing a document (tsvector) and one representing a query (tsquery). Table-driven configuration allows custom searches to be created using standard SQL commands.

tsvector is a searchable data type representing a document. It is a set of unique words along with their positional information in the document, organized in a special structure optimized for fast access and lookup. Each entry can be labelled to reflect its importance in the document.

tsquery is a data type for textual queries with support for boolean operators. It consists of lexemes (optionally labelled) with boolean operators between them.
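
As a rough illustration of the two data types described above, the following Python sketch models a tsvector as a mapping from unique words to their positions, and a tsquery as a small boolean expression tree. It is a conceptual toy only and not how tsearch2 is implemented: the names make_tsvector and matches and the tuple-based query format are hypothetical, there is no parser, stemming, stop-word handling or labelling, and words are simply split on whitespace.

```python
# Toy model of the tsvector / tsquery idea. Hypothetical names; not tsearch2 code.

def make_tsvector(text):
    """Map each unique lowercase word to the list of its 1-based positions."""
    vector = {}
    for position, word in enumerate(text.lower().split(), start=1):
        vector.setdefault(word, []).append(position)
    return vector


def matches(vector, query):
    """Evaluate a boolean query against a toy tsvector.

    A query is either a plain word (str) or a tuple:
    ("and", q1, q2), ("or", q1, q2) or ("not", q).
    """
    if isinstance(query, str):
        return query in vector
    operator = query[0]
    if operator == "and":
        return matches(vector, query[1]) and matches(vector, query[2])
    if operator == "or":
        return matches(vector, query[1]) or matches(vector, query[2])
    if operator == "not":
        return not matches(vector, query[1])
    raise ValueError("unknown operator: %r" % (operator,))


doc = make_tsvector("a fat cat sat on a mat")
print(doc["a"])                                      # positions of the word "a"
print(matches(doc, ("and", "cat", ("not", "dog"))))  # cat AND NOT dog
```

Keeping the positions alongside each word is what makes proximity-aware ranking possible in principle; the toy above stores them but uses only word membership for matching.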

Table-driven configuration lets you specify:

the parser used to break a document into lexemes
which lexemes to index and how they are processed
the dictionaries to be used, along with stop-word recognition.
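
The three configurable pieces listed above can be pictured with a small Python sketch. This is an assumption-laden toy, not tsearch2's configuration tables or API: the CONFIG structure, the two token types, the function names, and the convention that a dictionary returns an empty list for a stop word and None for a token it does not recognize are all hypothetical choices made for this illustration.

```python
import re

# Toy table-driven text search configuration. All names are hypothetical.

STOP_WORDS = {"a", "the", "on"}


def simple_parser(text):
    """Yield (token_type, token) pairs; only two token types exist in this toy."""
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        yield ("number" if token.isdigit() else "word", token)


def stop_word_dictionary(token):
    """Return [] for a stop word (drop it), or None when the token is not recognized."""
    return [] if token.lower() in STOP_WORDS else None


def lowercase_dictionary(token):
    """Recognize every token and normalize it to lowercase."""
    return [token.lower()]


# The "table": which dictionaries handle which token type, in order.
# Token types missing from the table are not indexed at all.
CONFIG = {
    "parser": simple_parser,
    "dictionaries": {"word": [stop_word_dictionary, lowercase_dictionary]},
}


def to_lexemes(config, text):
    """Run the parser, then use the first dictionary that recognizes each token."""
    lexemes = []
    for token_type, token in config["parser"](text):
        for dictionary in config["dictionaries"].get(token_type, []):
            result = dictionary(token)
            if result is not None:
                lexemes.extend(result)
                break
    return lexemes


print(to_lexemes(CONFIG, "The cat sat on 2 mats"))
```

Because the behaviour lives in data (the CONFIG mapping) rather than in code, a different search setup in this toy only requires a different table, which mirrors the idea of creating custom searches with ordinary SQL commands.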

OpenFTS vs Tsearch2

OpenFTS is middleware that sits between the application and the database. OpenFTS uses tsearch2 for storage and the database engine as the query executor (searching). Everything else, i.e. parsing of documents, query processing and linguistics, is carried out on the client side. That is why OpenFTS has its own configuration table (fts_conf) and works with its own set of dictionaries. OpenFTS is more flexible, because it can be used in a multi-server architecture with separate machines for the document repository (documents can be stored in the filesystem), the database and the query engine.

See Documentation Roadmap for links to documentation.

Authors

Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University, Russia

Contributors

Robert John Shepherd and Andrew J. Kopciuch submitted "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch v2)
Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 Reference" and proposed a new naming convention for tsearch V2

Limitations

Length of a lexeme < 2 KB
Length of a tsvector (lexemes + positions) < 1 MB
The number of lexemes < 4^32
0 < positional information < 16383
No more than 256 positions per lexeme
The number of nodes (lexemes + operations) in a tsquery < 32768
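
Several of the numeric limits above are powers of two, or one less than a power of two. The short Python sketch below checks that arithmetic and applies the two position limits, read literally, to a list of positions. The constant names and the positions_within_limits helper are hypothetical, and reading the limits as powers of two is an interpretation of this sketch rather than something this README states.

```python
# Limits copied from the list above. Reading them as powers of two is an
# interpretation of this sketch, not something the README states.
LEXEME_COUNT_BOUND = 4 ** 32       # "number of lexemes < 4^32"
POSITION_BOUND = 16383             # "0 < positional information < 16383"
MAX_POSITIONS_PER_LEXEME = 256     # "no more than 256 positions per lexeme"
QUERY_NODE_BOUND = 32768           # "number of nodes in tsquery < 32768"

print(LEXEME_COUNT_BOUND == 2 ** 64)
print(POSITION_BOUND == 2 ** 14 - 1)
print(MAX_POSITIONS_PER_LEXEME == 2 ** 8)
print(QUERY_NODE_BOUND == 2 ** 15)


def positions_within_limits(positions):
    """Check one lexeme's position list against the two position limits, read literally."""
    return (len(positions) <= MAX_POSITIONS_PER_LEXEME
            and all(0 < p < POSITION_BOUND for p in positions))
```

The helper only reports whether a position list fits the stated bounds; it makes no claim about what tsearch2 itself does when a document exceeds them.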

References

GiST development site - http://www.sai.msu.su/~megera/postgres/gist
GIN development site - http://www.sigaev.ru/gin/
OpenFTS home page - http://openfts.sourceforge.net/
Mailing list - http://sourceforge.net/mailarchive/forum.php?forum=openfts-general

Documentation Roadmap

Several documents are available in the docs/ subdirectory:
- "Tsearch V2 Introduction" by Andrew Kopciuch
- "Tsearch2 Guide" by Brandon Rhodes
- "Tsearch2 Reference" by Brandon Rhodes
Readme.gendict in the gendict/ subdirectory
- Also, check the Gendict tutorial
Check the tsearch2 Wiki pages for additional documentation

Support

The authors strongly recommend using the openfts-general or pgsql-general mailing lists for questions and discussions.

Development History

Latest news

For the PostgreSQL 8.2 release we added:

Multibyte (UTF-8) support
Thesaurus dictionary
Query rewriting
The rank_cd relevance function now supports different lexeme weights
GIN support, which improves the scalability of tsearch2

Pre-tsearch era: Development of OpenFTS began in 2000, after we realized that we needed a search engine optimized for online updates with access to metadata from the database. This is essential for online news agencies, web portals, digital libraries, etc. Most available search engines use an inverted index, which is very fast for searching but very slow for online updates. Incrementally updating an inverted index is a complex engineering task, while we needed something light, free and able to access metadata from the database. The last requirement was very important because in a real-life application a search engine should always consult metadata (topic, permissions, date range, version, etc.). We use PostgreSQL extensively as a database backend and have no intention of moving away from it, so the problem was to find a data structure and a fast way to access it. PostgreSQL has a rather unique data type for storing sets (think of words), namely arrays, but lacks index access to them. During our research we found a paper by Joseph Hellerstein, who introduced an interesting data structure suitable for sets: the RD-tree (Russian Doll tree). Further research led us to the idea of using GiST to implement the RD-tree, but at that time the GiST code had been untouched for a long time and contained several bugs. After the work on improving GiST for version 7.0.3 of PostgreSQL was done, we were able to implement the RD-tree and use it for index access to arrays of integers. This implementation was ideally suited for small arrays and eliminated complex joins, but was practically useless for indexing large arrays. The next improvement came from the idea of representing a document by a single bit signature, a so-called superimposed signature (see "Index Structures for Databases Containing Data Items with Set-valued Attributes", 1997, Sven Helmer for details). We developed the contrib/intarray module and used it for full text indexing.
tsearch v1: It was inconvenient to use integer IDs instead of words, so we introduced a new data type called 'txtidx', a searchable textual data type with indexed access. This was the first step of our work on implementing a built-in PostgreSQL full text search engine. Even though tsearch v1 had many features of a search engine, it lacked configuration support and relevance ranking. People were encouraged to use OpenFTS, which provided relevance ranking based on positional information and flexible configuration. OpenFTS v.0.34 is the last version based on tsearch v1.
tsearch V2: People recognized tsearch as a powerful tool for full text searching and insisted on adding ranking support, better configurability, etc. We had already thought about moving most of the features of OpenFTS to tsearch, and in early 2003 we decided to work on a new version of tsearch. We abandoned the auxiliary index tables that OpenFTS used to store positional information and modified the txtidx type to store it internally. We added table-driven configuration, support for ispell dictionaries and snowball stemmers, and the ability to specify which types of lexemes to index. It is now possible to generate headlines of documents with highlighted search terms. These changes make tsearch more user-friendly and turn it into a really powerful full text search engine. Brandon Rhodes proposed renaming the tsearch functions for consistency, and we renamed the txtidx type to tsvector, along with other things. To give users of tsearch v1 a smooth upgrade path, we named the module tsearch2. Since version 0.35 OpenFTS uses tsearch2.