Документ взят из кэша поисковой машины. Адрес
оригинального документа
: http://www.sai.msu.su/~megera/postgres/fts/doc/fts-overview.html
Дата изменения: Unknown Дата индексирования: Sun Apr 13 07:48:00 2008 Кодировка: Поисковые слова: mars global surveyor |
Full-Text Search in PostgreSQL: A Gentle Introduction | ||||
---|---|---|---|---|
Prev | Fast Backward | Chapter 1. FTS Introduction | Fast Forward | Next |
Text search operators in database existed for years. PostgreSQL has ~,~*, LIKE, ILIKE operators for textual datatypes, but they lack many essential properties required for modern information system:
there is no linguistic support, even in english, regular expressions are not enough - satisfies -> satisfy, for example. You may miss documents, which contains word satisfies, although certainly would love to find them when search for satisfy. It is possible to use OR to search any of them, but it's boring and ineffective (some words could have several thousands of derivatives).
they provide no ordering (ranking) of search results, which makes them a bit useless, unless there are only a few documents found.
they tends to be slow, since they process all documents every time and there is no index support.
The improvements to the FTS came from the idea to preprocess document at index time to save time later, at a search stage. Preprocessing includes:
Parsing document to lexemes. It's useful to distinguish various kinds of lexemes, for example, digits, words, complex words, email address, since different types of lexemes can be processed different. It's useless to attempt normalize email address using morphological dictionary of russian language, but looks reasonable to pick out domain name and be able to search for domain name. In principle, actual types of lexemes depend on specific applications, but for plain search it's desirable to have predefined common types of lexemes.
Applying linguistic rules to normalize lexeme to their infinitive form, so one should not bother entering search word in specific form. Taking into account type of lexeme obtained before provides rich possibilities for normalization.
Store preprocessed document in a way, optimized for searching, for example, represent document as a sorted array of lexemes. Along with lexemes itself it's desirable to store positional information to use it for proximity ranking, so that document which contains more "dense" region with query words assigned a higher rank than one with query words scattered all over.
PostgreSQL is an extendable database, so it's natural to introduce a new data types (Section 1.2.1) tsvector for storing preprocessed document and tsquery for textual queries. Also, full-text search operator (FTS) @@ is defined for these data types (Section 1.2.2). FTS operator can be accelerated using indices (Section 2.7).
tsvector is a data type, which represents document, and optimized for FTS. In simple phrase, tsvector is a sorted list of lexemes, so even without index support full text search should performs better than standard ~,LIKE operators.
=# select 'a fat cat sat on a mat and ate a fat rat'::tsvector; tsvector ---------------------------------------------------- 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
Notice, that space is also lexeme !
=# select 'space '' '' is a lexeme'::tsvector; tsvector ---------------------------------- 'a' 'is' ' ' 'space' 'lexeme'
Each lexeme, optionally, could have positional information, which used for proximity ranking.
=# select 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector; tsvector ------------------------------------------------------------------------------- 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
Each position of a lexeme can be labeled by one of 'A','B','C','D', where 'D' is default. These labels can be used to indicate group membership of lexeme with different importance or rank, for example, reflect document structure. Actually, labels are just a way to differentiate lexemes. Actual values will be assigned at search time and used for calculation of document rank. This is very convenient to control and tune search machine.
Concatenation operator - tsvector || tsvector "constructs" document from several parts. The order is important if tsvector contains positional information. Of course, using SQL join operator, it is possible to "build" document using different tables.
=# select 'fat:1 cat:2'::tsvector || 'fat:1 rat:2'::tsvector; ?column? --------------------------- 'cat':2 'fat':1,3 'rat':4 =# select 'fat:1 rat:2'::tsvector || 'fat:1 cat:2'::tsvector; ?column? --------------------------- 'cat':4 'fat':1,3 'rat':2
Tsquery is a data type for textual queries with support of boolean operators - & (AND), | (OR), parenthesis. Tsquery consists of lexemes (optionally labeled by letter[s]) with boolean operators between.
=# select 'fat & cat'::tsquery; tsquery --------------- 'fat' & 'cat' =# select 'fat:ab & cat'::tsquery; tsquery ------------------ 'fat':AB & 'cat'
Labels could be used to restrict search region, which allows to develop different search engines using the same full text index.
tsqueries could be concatenated using && (AND-ed) and || (OR-ed) operators.
test=# select 'a & b'::tsquery && 'c|d'::tsquery; ?column? --------------------------- 'a' & 'b' & ( 'c' | 'd' ) test=# select 'a & b'::tsquery || 'c|d'::tsquery; ?column? --------------------------- 'a' & 'b' | ( 'c' | 'd' )
FTS in PostgreSQL provides operator @@ for the two data types - tsquery and tsvector, which represents, correspondingly, document and query. Also, FTS operator has support of TEXT,VARCHAR data types, which allows to setup simple full-text search, but without ranking support.
tsvector @@ tsquery tsquery @@ tsvector text|varchar @@ text|tsquery
Full text search operator @@ returns TRUE if tsvector contains tsquery.
=# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; ?column? ---------- t =# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; ?column? ---------- f