Документ взят из кэша поисковой машины. Адрес оригинального документа : http://www.sai.msu.su/~megera/wiki/tp_search_en
Дата изменения: Unknown
Дата индексирования: Sun Apr 10 17:10:19 2016
Кодировка:

Поисковые слова: virgo cluster
Zen: tp search en

tp search en

FTS with most maximum completeness and relevancy algorithm using tsearch2

The Theory

The goal of search is to find all objects, ordered by their similarity to a query. In general, similarity function is an arbitrary function depending on semantic of application.

Plain full text search algorithm has problem if textual information of object is represented by a separate text attributes (columns) in table, which searched independently. Highly relevant objects could be easily missed from search results, because no separate attribute could contains all words from query, while combined text string could satisfy search query ! Also, in practice, different attributes are of different importance and resulted similarity (rank) should utilize this information to meet users expectation.

For example, consider hotel search on objects with attributes (ordered by importance):

  • country name
  • place name
  • hotel name
  • hotel description

Query " Moscow's celebrations hotel Moscow, Russia" would failed if search attributes separately, but could be succeeded if search their concatenation. It's possible to use naive approach with somewhat automatic recognition of "keywords" and removing them from query, but it's very complex, error-prone and we'll not consider it. Instead, we'll show how to apply standard full text search approach to such kinds of problem.

If search purpose is to find maximum possible objects, then original query could be rewritten using following rule:

all logical AND are replaced by logical OR except AND in combined keywords.

For example, query "New York Hilton resorts" transforms to "(new & york) | hilton | hotel". Here, "new york" is recognized keyword and should be 'AND'-ed to avoid false hits. ( Notice, that word "resorts" replaced by word "hotel" in our example is just illustration of using of synonym dictionary and doesn't relevant to query rewrite.)

Proximity based ranking functions, which use distance between words for ranking, will automatically rank search results in "right way" and using different weigths for different attributes will correct rank in application specific way.

The Practice

Tsearch2 is a full text extension to PostgreSQL, which provides almost everything needed for implementing search system described above (only relevant features listed):

  • to_tsvector function could be used to produce combined full text index
  • setweight function could be used to specify different weights for attributes.Notice, that exact value of "weights" are specified in rank function !
  • rank function calculates rank of objects based on proximity, frequency of query word occurence and other heuristics.

See [Tsearch2 notes] for details of usage.

For example, full text index hotelidx for hotel search (see above) for hotel with following attributes:

  • "russia" is a country name,
  • "moscow" - place name,
  • "moscow celebration hotel" - hotel name
  • hotel description is empty

would be:

'russia:1A moscow:3B,4 celebration:5 hotel:6'

It was obtained using:

setweight( to_tsvector('Russia','A') ) ||\
setweight( to_tsvector('Moscow','B')   ||\
to_tsvector('Moscow\s celebrations hotel','B')

Letters "A" and "B", here, designate weights, default values for weight classes are A=1.0, B=0.4, C=0.2, D=0.1 and could be changed in rank function.

Unfortunately, tsearch2 doesn't provides query rewriting and this is what we plan to implement as an additional function.

The Example

Load some data:

insert into hotels values ('Moscow''s celebrations hotel', 'Russia', 'Moscow',null);
insert into hotels values ('Hilton hotel', 'Russia', 'Moscow', null);
insert into hotels values ('Hilton hotel', 'Canada', 'Toronto', null);
insert into hotels values ('Moscow hotel', 'Canada', 'Toronto', null);
update hotels set idx=setweight( to_tsvector(country), 'A' ||\
                      setweight(to_tsvector(place), 'B' )  ||\
                                to_tsvector(name);

# select name, country, place from hotels ;
             name             | country |  place
-----------------------------+---------+---------
  Moscow's celebrations hotel | Russia  | Moscow
  Hilton hotel                | Russia  | Moscow
  Hilton hotel                | Canada  | Toronto
  Moscow hotel                | Canada  | Toronto

Simple search for "Moscow":

# select name, country, place, rank( idx, ftsquery ) as r from hotels,
to_tsquery('Moscow') as ftsquery where idx @@ ftsquery order by r desc;
             name             | country |  place  |  r
-----------------------------+---------+---------+------
  Moscow's celebrations hotel | Russia  | Moscow  | 0.46
  Hilton hotel                | Russia  | Moscow  |  0.4
  Moscow hotel                | Canada  | Toronto |  0.1
(3 rows)

Query "Moscow hotel" (rewritten by hand):

# select name, country, place, rank( idx, ftsquery ) as r from hotels,
to_tsquery('Moscow | hotel') as ftsquery where idx @@ ftsquery order by r desc;
             name             | country |  place  |   r
-----------------------------+---------+---------+-------
  Moscow's celebrations hotel | Russia  | Moscow  | 0.514
  Hilton hotel                | Russia  | Moscow  |  0.46
  Moscow hotel                | Canada  | Toronto |  0.19
  Hilton hotel                | Canada  | Toronto |   0.1
(4 rows)

Notice:

  • It's recommended to use different tsearch2 configurations for different attributes. For example, for hotel search it's normal to ignore word "hotel", i.e. put it into list of stop words, or synonym dictionary for place search could know synonyms for "st - saint".
  • tsearch2 provides support for morphology and stemming, which could improve completeness of search results.

The Case Study

tp search doc