Документ взят из кэша поисковой машины. Адрес
оригинального документа
: http://www.sai.msu.su/~megera/wiki/tp_search_en
Дата изменения: Unknown Дата индексирования: Sun Apr 10 17:10:19 2016 Кодировка: Поисковые слова: п п п п п п п п п п п п п п п п п п п п п п п |
The goal of search is to find all objects, ordered by their similarity to a query. In general, similarity function is an arbitrary function depending on semantic of application.
Plain full text search algorithm has problem if textual information of object is represented by a separate text attributes (columns) in table, which searched independently. Highly relevant objects could be easily missed from search results, because no separate attribute could contains all words from query, while combined text string could satisfy search query ! Also, in practice, different attributes are of different importance and resulted similarity (rank) should utilize this information to meet users expectation.
For example, consider hotel search on objects with attributes (ordered by importance):
Query " Moscow's celebrations hotel Moscow, Russia" would failed if search attributes separately, but could be succeeded if search their concatenation. It's possible to use naive approach with somewhat automatic recognition of "keywords" and removing them from query, but it's very complex, error-prone and we'll not consider it. Instead, we'll show how to apply standard full text search approach to such kinds of problem.
If search purpose is to find maximum possible objects, then original query could be rewritten using following rule:
all logical AND are replaced by logical OR except AND in combined keywords.
For example, query "New York Hilton resorts" transforms to "(new & york) | hilton | hotel". Here, "new york" is recognized keyword and should be 'AND'-ed to avoid false hits. ( Notice, that word "resorts" replaced by word "hotel" in our example is just illustration of using of synonym dictionary and doesn't relevant to query rewrite.)
Proximity based ranking functions, which use distance between words for ranking, will automatically rank search results in "right way" and using different weigths for different attributes will correct rank in application specific way.
Tsearch2 is a full text extension to PostgreSQL, which provides almost everything needed for implementing search system described above (only relevant features listed):
See [Tsearch2 notes] for details of usage.
For example, full text index hotelidx for hotel search (see above) for hotel with following attributes:
would be:
'russia:1A moscow:3B,4 celebration:5 hotel:6'
It was obtained using:
setweight( to_tsvector('Russia','A') ) ||\ setweight( to_tsvector('Moscow','B') ||\ to_tsvector('Moscow\s celebrations hotel','B')
Letters "A" and "B", here, designate weights, default values for weight classes are A=1.0, B=0.4, C=0.2, D=0.1 and could be changed in rank function.
Unfortunately, tsearch2 doesn't provides query rewriting and this is what we plan to implement as an additional function.
Load some data:
insert into hotels values ('Moscow''s celebrations hotel', 'Russia', 'Moscow',null); insert into hotels values ('Hilton hotel', 'Russia', 'Moscow', null); insert into hotels values ('Hilton hotel', 'Canada', 'Toronto', null); insert into hotels values ('Moscow hotel', 'Canada', 'Toronto', null); update hotels set idx=setweight( to_tsvector(country), 'A' ||\ setweight(to_tsvector(place), 'B' ) ||\ to_tsvector(name); # select name, country, place from hotels ; name | country | place -----------------------------+---------+--------- Moscow's celebrations hotel | Russia | Moscow Hilton hotel | Russia | Moscow Hilton hotel | Canada | Toronto Moscow hotel | Canada | Toronto
Simple search for "Moscow":
# select name, country, place, rank( idx, ftsquery ) as r from hotels, to_tsquery('Moscow') as ftsquery where idx @@ ftsquery order by r desc; name | country | place | r -----------------------------+---------+---------+------ Moscow's celebrations hotel | Russia | Moscow | 0.46 Hilton hotel | Russia | Moscow | 0.4 Moscow hotel | Canada | Toronto | 0.1 (3 rows)
Query "Moscow hotel" (rewritten by hand):
# select name, country, place, rank( idx, ftsquery ) as r from hotels, to_tsquery('Moscow | hotel') as ftsquery where idx @@ ftsquery order by r desc; name | country | place | r -----------------------------+---------+---------+------- Moscow's celebrations hotel | Russia | Moscow | 0.514 Hilton hotel | Russia | Moscow | 0.46 Moscow hotel | Canada | Toronto | 0.19 Hilton hotel | Canada | Toronto | 0.1 (4 rows)
Notice: