Документ взят из кэша поисковой машины. Адрес оригинального документа : http://www.sai.msu.su/~megera/postgres/talks/Full-text-search-PostgreSQL-ms-highload-2012.pdf
Дата изменения: Fri Nov 16 21:52:29 2012
Дата индексирования: Tue Feb 5 21:42:44 2013
Кодировка:

Поисковые слова: релятивистское движение
PostgreSQL



Full-text search in DB


Full-text search ­ Find documents, which satisfy query ­ return results in some order (opt.) Requirements to FTS ­ Full integration with DB core




transaction support concurrency and recovery online index

­ ­

Linguistic support Flexibility, Scalability


What is a document ?


Arbitrary textual attribute Combination of textual attributes Could be fully virtual It's a textual result of any SQL, for example ­ join of doc and authors tables
Title || Abstract || Keywords || Body || Author


Text Search Operators


Traditional FTS operators for textual attributes ~, ~*, LIKE, ILIKE

Problems
­ No linguistic support, no stop-words ­ No ranking ­ Sequential scan all documents, slow

Solution
­ Preprocess document in advance ­ Add index support


FTS in PostgreSQL


set of rules how document and query should be transformed to their FTS representations ­ tsvector, tsquery set of functions to obtain tsvector, tsquery from textual data types FTS operators and indexes ranking functions, headline








FTS in PostgreSQL
=# select 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat':: tsquery;

­ ts · · · ­ ts

vector ­ storage for document, optimized for search sorted array of lexemes positional information weights information query ­ textual data type for query · Boolean operators - & | ! () ­ FTS operator tsvector @@ tsquery


FTS features


Full integration with PostgreSQL 27 built-in configurations for 10 languages Support of user-defined FTS configurations Pluggable dictionaries ( ispell, snowball, thesaurus ), parsers Relevance ranking GiST and GIN indexes with concurrency and recovery support Rich query language with query rewriting support






FTS in PostgreSQL


OpenFTS -- 2000, Pg as a storage GiST index -- 2000, thanks Rambler Tsearch -- 2001, contrib:no ranking Tsearch2 -- 2003, contrib:config GIN --2006, thanks, JFG Networks FTS -- 2006, in-core, thanks,EnterpriseDB E-FTS -- Enterprise FTS, thanks ???


ACID overhead is really big :(


Foreign solutions: Sphinx, Solr, Lucene....
­ ­ ­ ­

Crawl database and index (time lag) No access to attributes Additional complexity BUT: Very fast !



Can we improve native FTS ?


Can we improve native FTS ?
1. Find relevant documents
1. Index scan -- usually very fast

2. Calculate ranks for all founded documents
1. Heap scan -- usually slow

3. Sort documents


Can we improve native FTS ?
156676 Wikipedia articles: postgres=# explain analyze SELECT docid, ts_rank(text_vector, to_tsquery('english', 'title')) AS rank FROM ti2 WHERE text_vector @@ to_tsquery('english', 'title') ORDER BY rank DESC LIMIT 3;

Limit (cost=8087.40..8087.41 rows=3 width=282) (actual time=433.750..433.752 rows= -> Sort (cost=8087.40..8206.63 rows=47692 width=282) (actual time=433.749..433 Sort Key: (ts_rank(text_vector, '''titl'''::tsquery)) Sort Method: top-N heapsort Memory: 25kB -> Bitmap Heap Scan on ti2 (cost=529.61..7470.99 rows=47692 width=282) ( Recheck Cond: (text_vector @@ '''titl'''::tsquery) -> Bitmap Index Scan on ti2_index (cost=0.00..517.69 rows=47692 wi Index Cond: (text_vector @@ '''titl'''::tsquery) Total runtime: 433.787 ms


?


Can we improve native FTS ?
156676 Wikipedia articles: postgres=# explain analyze SELECT docid, ts_rank(text_vector, to_tsquery('english', 'title')) AS rank FROM ti2 WHERE text_vector @@ to_tsquery('english', 'title') ORDER BY text_vector>< plainto_tsquery('english','title') LIMIT 3;

What if we have this plan ?
Limit ->

(cost=20.00..21.65 rows=3 width=282) (actual time=18.376..18.427 rows=3 loop Index Scan using ti2_index on ti2 (cost=20.00..26256.30 rows=47692 width=28 Index Cond: (text_vector @@ '''titl'''::tsquery) Order By: (text_vector >< '''titl'''::tsquery) Total runtime: 18.511 ms

We'll be FINE !


We'll be FINE !

Teach index (GIN) to calculate ranks and returns results in sorted order Store positions in index -- no need for tsvector column, use compression to keep index small Change algorithms and interfaces



We'll be FINE !


Additional benefit
­

T(rare_word & frequent_word) ~ T(rare_word)


Inverted Index


Inverted Index

QUERY: compensation accelerometers INDEX: accelerometers 5,10,25,28,30,36,58,59,61,73,74 RESULT: compensation 30,68

30


No positions in index !

Inverted Index in PostgreSQL
E N T R Y T R E E

Posting list Posting tree


Summary of changes
· GIN
­ storage ­ search ­ ORDER BY ­ interface

· planner


GIN structure changes


Add additional information (word positions)


ItemPointer
typedef struct ItemPointerData { BlockIdData ip_blkid; OffsetNumber ip_posid; } typedef struct BlockIdData { uint16 bi_hi; uint16 bi_lo; } BlockIdData;

6 bytes


WordEntryPos
/* * Equivalent to * typedef struct { * uint16 * weight:2, * pos:14; *} */ typedef uint16 WordEntryPos;

2 bytes


BlockIdData compression


OffsetNumber compression

O0-O15 ­ OffsetNumber bits N ­ Additional information NULL bit


WordEntryPos compression

P0-P13 ­ position bits W0,W1 ­ weight bits


Example


GIN algorithm changes


Top-N queries
1. Scan + calc rank 2. Sort 3. Return using gingettuple one by one


Fast scan
entry1 && entry2


GIN interface changes


extractValue
Datum *extractValue ( Datum itemValue, int32 *nkeys, bool **nullFlags, Datum *addInfo, bool *addInfoIsNull )


extractQuery
Datum *extractValue ( Datum query, int32 *nkeys, StrategyNumber n, bool **pmatch, Pointer **extra_data, bool **nullFlags, int32 *searchMode, ???bool **required??? )


consistent
bool consistent ( bool check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool nullFlags[], Datum addInfo[], bool addInfoIsNull[] )


calcRank
float8 calcRank ( bool check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool nullFlags[], Datum addInfo[], bool addInfoIsNull[] )


???joinAddInfo???
Datum joinAddInfo ( Datum addInfos[] )


Planner optimization


Before
test=# EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM test ORDER BY slow_func(x,y) LIMIT 10; QUERY PLAN ------------------------------------------------------------------------------------------------------------------Limit (cost=0.00..3.09 rows=10 width=16) (actual time=11.344..103.443 rows=10 loops=1) Output: x, y, (slow_func(x, y)) -> Index Scan using test_idx on public.test (cost=0.00..309.25 rows=1000 width=16) (actual time=11.341..103.422 rows=10 loops=1) Output: x, y, slow_func(x, y) Total runtime: 103.524 ms (5 rows)


After
test=# EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM test ORDER BY slow_func(x,y) LIMIT 10; QUERY PLAN ------------------------------------------------------------------------------------------------------------------Limit (cost=0.00..3.09 rows=10 width=16) (actual time=0.062..0.093 rows=10 loops=1) Output: x, y -> Index Scan using test_idx on public.test (cost=0.00..309.25 rows=1000 width=16) (actual time=0.058..0.085 rows=10 loops=1) Output: x, y Total runtime: 0.164 ms (5 rows)


Testing results


avito.ru: 6.7 mln. docs


With tsvector column
SELECT itemid, title FROM items WHERE fts @@ plainto_tsquery('russian', ' ') ORDER BY ts_rank(fts, plainto_tsquery('russian', ' ')) DESC LIMIT 10;


With tsvector column without patch,
Limit (cost=2341.92..2341.94 rows=10 width=398) (actual time=38.532..38.5 Buffers: shared hit=12830 -> Sort (cost=2341.92..2343.37 rows=581 width=398) (actual time=38.53 Sort Key: (ts_rank(fts, ''''' & '''''::tsquery)) Sort Method: top-N heapsort Memory: 26kB Buffers: shared hit=12830 -> Bitmap Heap Scan on items (cost=48.50..2329.36 rows=581 widt Recheck Cond: (fts @@ ''''' & '''''::tsquery) Buffers: shared hit=12830 -> Bitmap Index Scan on fts_idx (cost=0.00..48.36 rows=58 Index Cond: (fts @@ ''''' & '''''::tsquery) Buffers: shared hit=116 Total runtime: 3 8 . 5 6 9 m s


With tsvector column with patch,
Limit (cost=40.00..80.28 rows=10 width=400) (actual time=11.528..11.536 Buffers: shared hit=374 -> Index Scan using fts_idx on items (cost=40.00..2863.77 rows=701 Index Cond: (fts @@ ''''' & '''''::tsquery) Order By: (fts >< ''''' & '''''::tsquery) Buffers: shared hit=374 Total runtime: 1 1 . 5 6 1 m s


Without tsvector column
SELECT itemid, title FROM items2 WHERE (setweight(to_tsvector('russian'::regconfig, title), 'A'::"char") || setweight(to_tsvector( 'russian'::regconfig, description), 'B'::"char")) @@ plainto_tsquery('russian', ' ') ORDER BY ts_rank((setweight(to_tsvector('russian'::regconfig, title), 'A'::"char") || setweight(to_tsvector('russian'::regconfig, description), 'B'::"char")), plainto_tsquery('russian', ' ')) DESC LIMIT 10;


Without tsvector column, without patch
Limit (cost=2596.31..2596.33 rows=10 width=372) (actual time=862.520..86 Buffers: shared hit=6875 -> Sort (cost=2596.31..2597.91 rows=642 width=372) (actual time=862.5 Sort Key: (ts_rank((setweight(to_tsvector('russian'::regconfig, Sort Method: top-N heapsort Memory: 26kB Buffers: shared hit=6875 -> Bitmap Heap Scan on items2 (cost=48.98..2582.44 rows=642 wid Recheck Cond: ((setweight(to_tsvector('russian'::regconfig, Buffers: shared hit=6875 -> Bitmap Index Scan on fts_idx2 (cost=0.00..48.82 rows=6 Index Cond: ((setweight(to_tsvector('russian'::regconfi Buffers: shared hit=116 Total runtime: 8 6 2 . 5 5 1 m s


Without tsvector column, with patch

Limit (cost=40.02..80.43 rows=10 width=373) (actual time=11.298..11.304 Buffers: shared hit=374 -> Index Scan using fts_idx2 on items2 (cost=40.02..2771.68 rows=676 Index Cond: ((setweight(to_tsvector('russian'::regconfig, title), Order By: ((setweight(to_tsvector('russian'::regconfig, title), Buffers: shared hit=374 Total runtime: 1 1 . 3 2 1 m s


avito.ru: tests
Without With patch With patch patch without tsvector Table size Index size Index build time Queries in 8 hours 6.0 GB 1.29 GB 216 sec 3,0 mln. 6.0 GB 1.27 GB 303 sec 42.7 mln. 2.87 GB 1.27 GB 718sec 42.7 mln. Sphinx 1.12 GB 180 sec* 32.0 mln.


Anonymous source: 18 mln. docs


Anonymous source: tests
Without patch Table size Index size Index build time Queries in 8 hours 18.2 GB 2.28 GB 258 sec 2.67 mln. With patch 18.2 GB 2.30 GB 684 sec 38.7 mln. With patch without tsvector 11.9 GB 2.30 GB 1712 sec 38.7 mln. Sphinx

3.09 GB 481 sec* 26.7 mln.


!!!


Status & Availability
· 150 Kb patch for 9.3 · Datasets and workloads are welcome


Plans & TODO
· Fix recovery · Fix fastupdate · Fast scan interface · Accelerate index build if possible · Partial match support


Thanks! Sponsors are welcome!