Äîêóìåíò âçÿò èç êýøà ïîèñêîâîé ìàøèíû. Àäðåñ îðèãèíàëüíîãî äîêóìåíòà : http://www.sai.msu.su/~megera/postgres/talks/2009.pdf
Äàòà èçìåíåíèÿ: Thu May 21 22:20:53 2009
Äàòà èíäåêñèðîâàíèÿ: Sat Oct 17 04:16:26 2009
Êîäèðîâêà:
Some recent advances in full-text search
Oleg Bartunov, Teodor Sigaev
Sternberg Astronomical Institute, Moscow State University, Russia

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Talk roadmap



Full-text search introduction Main topics
­ ­



Phrase Search Dictionaries API



New features (already in 8.4) Future features Tips and Tricks





Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Full-text search in PostgreSQL
=# select 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat':: tsquery;

· tsvector ­ storage for document ­ sorted array of lexemes with optional positional and weight information · tsquery ­ textual data type for query ­ Boolean operators - & | ! () ''telefonsvarer' ='> lefon' & telefonsvarer' | te · FTS operator tsvector @@ tsquery · to_tsvector, to_tsquery, plainto_tsquery · Indexes: GiST, GIN
Oleg Bartunov, Teodor Sigaev

'svar'

PGCon, Ottawa, May 21-22,2009


Talk roadmap



Full-text search introduction Main topics
­ ­



Phrase Search Dictionaries API



New features (already in 8.4) Future features Tips and Tricks





Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Phrase search - definition
A $ B: word 'A' followed by 'B': - A & B (the same priority) - exists at least one pair of positions PB, PA , so that 0 PB ­ PA 1 (distance condition) A $[n] B: 0 PB ­ PA n Result of operation: - false - true and array of positions of left argument which satisfy distance condition (without positional information $ is equivalent to &) $ is very similar to & except: A $ B B $ A
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Phrase search - properties
'A $[n] B $[m] C' '(A $[n] B) $[m] C' matched phrase length max(n, m) Note: 'A C B' matched by '(A $[2] B) $ C' 'A $[n] (B $[m] C') matched phrase length n + m Note: Order is preserved for any n, m 'A $[0] B' matches the word with two different forms ( infinitives )
=# S t - --{bo t o_t E s o s L _ k v E l i e C e n c T x g t i , o t z b r s e o ( _lexi ze('is pell ','b ooking s'); ---ok} 'book ings') @@ 'boo king $ [0] b ook': :tsque ry
PGCon, Ottawa, May 21-22,2009

Oleg Bartunov, Teodor Sigaev


Phrase search - practice

Phrase: - 'A B C' 'A $ (B $ C)' - 'A B C' '(A $ B) $[2] C' - TSQUERY phraseto_tsearch([CFG,] TEXT) Stop-words: 'A the B' 'A $[2] B' What shall we do with complex queries? A $ ( B & ( C | ! D ) ???

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Phrase search - internals

Phrase search has overhead, since it requires access and operations on posting lists To avoid slowdown of existing tsearch, executor of tsquery should not access positions without necessity. To facilitate this, any $ operations pushed down in query tree, so tsearch executor can call special phrase executor for the top $ operation, which will work only with query tree containing only $ operations.

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Phrase search - transformation
( (A $ B) $ (C | D) ) & F & $ $ A B C F | D A
Oleg Bartunov, Teodor Sigaev

& | $ $ B C A $ B F $

Regular tree

Phrase top D

Phrase tree

PGCon, Ottawa, May 21-22,2009


Phrase search - push down
a $ (b&c) => (a$b) & (a$c) (a&b) $ c => (a$c) & (b$c) a $ (b|c) => (a$b) | (a$c) (a|b) $ c => (a$c) | (b$c) a $ !b => a & !(a$b) there is no position of A followed by B !a $ b => b & !(a$b) there is no position of B precedenced by A

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Phrase search - transformation

# select '( A | B ) $ ( D | C )'::tsquery; tsquery ----------------------------------------------'A' $ 'D' | 'B' $ 'D' | 'A' $ 'C' | 'B' $ 'C' # select 'A $ ( B & ( C | ! D ) )'::tsquery; tsquery -----------------------------------------------------( 'A' $ 'B' ) & ( 'A' $ 'C' | 'A' & !( 'A' $ 'D' ) )

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Phrase search - example

'PostgreSQL can be extended by the user in many ways' ->
# select phraseto_tsquery('PostgreSQL can be extended by the user in many ways'); phraseto_tsquery --------------------------------------------------------------------'postgresql' $[3] ( 'extend' $[3] ( 'user' $[2] ( 'mani' $ 'way' ) ) )

Can be written by hand:
'postgresql' $[3] extend $[6] user $[8] mani $[9] way

Difficult to modify, use phraseto_tsquery() function !

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Phrase search - TODO
Ranking functions Headline generation Rewrite subsystem Concatenation of two tsquery by $ operation: $$ ? - like other concatenations: &&, || and !! - distance $$[2] !, functional interface ? Need testing for agglutinative languages (norwegian, german, etc)

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Talk roadmap



Full-text search introduction Main topics
­ ­



Phrase Search Dictionaries API



New features (already in 8.4) Future features Tips and Tricks





Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Dictionaries
Lexeme's type asciiword int float # \dF+ Text se Parser: T ------asciih asciiw email file .... eng arc "p oke --wor ord lis hc g_c n --d Dict #1 Dict #2 Dict #N synonym en_ispell en_stem simple real_dict rat .de Dic --eng eng sim sim ion fau tio --lis lis ple ple "p lt" nar --h_s h_s g_catalog.english" ies ---tem tem

h onfigu atalog | ----+| | | |

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Dictionaries - examples

Integers

'123456789' -> '123456' 'XIX' -> '19'

Roman numbers

Colours

'FFFFFF' -> 'white' H(\s|-)?(alpha|beta|gamma) h$2 -- spectral lines of hydrogen
PGCon, Ottawa, May 21-22,2009

Regexp

Oleg Bartunov, Teodor Sigaev


Dictionaries - interface
void* dictInit(List *dictoptions) - list of dictoptions actually contains list of DefElem structures (see headers) - returns pointer to the palloc'ed dictionary structure - Can be expensive (ispell) TSLexeme* dictLexize( void* dictData, // returned by dictInit() char* lexeme, // not zero-terminated int lenlexeme, DictSubState *substate // optional );
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Dictionaries ­ output
typedef struct { uint16 nvariant; // optional uint16 flags; // optional char *lexeme; } TSLexeme; dictLexize returns NULL ­ dictionary doesn't recognize the lexeme dictLexize returns array of TSLexeme (last element TSLexeme->lexeme is NULL) dictLexize returns empty array ­ dictionary recognizes the lexeme, but it's a stop-word
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Dictionaries - output

SELECT ts_lexize('en_ispell','bookings'); TSLexeme array: # nvariant flags 0 0 0 1 0 0 2 0 0

lexeme booking book NULL

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Agglutinative Languages

German, norwegian, ... http://en.wikipedia.org/wiki/Agglutinative_language

Concatenation of words without space Query - Fotballklubber Document - Klubb on fotballfield How to find document ? Split words and build search query 'fotbalklubber' => ' ( fotball & klubb ) | ( fot & ball & klubb ) '
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Dictionaries - TSLexeme->nvariant
Agglutinative languages have several variants of word's splitting: Word 'foobarcom' (imaginary)
Lexeme foo bar com foob rcom nvariant 1 1 1 2 2 Comment

-a- is an affix (interfix)

tsvector: 'bar:1 com:1 foo:1 foob:1 rcom:1' tsquery: '(foob & rcom) | (foo & bar & com)'

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Dictionaries - output

Each TSLexeme describes one normalized lexeme TSLexeme->flags is an OR-ed: - TSL_PREFIX indicates to use prefix search for this lexeme Note: dictionaries are planned for 8.5 - TSL_ADDPOS points to parser to increase position's counter Note: currently only thesaurus dictionary uses it

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Dictionaries ­ output (imaginary)
Result of 'foobar' Variant #1 lexeme: foo flags: 0 nvariant: 1 lexeme: bar flags: TSL_PREFIX nvariant: 1 Lexeme: oobar Flags: TSL_PREFIX | TSL_ADDPOS nvariant:2 Variant #2

Lexeme: first Flags: 0 nvariant:2

tsvector: 'foo:1 bar:1 first:1 oobar:2' tsquery: '(foo & bar:*) | (first & oobar:*)'
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Dictionaries - several words

typedef struct { bool

// // bool getnext; // // void *private; // // // } DictSubState;

isend;

in: marks end of text (input lexeme is invalid!) out: dictionary asks for a next lexeme internal state of dictionary while it's asking a next lexeme

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Dictionaries ­ several words
Current lexeme Set next lexeme as current lexeme dictLexize Y Result? getnext? Save stored result and go to next lexeme Y Has stored result? N Go to the next dictionary with current lexeme
PGCon, Ottawa, May 21-22,2009

Store result and current lexeme
Oleg Bartunov, Teodor Sigaev

N

Y

N


Dictionaries ­ filter for 8.5

New TSLexeme->flags: TSL_FILTER If dictionary returns only one lexeme with TSL_FILTER flag, then that lexeme will be used as an input for the subsequent dictionaries in the chain.

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Filter dictionary ­ unaccent (8.5)
contrib/unaccent provides unaccent text search dictionary and function to remove accents (suffix tree, ~ 25x faster translate() solution)
1. Unaccent dictionary does nothing and returns NULL. (lexeme 'Hotels' will be passed to the next dictionary if any) =# select ts_lexize('unaccent','Hotels') is NULL; ?column? ---------t 2. Unaccent dictionary removes accent and returns 'Hotel'. (lexeme 'Hotel' will be passed to the next dictionary if any) =# select ts_lexize('unaccent','HÒtel'); ts_lexize ---------{Hotel}

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Filter dictionary - unaccent
CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french ); ALTER TEXT SEARCH CONFIGURATION fr ALTER MAPPING FOR hword, hword_part, word WITH unaccent, french_stem; =# select to_tsvector('fr','HÒtel de la Mer') @@ to_tsquery('fr','Hotels'); ?column? ---------t Finally, unaccent dictionary solves the known problem with headline ! ( to_tsvector(remove_accent(document)) works with search, but

has problem with highlighting )

=# select ts_headline('fr','HÒtel de la Mer',to_tsquery('fr','Hotels')); ts_headline -----------------------HÒtel de la Mer

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Talk roadmap



Full-text search introduction Main topics
­ ­



Phrase Search Dictionaries API



New features (already in 8.4) Future features Tips and Tricks





Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


New features and improvements
ts_headline() enhancements (8.4) Prefix full-text search support (8.4) Devanagari script support (8.4) dict_xsyn improvement ts_stat() performance improvement (8.4) Fast approximated statistics (8.3,8.4) GIN improvements: fast update (8.4), partial match support (8.4), multicolumn (8.4) contrib/btree_gin (8.4)
PGCon, Ottawa, May 21-22,2009

















Oleg Bartunov, Teodor Sigaev


ts_headline enhancement



New parameter MaxFragments by Sushant Sinha. Default is 0, ts_headline() generates one fragment

=# select ts_headline($$ Text from http://www.postgresql.org/docs/8.3/static/history.html $$, plainto_tsquery('postgresql postgres '), 'MaxFragments=3, MinWords=3, MaxWords=6'); ts_headline ------------------------------------------------------------------------------------------------PostgreSQL is derived from the POSTGRES ... behind it, PostgreSQL ... PostgreSQL as "Postgres" (now rarely
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Prefix full-text search support



Prefix full-text search support
­

to_tsquery('supernov:*') - match all documents, which contains words with prefix 'supernov' to_tsquery('supernov:ab*') - the same,but only in titles (weight 'a') and keywords (weight 'b') Can use new GIN partial match feature to speedup search Can be useful if there is no stemmer available
PGCon, Ottawa, May 21-22,2009

­

­

­

Oleg Bartunov, Teodor Sigaev


Devanagari script support

Postg reSQL 8.3- has prob lem wi th De vanag ari s crip t (htt p://e n.wiki pedi a.org /wiki /Deva nagar i - sc ript for Hindi , Mar athi, Nep ali, Sanscr it,.. .).
select * from ts_parse('default',' 2

); '

12 2 12 2 12 2 12 2



Madan Puraskar Pustakalaya

Virama sign (modifier, suppresses inherent vowel ) ­ punct in np_NP locale. Breaks all parsers, which use locale.

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Devanagari script support
( Madan Puraskar Pustakalaya )
character 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 byte 0 3 6 9 10 13 16 19 22 25 28 31 34 35 38 41 44 47 50 53 56 59 UTF-32 00092E 000926 000928 000020 00092A 000941 000930 000938 00094D 000915 00093E 000930 000020 00092A 000941 000938 00094D 000924 000915 00093E 000932 00092F encoded as E0 A4 AE E0 A4 A6 E0 A4 A8 20 E0 A4 AA E0 A5 81 E0 A4 B0 E0 A4 B8 E0 A5 8D E0 A4 95 E0 A4 BE E0 A4 B0 20 E0 A4 AA E0 A5 81 E0 A4 B8 E0 A5 8D E0 A4 A4 E0 A4 95 E0 A4 BE E0 A4 B2 E0 A4 AF glyph name DEVANAGARI DEVANAGARI DEVANAGARI SPACE DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI SPACE DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI DEVANAGARI LETTER MA LETTER DA LETTER NA LETTER PA VOWEL SIGN U LETTER RA LETTER SA SIGN VIRAMA LETTER KA VOWEL SIGN AA LETTER RA LETTER PA VOWEL SIGN U LETTER SA SIGN VIRAMA LETTER TA LETTER KA VOWEL SIGN AA LETTER LA LETTER YA

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Devanagari script support
8.4 knows Virama signs =# select * from ts_parse('default', ' ); ' tokid | token -------+--------2 | 12 | 2 | 12 | 2 | (5 rows) Thanks to Dibyendra Hyoju and Bal Krishna Bal for testing and valuable discussion
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Devanagari script support



TODO
­ ­

Port stemmer for nepali to snowball Improve Hunspell support (recognize more flags in affix file)

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Synonym dictionary with prefix search support (8.5)
cat $SHAREDIR/tsearch_data/synonym_sample.syn postgres pgsql postgresql pgsql postgre pgsql gogle googl indices index* =# create text search dictionary syn ( template=synonym,synonyms='synonym_sample'); =# select ts_lexize('syn','indices'); ts_lexize -- - -- -- - -- {index}

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Synonym dictionary with prefix search support (8.5)
=# create text search configuration tst ( copy=simple); =# alter text search configuration tst alter mapping for asciiword with syn; =# select to_tsquery('tst','indices'); to_tsquery - - -- -- - -- - -'index':* =# select 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices'); ?column? - - -- -- - -- t
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


dict_xsyn improvement



How to search for 'William' and any synonyms 'Will', 'Bill', 'Billy' ? We can:
­ ­ ­

Index only synonyms Index synonyms and original name Index only original name - replace all synonyms. Index size is minimal, but search for specific name is impossible.

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


dict_xsyn improvement



Old version of dict_xsyn can return only list of synonyms. It's possible to prepare synonym file to support other options:
William Will Bill Billy Will William Bill Billy Bill William Will Billy Billy William Will Bill



New dict_xsyn (Sergey Karpov) allows better control:
C REATE TEXT SEARC H DI CTION ARY x syn (RU LES=' xsyn_s ample ', KE EPORIG =fa lse| true, mod e='SIM PLE|SY MMETR IC|MAP ');

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


dict_xsyn improvement



Mode SIMPLE - accepts the original word and returns all synonyms as OR-ed list. This is default mode. Mode SYMMETRIC - accepts the original word or any of its synonyms, and return all others as OR-ed list. Mode MAP - accepts any synonym and returns the original word.





Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


dict_xsyn improvement
EXAMPLES: =# ALTER TEXT SEARCH DICTIONARY xsyn (RULES='xsyn_sample', KEEPORIG=false, mode='SYMMETRIC'); =# select ts_lexize('xsyn','Will') as Will, ts_lexize('xsyn','Bill') as Bill, ts_lexize('xsyn','Billy') as Billy; will | bill | billy ----------------------+----------------------+--------------------{william,bill,billy} | {william,will,billy} | {william,will,bill} Mode='MAP' will | bill | billy -----------+-----------+----------{william} | {william} | {william} Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


ts_stat() performance !



ts_stat() function gathers words s from tsvectors ­ now uses binary of sorted arrays (probably, better rbtree to defense against skewed

tatistics tree instead to use data)

Dataset with geonames, total 5,793,013 rows with 2,404,197 unique names: =# select * into ts_stat2 from ts_stat('select fts from spots'); 8.3: 66405972.737 ms CVS HEAD: 25506.736 ms
Oleg Bartunov, Teodor Sigaev

2600x faster !

PGCon, Ottawa, May 21-22,2009


Fast approximated statistics



Gevel extension -- GiST/GIN indexes explorer
(http://www.sai.msu.su/~megera/wiki/Gevel)



Fast -- uses only GIN index (no table access) Approximated -- no table access, which contains visibility information, approx. for long posting lists For mostly read-only data error is small





Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Fast approximated statistics
Top-5 most frequent words (463,873 docs)
=# SELECT * FROM gin_stat('gin_idx') as t(word text, ndoc int) order by ndoc desc limit 5; word | ndoc --------+-------page | 340858 figur | 240366 use | 148022 model | 134442 result | 129010 (5 rows) Time: 520.714 ms



Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Fast approximated statistics


gin_stat() vs ts_stat()
=# select * into stat from ts_stat('select fts from papers') order by ndoc desc, nentry desc,word; ...wait.... 68704,182 ms =# SELECT a.word, b.ndoc as exact, a.estimation as estimation, round ( (a.estimation-b.ndoc)*100.0/a.estimation,2)||'%' as error FROM (SELECT * FROM gin_stat('gin_x_idx') as t(word text, estimation int) order by estimation desc limit 5 ) as a, stat b WHERE a.word = b.word; word | exact | estimation | error --------+--------+------------+------page | 340430 | 340858 | 0.13% figur | 240104 | 240366 | 0.11% use | 147132 | 148022 | 0.60% model | 133444 | 134442 | 0.74% result | 128977 | 129010 | 0.03% (5 rows) Time: 550.562 ms

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


GIN improvements



GIN fast update (8.4) GIN partial match support (8.4) GIN multicolumn index (8.4) contrib/btree_gin (8.4) ­ provides GIN operator classes, that implement B-tree for all data types. Useful to use with GIN multicolumn feature: CREATE index fts_idx ON papers USING gin(timestamp, fts_tsvector);
PGCon, Ottawa, May 21-22,2009







Oleg Bartunov, Teodor Sigaev


Talk roadmap



Full-text search introduction Main topics
­ ­



Phrase Search Dictionaries API



New features (already in 8.4) Future features Tips and Tricks





Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Future features
Red-Black tree experiment to replace binary tree in GIN ­ better defense against skewed data.

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Future features


Red-Black tree experiment to replace binary tree in GIN ­ better defense against skewed data ­ motivational example by Sergey Burladyan
http://archives.postgresql.org/pgsql-performance/2009-03/msg00340.php

create table a (i1 int, i2 int, i3 int, i4 int, i5 int, i6 int); insert into a select n, n, n, n, n, n from generate_series(1, 100000) as n; create index arr_gin on a using gin ( (array[i1, i2, i3, i4, i5, i6]) ); truncate a; drop index arr_gin ; create index arr_gin on a using gin ( (array[i1, i2, i3, i4, i5, i6]) ); insert into a select n, n, n, n, n, n from generate_series(1, 100000) as n;

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Red-Black Tree


8.3.5

­ binary tree



8.4beta1 - binary tree + limit 8.4beta1+Red-Black tree
(self-balancing binary search tree, the longest path from any node to a leaf is no more than twice the shortest path)



8.3.5 -- - -- -- - -- - -- -- -- - -- -- - -- - -- -- -- - -- -- - -index (bulk): 123276.419 index+insert: 3415.676
Oleg Bartunov, Teodor Sigaev

8.4beta1 8.4beta1+rbtree - -- -- -- - -- -- - -- - -- -- -- - -- -- - -- - -- -- -- - -- -- 2686.435 2075.634 2900.268 2708.512
PGCon, Ottawa, May 21-22,2009


Red-Black Tree
select array_to_string(ARRAY(select '' || c || '.' || b from generate_series(1,50) b), ' ')::tsvector AS i INTO a FROM generate_series(1,100000) c; create index arr_gin on a using gin (i); drop table a; create table a ( i tsvector); '2.1' '2.10' '2.11' '2.12' '2.13' '2.14' .... create index arr_gin on a using gin (i); insert into a select array_to_string(ARRAY(select '' || c || '.' || b from generate_series(1,50) b), ' ')::tsvector AS i FROM generate_series(1,100000) c;

'1.1' '1.10' '1.11' '1.12' '1.13' '1.14' ....

8.3.5 - -- -- - -- - -- -- - -- -- -- - -- - -- -- - -- -- -- - -- - -- index (bulk): inf (>1night) index+insert: 410300.855
Epaper archive:
Oleg Bartunov, Teodor Sigaev

8.4beta1 - - -- -- -- - -- - -- -- - -- 228564.291 314332.507
81714.308

8.4beta1+rbtree - -- - -- - -- -- - -- -- -- - -- - -- -- - -- -- 152569.763 251015.830
86312.517

PGCon, Ottawa, May 21-22,2009


Downloads (CVS HEAD)


Phrase search
­

http://www.sigaev.ru/misc/phrase_search-0.7.gz



Filter dictionary support
­
­

http://www.sigaev.ru/misc/filter_dict-0.2.gz
http://www.sigaev.ru/misc/unaccent-0.2.tar.gz http://www.sigaev.ru/misc/synonym_prefix.gz http://www.sigaev.ru/misc/rbtree-0.2.gz



Synonym dictionary with prefix search
­



Red-Black tree
­

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Polygons

A B A @> B = TRUE

B A

A && B = TRUE

http://www.sigaev.ru/misc/polygon-0.1.gz
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Talk roadmap



Full-text search introduction Main topics
­ ­



Phrase Search Dictionaries API



New features (already in 8.4) Future features Tips and Tricks





Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Full-text search tips



Aggregate for tsvector Stable to_tsquery Find documents with specific token type Getting words from tsvector Confuse with text search









Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Aggregate for tsvector

CREATE AGGREGATE tsvector_sum(tsvector) ( SFUNC = tsvector_concat, STYPE = tsvector, INITCOND = '' ); =# SELECT tsvector_sum( t.fts) FROM ( select ('1 2 ' || generate_series(3,10,1))::tsvector AS fts ) AS t; tsvector_sum -- - -- -- - -- - --- - -- - -- -- - -- - --- - -- - -- -- - -- - '1' '2' '3' '4' '5' '6' '7' '8' '9' '10'

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Stable to_tsquery
Result of to_tsquery() can't be used as a cache key, since ts_query() does preserve an order, which isn't good for cacheing. Little function helps:
C REAT R ETUR $$ SE $$ L ANGU E OR RE PLACE FUN CTION stab le_ts_ query (tsqu ery) NS tsqu ery A S L ECT t s_rewr ite( $1 , 'dum my_wo rd', 'dummy _wor d'); AG E SQL RETU RNS NULL ON NU LL INP UT IM MUTAB LE;

Note: Remember about text search configuraton to have really good cache key !
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Find documents with specific token type
How to find documents, which contain emails ?
C REATE OR R EPLA CE FUN CTION docu ment_t oken _typ es(tex t) R ETURNS _te xt A S $$ S ELECT AR SEL EC D FRO M t t WHE RE t ); $ $ LANG UA RA Y ( T IS TINC T ali as s_ toke n_typ e('de fault' ) AS tt, s_ pars e('de fault ', $1) AS t p t. toki d = t p.tok id GE SQL immut able;
PGCon, Ottawa, May 21-22,2009

Oleg Bartunov, Teodor Sigaev


Find documents with specific token type
= # SE LECT docum ent_to k L IMIT 1 0; d - ---- ----- ----- ------ {asc iihw ord,as ciiwo rd {asc iiwo rd,bla nk} {asc iiwo rd,bla nk} {asc iiwo rd,bla nk} {asc iiwo rd,bla nk} {asc iiwo rd,bla nk,fl oa {asc iiwo rd,bla nk} {asc iihw ord,as ciiwo rd {asc iiwo rd,bla nk} {asc iiwo rd,bla nk} ( 10 r ows) en_t ypes( titl e) FR OM pa pers ocum ent_t oken _type s ---- ----- ---- ----- ----- ------ ----- ----- -,bl ank,hw ord_ asci ipart}

t,h ost} ,bl ank,hw ord_ asci ipart, int,n umwor d,uint }

C REAT E IND EX ft s_type s_idx ON p aper s USI NG gin ( doc umen t_tok en_typ es (t itle) );
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Find documents with specific token type

How to find documents, which contain emails ?
S ELECT comme nt FR OM pap ers W HERE d ocume nt_to ken_ty pes( titl e) && '{ema il}';

The list of available token types:
S ELECT * FRO M ts_ token_ type ('de fault' );

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009


Getting words from tsvector
C O R S ION ts_s ent ry in $$ ' | | quo || $ $ LANG UAGE SQL R ETURN S NU LL ON S ELECT id | - ---+1| 1| 1| 1| 1| 1| 1| R U E E E T T L AT n UR EC E d N T OR oc i S SE ts_ R n T s E t O t P e F a LAC ger re t(' E , c S F O or EL U U d E NC T Tn AS CT t at(ts vecto r, OUT wor d tex t, teger) te_lit eral( $1:: text ) '::ts vector '); NULL INPUT IMMU TABLE ;

id, ( ts_st at(ft s)). * FRO M apod WHER E id= 1; wo rd | ndo c | nentr y - ----- ----- +---- --+- ----- -1 | 1| 1 2 | 1| 2 io | 1| 2 m ay | 1| 1 n ew | 1| 1 r ed | 1| 1 t wo | 1| 1
PGCon, Ottawa, May 21-22,2009

Oleg Bartunov, Teodor Sigaev


Confuse with text search
One expected true here, but result is disappointing false =# select to_tsquery('ob_1','inferences') @@ to_tsvector('ob_1','inference'); ?column? - -- -- -- - -f

Use ts_debug() to understand the problem
'in feren ces': {fr ench_ ispell ,fren ch_s tem} | fre nch_s tem | {in fere nt}

'in feren ce': {fr ench_ ispell ,fren ch_s tem} | fre nch_i spell | {in fere nce}
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009


Confuse with text search



Use synonym dictionary as a first dictionary {synonym,french_ispell,french_stem} with rule 'inferences inference'
­

Don't forget to reindex ! Don't need to reindex



Use ts_rewrite()
­

Oleg Bartunov, Teodor Sigaev

PGCon, Ottawa, May 21-22,2009




Our work was supported by
­ ­ ­

Russian Foundation for Basic Research EnterpriseDB jfg://networks

THANKS !
Oleg Bartunov, Teodor Sigaev PGCon, Ottawa, May 21-22,2009