Äîêóìåíò âçÿò èç êýøà ïîèñêîâîé ìàøèíû. Àäðåñ îðèãèíàëüíîãî äîêóìåíòà : http://www.sai.msu.su/~megera/postgres/talks/algebra-fts.pdf
Äàòà èçìåíåíèÿ: Mon Jan 12 19:03:48 2009
Äàòà èíäåêñèðîâàíèÿ: Thu Jan 15 01:34:14 2009
Êîäèðîâêà:

Ïîèñêîâûå ñëîâà: m 2
Algebra for full-text queries
Oleg Bartunov, Teodor Sigaev, Mikhail Prokhorov Moscow University, Sternberg Astronomical Institute 2008


FTS features






Full integration with PostgreSQL 16 built-in configurations for 15 languages Support of user-defined FTS configurations Pluggable dictionaries (ispell, snowball, thesaurus ) and parsers Full multibyte support (UTF-8) Relevance ranking Two types of indexes ­ GiST and GIN, both supports concurrency and recovery Rich query language with query rewriting support


FTS in PostgreSQL
=# select 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat':: tsquery;

­ ts · · ·

vector ­ storage for document, optimized for search sorted array of lexemes positional information weights information · tsquery ­ textual data type for query · Boolean operators - & | ! () ­ FTS operator tsvector @@ tsquery


What is a Document ?


Any textual attributes or their combination Should have unique id Textual result of any SQL command


Can be virtual entity
id --Title -----Abstract ----------Keywords ------------Body

Did ---Aid ----

id -Author ---------

Title || Abstract || Keyw ords || Body || Author


(BNF)


http: //en .wiki pedia .org/ wiki/B ackus -Nau r_For m







::= { } | "'" { }"'" :: = | "\" "'" /* sing le qu ote * / | "'" "'" /* singl e qu ote */ | ::= < digit >{ } ::= A |B|C| D


- tsvector








::= { < spa | /* empty */ : := [ ":" :: = { " : [
ce > }

< posit ions > ] ," } ig ht> ]


- tsquery






::= | /* empty */ : := " | | "!" [: :: = "*" | {
&" " io io od

ex pres sion> | < expr ession > n n " )" i ie rs>]

< " > > f

e ight> }[ "*" ]


Examples


tsvector
'star':1B,3A,15,26B,32B,44 'supernovae':2A,25B,31B 'star':1B,3A,15,26B,32B,44 'supernovae':25B,31B



tsquery
'supernovae':A & 'star':A* (match 1st tsvector)


Functions


text -> tsvector


to_tsvector([conf], text) to_tsquery([conf],text) plainto_tsquery([conf],text)



text -> tsquery




Configuration (conf) defines rules for document->text (parser, dict<->word map)



word -> lexeme


ts_lexize(dict,text)


text -> tsquery






' & '::tsquery as is to_tsquery(' & ') => ' & ' plainto_tsquery(' ') => ' & ' &


DOCUMENT

doc->tsvector
FTS Configuration

PARSER (token, token_type) dicts(token_type) YES i= 0 ask DICT[i] NO YES IS ST OP ? YES NO tsvector i= i+ 1 i

QUERY Supernovae & stars

QPARSER QUERYTREE

text->tsquery
&

Foreach leaf node
PARSER

Supernovae

stars

(token, token_type)
dicts (token_type)

{supernova,sn} NO YES & I NO supernova sn

star

YES i=0
? DICT[i] i=i+1

i
YES YES
IS STOP ?

NO NO QUERYTREE

star

TSQUERY

(supernova | sn) & star


Dictionaries


Dictionary ­ returns an array of void array, NULL, if it's

is a program, which accepts token and lexemes, if it is known and not a stop-word if it is a stop-word unknown



API for developing specialized dictionaries Built-in dictionary-templates : ispell ( works with ispell, myspell, hunspell dicts ) snowball stemmer synonym, thesaurus simple


Dictionaries:examples


Intdict


'123456789' -> '123456' 'XIX' -> '19' 'FFFFFF' -> 'white' H(\s|-)?(alpha|beta|gamma) h$2 -- spectral lines



R o m an




Colours




Regexp






void* ini t(Lis t *op tions )
TSLexeme* lexize(void *dict, char *lexeme, int32 lexeme_len, DictSubState *state)



. state ­ , (). NULL ­ .


TSLexeme


typedef struct { uint16 nvariant; uint16 flags; char lexeme; /* C-string */ } TSLexeme;



lexize() , NULL (TSLexeme->lexeme). - -.


Ispell



'' => ' | | | '
nvariant | lexeme ---------+---------1 | 2 | 3 | 4 |





­ (TSLexeme->nvariant),


Agglutinative languages
(, ...)


http://en.wikipedia.org/wiki/Agglutinative_language





Footbalklubber ­ Klubber on footbalfield ­ .


Ispell



'foobar' => 'foo & bar'
nvariant | lexeme ---------+---------1 | foo 1 | bar



­ , . .


Ispell






'footbalklubber' => ' ( football & klubber ) | ( foot & ball & clubber ) ' nvariant | lexeme ---------+---------1 | football 1 | klubber 2 | foot 2 | ball 2 | klubber , ­







TSLexeme->flags TSL_ADDPOS () TSL_PREFIX (8.4, )


Motivation for Algebra


Motivation for Algebra


Existing operators defined at document level
A OR B

A AN D B

A NOT B


Motivation for Algebra




Phrase Search requires operation at lexeme level -- operator BEFORE ($) Different semantics - A AND NOT B
Phrase search: «A $ X» (X is anything, except B) Phrase can be very complex Even simplest phrase can be transformed to complex expression.




to_tsquery('nb', 'telefonsvarer') => 'telefonsvarer' | 'telefon' & 'svar'


Motivation for Algebra


Phrase can be constructed programmatically, or manually ( SMTH::tsquery) «A $ ( B $ !(C $ !D))»





We need well-defined algebra for operations: & | ! $ Backward compatibility !


Motivation for Algebra


We consider generalized phrase a $[n] b



Operator BEFORE ($n) guarantee


order of operands -- a BEFORE b
Distance between operands, default is 1

a $[n] b == a & b & ( i,j : pos(b)i ­ pos(a)j = n)


Query: «black» $ («hole» | «nebulae») ==> «black» $ «hole» | «black» $ «nebulae»

|

&

&

&

&

«black»

«hole»

«black»

«nebulae» pos(«hole»)-pos(«nebulae»)=1

pos(«hole»)-pos(«black»)=1


Example
Query: «close» $ «galaxies» After dictionary: «close» $ («m33» | («andromeda» $ «nebulae» | («magellanic» & «clouds»)) Phrase: «close» $ «m33» | ( «close» $ («andromeda» $ «nebulae» )) | ( «magellanic» & «clouds» & ( «close» $ «magellanic» | «close» $ «clouds») )


| $ $

| «m33» &

«close» «close»

$

&

|

«nebulae»

«andromeda»

«magellanic»

«clouds»

$

$

«close»

«magellanic» «close» «clouds»


Phrase Search


Possible extensions






#[n] -- soft $[n], order doesn't important a<$[n] b -- at most n words between operands a $[n]> b -- at least n words between operands And so on ...