Tsearch V2 compound words

Support for compound words

This work was sponsored by ABC Startsiden. The documentation was written by Henning Spjelkavik.

Tsearch2 has support for splitting compound words based on an ispell dictionary with appropriate tagging of the words.

This documentation assumes that the reader has knowledge of general Tsearch2 configuration. We'll just elaborate on the modifications necessary to get support for compound words.

DICTIONARIES

Our test case is the Norwegian language, with the myspell dictionaries from the OpenOffice project. The dictionaries have to be converted from myspell to ispell source files; the provided utility my2ispell handles this. Alternatively, you may use the original Norwegian ispell dictionary.

The words in your .dict file must be marked with a special flag to be able to participate in a compound word. This flag is declared in the .aff file with the compoundwords controlled statement. In the OpenOffice dictionaries, the flag is z.
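
For illustration only (these lines are not verbatim from the Norwegian files), the .aff file declares which flag marks compound-capable words, and each compound-capable entry in the .dict file carries that flag:

  In the .aff file:

    compoundwords controlled z

  In the .dict file (hypothetical entries; real entries usually carry additional affix flags):

    buljong/z
    terning/z
    fabrikk/z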

To test this module without changing your default dictionary settings, we will use the core functions of Tsearch2 with the optional dictionary or configuration name. That's why the first argument of most functions below is either norwegian_ispell, norwegian_snowball or default_norwegian.

INSTALLATION

  • Install tsearch2
    • gmake && gmake install
    • psql DB < tsearch2.sql

  • Download and install an appropriate dictionary
  • Convert OpenOffice dictionary with my2ispell
  • Configure database:

  insert into pg_ts_cfg values ('default_norwegian', 'default', 
                                'no_NO.ISO8859-1');
  insert into pg_ts_dict values (
         'norwegian_ispell',
         (select dict_init from pg_ts_dict where dict_name='ispell_template'),
         'DictFile="/usr/local/share/ispell/norsk.dict" ,'
         'AffFile ="/usr/local/share/ispell/norsk.aff"',
        (select dict_lexize from pg_ts_dict where dict_name='ispell_template'),
        'Norwegian ISpell dictionary'
  );   
  insert into pg_ts_cfgmap (select 'default_norwegian', tok_alias, dict_name 
                            from pg_ts_cfgmap where ts_name='default_russian');
  update pg_ts_cfgmap set dict_name='{norwegian_ispell}' 
        where ts_name='default_norwegian' and (dict_name='{ru_stem}' or 
              dict_name='{en_stem}');  
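
To verify the new configuration, you can inspect the tsearch2 catalog tables directly (a quick sanity check, not part of the original instructions; the exact rows returned depend on your installation):

  select * from pg_ts_cfg where ts_name='default_norwegian';
  select * from pg_ts_cfgmap where ts_name='default_norwegian';
  select dict_name, dict_comment from pg_ts_dict where dict_name='norwegian_ispell';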

SOME EXAMPLES

# select lexize('norwegian_ispell','overbuljongterningpakkmesterassistent');
 {over,buljong,terning,pakk,mester,assistent}

# select lexize('norwegian_ispell','politimester');
 {politimester,politi,mester,mest}

# select lexize('norwegian_ispell','sjokoladefabrikk');
 {sjokoladefabrikk,sjokolade,fabrikk}

# select to_tsvector('default_norwegian','Overbuljongterningpakkmesterassistenten gikk en
tur til sjokoladefabrikken for å kjøpe overtrekksgrilldresser');
'å':7 'en':3 'til':5 'tur':4 'gikk':2 'kjøp':8 'over':1 'pakk':1 'dress':9 'grill':9 'kjøpe':8
'mester':1 'buljong':1 'fabrikk':6 'terning':1 'assistent':1 'overtrekk':9 'sjokolade':6 'sjokoladefabrikk':6

STOP WORDS

When indexing text you might want to skip the most common words, because they are present in almost every text and will most likely not add any value to the search result. After a careful analysis of your target language, you'll want to add a stop word list.

We modified the snowball stop word list, and installed it as /usr/local/pgsql/share/contrib/norsk.stop.
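
The file format is simply one stop word per line. An illustrative excerpt (a handful of common Norwegian function words, not the complete list):

  og
  det
  en
  til
  å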

Let's update our configuration of norwegian_ispell with a StopFile directive.

update pg_ts_dict set dict_initoption=dict_initoption||',
StopFile="/usr/local/pgsql/share/contrib/norsk.stop"' 
where dict_name='norwegian_ispell';

(Remember to start a new psql session to pick up the changes.)
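
To confirm that the option string now contains the StopFile entry, you can query pg_ts_dict again (a sanity check, not part of the original instructions):

# select dict_initoption from pg_ts_dict where dict_name='norwegian_ispell';
 DictFile="/usr/local/share/ispell/norsk.dict" , AffFile ="/usr/local/share/ispell/norsk.aff",
 StopFile="/usr/local/pgsql/share/contrib/norsk.stop"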

The common words å, til and en are removed from the vector:

# select to_tsvector('default_norwegian','Overbuljongterningpakkmesterassistenten gikk en
tur til sjokoladefabrikken for å kjøpe overtrekksgrilldresser');
 'tur':4 'gikk':2 'kjøp':8 'pakk':1 'dress':9 'grill':9 'kjøpe':8 'mester':1 'buljong':1 'fabrikk':6
'terning':1 'assistent':1 'overtrekk':9 'sjokolade':6 'sjokoladefabrikk':6
(1 row)

Let's examine what kind of compound words we can now successfully split:

-- Here we have juxtaposition of two stems (i.e. placed side by side)
# select lexize('norwegian_ispell','telefonsvarer');
 {telefonsvarer,telefon,svar}
# select lexize('norwegian_ispell','bokhylle');
 {bokhylle,bok,hylle}
# select lexize('norwegian_ispell','kjøpesenter');
 {kjøpesenter,kjøpe,senter}

-- ...epenthetic 's' and 'e'
# select lexize('norwegian_ispell','lørdagsrevyen');
 {lørdagsrevyen,lørdag,revy}
# select lexize('norwegian_ispell','barnetrygden');
 {barnetrygd,barn,trygde}

-- ... epenthetic 'ings' (probably)
# select lexize('norwegian_ispell','kvalifiseringsspill');
 {kvalifiseringsspill,kvalifisere,spill}

-- ...and finally a recursive split using every method known
# select lexize('norwegian_ispell','treningslære');
 {treningslære,trening,trene,lære,lær,trening,trene,lære,lær}

INSTALLATION OF SNOWBALL STEMMER

To strip the different endings of the same word, and thus index and search on the linguistic root of the word, we can add stemming.

  • Install Norwegian stemmer
    • Get the stemmer sources from the Snowball site and put them in tsearch2/gendict
    • cd tsearch2/gendict
    • ./config.sh -n norwegian_snowball -s -p norwegian -v -C 'Norwegian Snowball dictionary'
    • cd ../../dict_norwegian_snowball
    • gmake && gmake install
    • psql DB < dict_norwegian_snowball.sql

Update your Tsearch2 configuration to use both the ispell dictionary and the snowball stemmer by default.

  update pg_ts_cfgmap set dict_name='{norwegian_ispell,norwegian_snowball}' 
        where ts_name='default_norwegian' and (dict_name='{ru_stem}' or 
              dict_name='{en_stem}' or dict_name='{norwegian_ispell}');  
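
When several dictionaries are listed for a token type, tsearch2 consults them in order, so the stemmer only handles words the ispell dictionary does not recognize. To see the stemmer on its own (the output is indicative and depends on the stemmer version):

# select lexize('norwegian_snowball','fabrikken');
 {fabrikk}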

HOW TO CHANGE YOUR DEFAULT

-- Setting the default configuration (for to_tsvector and to_tsquery) for this session:
# select set_curcfg('default_norwegian');

-- Setting default dictionary (for lexize) for this session:
# select set_dictcfg('norwegian_ispell');
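
With these defaults in place, the one-argument forms of the functions use them implicitly. A short sketch (the output shown is indicative, matching the two-argument calls above):

# select lexize('sjokoladefabrikk');
 {sjokoladefabrikk,sjokolade,fabrikk}
# select to_tsvector('sjokoladefabrikken');
 'fabrikk':1 'sjokolade':1 'sjokoladefabrikk':1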

COMPOUND WORDS WITH THE INTEGRATED TEXT SEARCH (PostgreSQL 8.3)

With the text search functionality integrated into PostgreSQL 8.3, the same compound word support is configured with CREATE TEXT SEARCH DICTIONARY and CREATE TEXT SEARCH CONFIGURATION:

=# CREATE TEXT SEARCH DICTIONARY nb_no_ispell ( TEMPLATE = ispell,
DictFile = nb_no, AffFile = nb_no );
=# select ts_lexize('nb_no_ispell', 'telefonsvarer');
          ts_lexize
------------------------------
 {telefonsvarer,telefon,svar}
=# CREATE TEXT SEARCH CONFIGURATION public.no ( COPY=pg_catalog.norwegian);
=# ALTER TEXT SEARCH CONFIGURATION no ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part WITH nb_no_ispell, norwegian_stem;

=# select to_tsquery('no','telefonsvarer & device');
                     to_tsquery
----------------------------------------------------
 ( 'telefonsvarer' | 'telefon' & 'svar' ) & 'devic'
=# select to_tsvector('no','telefonsvarer  device');
                   to_tsvector
--------------------------------------------------
 'devic':2 'svar':1 'telefon':1 'telefonsvarer':1

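As a follow-up sketch (not part of the original walkthrough; the table and column names are hypothetical), the 'no' configuration can back a full text index, and a query for the parts of a compound will match a document that only contains the compound itself:

=# CREATE TABLE docs (id serial PRIMARY KEY, body text);
=# CREATE INDEX docs_fts_idx ON docs USING gin (to_tsvector('no', body));
=# INSERT INTO docs (body) VALUES ('telefonsvarer');
-- matches the row above, because both 'telefon' and 'svar' were indexed
=# SELECT id FROM docs WHERE to_tsvector('no', body) @@ to_tsquery('no', 'telefon & svar');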