Документ взят из кэша поисковой машины. Адрес
оригинального документа
: http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html
Дата изменения: Unknown Дата индексирования: Sat Dec 22 10:02:51 2007 Кодировка: Поисковые слова: п п п р п р п р п р п р п р п р п р п р п р п р п р п р п |
This how-to was written by Valli.First of all, let's give a look at the table pg_ts_parser:
CREATE TABLE pg_ts_parser ( prs_name text not null primary key, prs_start regprocedure not null, prs_nexttoken regprocedure not null, prs_end regprocedure not null, prs_headline regprocedure not null, prs_lextype regprocedure not null, prs_comment text ) with oids;If we want to create a new parser, we have to insert a new record into this table. We could do it in the following way.
INSERT INTO pg_ts_parser SELECT 'testparser', 'testprs_start(internal,int4)', 'testprs_getlexeme(internal,internal,internal)', 'testprs_end(internal)', 'testprs_headline(internal,internal,internal)', 'testprs_lextype(internal)', 'Testparser v0.03' ;As you see we have to create the following procedures before:
Desc: Initialises the parser. Interface: 1. Argument: C-Type: (char *) (IN) Desc: Pointer to the text which we parse 2. Argument: C-type: (int) (IN) Desc: length of the text Return value: PG_RETURN_POINTER(pst) where pst is a pointer to our selfdefined struct, which contains the parser state (let's name it ParserState, so pst is of the type (ParserState *))
Desc: Returns the next token. This procedure will be called so long as the procedure return type=0 Interface: 1. Argument. C-Type: (ParserState *) (INOUT) 2. Argument: C-Type: (char **) (OUT) Desc: token text 3. Argument: C-type: (int *) (OUT) Desc: length of the token text Return value: PG_RETURN_INT32(type); Desc: token type
Desc: Will be called after parsing is finished. We have to free our allocated resources in this procedure (ParserState). Interface: 1. Argument. C-Type: (ParserState *) (INOUT) Return value: PG_RETURN_VOID();
Desc: Generates headline. Interface: 1. Argument. C-Type: (HLPRSTEXT *) (INOUT) Desc: the parsed text (done by hlparsetext) 2. Argument: C-Type: (text *) (IN) Desc: the options string of the headline command 3. Argument: C-type: (QUERYTYPE *) (IN) Desc: the query (usually done by to_tsquery) Return value: PG_RETURN_POINTER(prs) where prs is of C-type (HLPRSTEXT *)
Desc: Returns an array containing the id, alias and the description of the tokens of our parser. Interface: 1. Argument. C-Type: ??? unused ??? Return value: PG_RETURN_POINTER(descr) where descr is of C-type (LexDescr *)
HLPRSTEXT is defined in ts_cfg.h (a tsearch2 headerfile) QUERYTYPE is defined in query.h (a tsearch2 headerfile) LexDescr is defined in wparser.h (a tsearch2 headerfile) ParserState is defined by ourselves text is defined c.h (it's a postgres type; will be included by postgres.h)So let's do a simple example which recognises space delimited words. It has only two types (3, word, Word; 12, blank, Space symbols). We will use the headline function of Teodor's default parser, which works very well for almost every parser. We can do this, because the headline function calls our parsing functions Actually this function is exactly that what we need for our example. If you want to write your own headline function, you'll find some notes from Teodor at the end.
-------------------------------------------------------------------------------- - test_parser.c -------------------------------------------------------------------------------- /* * testparser v0.02 */ #include#include #include #include "postgres.h" #include "utils/builtins.h" /* * types */ /* self-defined type */ typedef struct { char * buffer; /* text to parse */ int len; /* length of the text in buffer */ int pos; /* position of the parser */ } ParserState; /* copy-paste from wparser.h of tsearch2 */ typedef struct { int lexid; char *alias; char *descr; } LexDescr; /* * prototypes */ PG_FUNCTION_INFO_V1(testprs_start); Datum testprs_start(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(testprs_getlexeme); Datum testprs_getlexeme(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(testprs_end); Datum testprs_end(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(testprs_lextype); Datum testprs_lextype(PG_FUNCTION_ARGS); /* * functions */ Datum testprs_start(PG_FUNCTION_ARGS) { ParserState *pst = (ParserState *) palloc(sizeof(ParserState)); pst->buffer = (char *) PG_GETARG_POINTER(0); pst->len = PG_GETARG_INT32(1); pst->pos = 0; PG_RETURN_POINTER(pst); } Datum testprs_getlexeme(PG_FUNCTION_ARGS) { ParserState *pst = (ParserState *) PG_GETARG_POINTER(0); char **t = (char **) PG_GETARG_POINTER(1); int *tlen = (int *) PG_GETARG_POINTER(2); int type; *tlen = pst->pos; *t = pst->buffer + pst->pos; if ((pst->buffer)[pst->pos] == ' ') { /* blank type */ type = 12; /* go to the next non-white-space character */ while (((pst->buffer)[pst->pos] == ' ') && (pst->pos < pst->len)) { (pst->pos)++; } } else { /* word type */ type = 3; /* go to the next white-space character */ while (((pst->buffer)[pst->pos] != ' ') && (pst->pos < pst->len)) { (pst->pos)++; } } *tlen = pst->pos - *tlen; /* we are finished if (*tlen == 0) */ if (*tlen == 0) type=0; PG_RETURN_INT32(type); } Datum testprs_end(PG_FUNCTION_ARGS) { ParserState *pst = (ParserState *) PG_GETARG_POINTER(0); pfree(pst); PG_RETURN_VOID(); } Datum testprs_lextype(PG_FUNCTION_ARGS) { /* Remarks: - we have to return the blanks for headline reason - we use the same lexids like Teodor in the default word parser; in this way we can reuse the headline function of the default word parser. */ LexDescr *descr = (LexDescr *) palloc(sizeof(LexDescr) * (2+1)); /* there are only two types in this parser */ descr[0].lexid = 3; descr[0].alias = pstrdup("word"); descr[0].descr = pstrdup("Word"); descr[1].lexid = 12; descr[1].alias = pstrdup("blank"); descr[1].descr = pstrdup("Space symbols"); descr[2].lexid = 0; PG_RETURN_POINTER(descr); } -------------------------------------------------------------------------------- - Makefile -------------------------------------------------------------------------------- CC = gcc CFLAGS = -O3 -I. -I/usr/include/postgresql/server -Wall LIBS = DEPENDS = test_parser.o test_parser.so.0: ${DEPENDS} $(CC) $(CFLAGS) -shared -o $@ ${DEPENDS} ${LIBS} test_parser.o: test_parser.c $(CC) $(CFLAGS) -c $*.c clean: rm -f core core.* *.o *.a *~ *% *.so.* install: cp -f test_parser.so.0 /usr/lib/postgresql/ -------------------------------------------------------------------------------- - test_parser.sql -------------------------------------------------------------------------------- -- installing test parser CREATE FUNCTION testprs_start(internal,int4) RETURNS internal AS '$libdir/test_parser.so.0' LANGUAGE 'C'; CREATE FUNCTION testprs_getlexeme(internal,internal,internal) RETURNS int4 AS '$libdir/test_parser.so.0' LANGUAGE 'C'; CREATE FUNCTION testprs_end(internal) RETURNS void AS '$libdir/test_parser.so.0' LANGUAGE 'C'; CREATE FUNCTION testprs_lextype(internal) RETURNS internal AS '$libdir/test_parser.so.0' LANGUAGE 'C'; INSERT INTO pg_ts_parser SELECT 'testparser', 'testprs_start(internal,int4)', 'testprs_getlexeme(internal,internal,internal)', 'testprs_end(internal)', 'prsd_headline(internal,internal,internal)', 'testprs_lextype(internal)', 'Testparser v0.03' ; INSERT INTO pg_ts_cfg SELECT 'testcfg', 'testparser', NULL ; INSERT INTO pg_ts_cfgmap SELECT 'testcfg', 'word', '{simple}' ;
Place test_parser.c, Makefile and test_parser.sql in a directory (let's say $dir) $dir > vi Makefile -> change '/usr/lib/postgresql/' to your own $libdir -> change '/usr/include/postgresql/server' to the directory where postgres.h resides $dir > make $dir > make install (as root) $dir > psql -Upostgres yourdb yourdb=# \i $dir/test_parser.sqlnow it's ready to test it:
yourdb=# SELECT * FROM parse('testparser','That''s my first own parser 4 tsearch2!'); tokid | token -------+----------- 3 | That's 12 | 3 | my 12 | 3 | first 12 | 3 | own 12 | 3 | parser 12 | 3 | 4 12 | 3 | tsearch2! (13 rows) yourdb=# SELECT to_tsvector('testcfg','That''s my first own parser 4 tsearch2!'); to_tsvector --------------------------------------------------------------------- '4':6 'my':2 'own':4 'first':3 'parser':5 'that\'s':1 'tsearch2!':7 (1 row) yourdb=# SELECT headline('testcfg', 'That''s my first own parser 4 tsearch2!', to_tsquery('testcfg', 'that''s')); headline ----------------------------------------------- That's my first own parser 4 tsearch2! (1 row)
typedef struct { //lexeme length uint16 len; uint8 // lexeme should be selected (fooo) selected:1, //if 1, lexeme to be output in:1, //if 1 - skip. (this field will be erased in close future, //because it repeats 'in' field :) skip:1, //lexeme should be replaced to space replace:1, //the same lexeme as previous but linked to other ITEM in query. //it sets by hlparsetext repeated:1; uint8 type; //lexeme's type char *word; ITEM *item; //pointer to query part } HLWORD;