Document retrieved from a search engine cache. Original document address: http://www.sai.msu.su/~megera/wiki/OverBlogFts
Date modified: Unknown. Date indexed: Sun Apr 10 16:09:50 2016.
There is a large, dynamic store of documents in an external database that needs to be indexed and searched. The challenge is to use the full-text search extension to PostgreSQL (contrib/tsearch2) as a scalable solution. The database holds about 5 million documents, and about 20,000 new documents arrive every day. The system is required to serve about 100,000 search requests per day, with the ability to use sophisticated ranking based on various factors.
To meet this requirement we designed a search daemon that accepts queries, passes them to the PostgreSQL database, and stores the results in a cache organized as an LRU buffer in shared memory. Clients and the search daemon communicate according to the described API. The search daemon was implemented as an fcgi program (in C), invoked by the lighttpd server, which is responsible for communication with clients. We developed GIN (Generalized Inverted Index) to scale the tsearch2 module.

Testing of the system was performed on two (actually, three) machines using parallel tester scripts. Input queries were randomly chosen from the ranked list of words collected from all documents (about 50,000 unique words, each occurring in more than 100 documents). We tested direct full-text searches in the database as well as indirect searches through the search daemon. Unfortunately, we were not able to obtain real-life query statistics to simulate a more realistic workload. On a server with 8 GB of RAM (3 GB for the PostgreSQL buffer) we were able to serve about 1 million requests per day!

The rule of thumb for choosing a good database server is: more RAM for both the database and the operating system, and a good RAID (RAID 10) for disk storage. RAM is used to cache disk blocks, which greatly increases performance.
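The caching idea behind the search daemon can be illustrated with a minimal sketch. This is not the daemon's actual code: the real implementation is a C fcgi program with its LRU buffer in shared memory, while the sketch below is a single-process Python illustration. The class name `LRUQueryCache`, the `search` callback, and the result type (a list of document ids) are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, List


class LRUQueryCache:
    """Cache search results per query, evicting the least recently used one."""

    def __init__(self, capacity: int, search: Callable[[str], List[int]]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Hypothetical backend call, e.g. a full-text query against the database.
        self.search = search
        # Maps query -> results; the least recently used entry comes first.
        self.entries: "OrderedDict[str, List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query: str) -> List[int]:
        if query in self.entries:
            # Cache hit: mark the query as most recently used.
            self.entries.move_to_end(query)
            self.hits += 1
            return self.entries[query]
        # Cache miss: ask the backend and remember the answer.
        self.misses += 1
        results = self.search(query)
        self.entries[query] = results
        if len(self.entries) > self.capacity:
            # Drop the least recently used query to stay within capacity.
            self.entries.popitem(last=False)
        return results
```

A caller would wrap whatever function runs the database search, for example `cache = LRUQueryCache(10000, run_fts_query)` with a hypothetical `run_fts_query`, and then call `cache.get(word)` for every incoming request. Repeated queries are then answered from memory without touching the database, which is the effect the daemon's shared-memory buffer is meant to achieve.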
Base directory is /home/megera/app/over
Several things to know: