Zen: NewExtentsBasedRanking

Extents based ranking for tsearch2

Extent is a shortest and non-nested sequence of words, which satisfy a query.
weight of an extent is inversely proportional to the number of non-query words.
query is a sequence of words and boolean operations, for example, 'a & b'.

We don't consider normalization of document weight to 1, since we have no global information about document collection.

Designations

w{i}, len{i} - weight and length of i-th extent
lenq - query length
C{k} - coefficient, which describes an importance of various document parts (0<=C{k}<=1). Tsearch2 supports 4 parts.
Next - the number of unique extents.
freq{i} - frequency of i-th extent

We could consider strict and soft uniqueness of an extent, depending on how do we consider the order of words in extent.

The procedure

1. For each i-th extent calculate weight

w{i} = Cpos * 1/(1+#non-query words)

We could soften a bit this formulae using parameter k, which is the number of words we don't take into account. Cpos is a harmonical mean of weights of words consisting i-th extent.

Cpos takes into account that extent could consists of words with different weights.

Example: For document

<a>a b</a><b>c d e f</b><c>a i t</c>

and query 'b & d & e & i' we have extent 'b c d e f a i', which crosses 3 parts (1a4b2c). For C(a)=1; C(b)=0.5; C(c)=0.2

Cpos = 7/(1/1+4/0.5+2/0.2) = 0.37

If no information is available about non-query words (for performance reason, for example), then it's possible to use only query words. For the example above, we approximate extent 1a4b2c by 1a2b1c, so CposApprox = 4/(1/1+2/0.5+1/0.2) = 0.36

Try to take into account extent's frequency:

W{i} = SUM(j=1,freq{i})[ w{j}/j^2 ]  (1)

Notice, w{j} is a sorted array and max(W{i})= Pi^2/6; If information about extents frequency in a document is expensive to get, then we could skip this step and consider all extents as unique and use W{i}=w{i}.

For one-term query we know frequency directly from tsvector and we could collapse all equal extents to one using (1).

2. Document weight

W = SUM(1,Next)W{i}

Maximal value is known, WMAX = Next*(Pi^2/6) and it's possible to normalize it, so that WMAX=1, but it's a bad practice.

Summary

W = SUM(i=1,Next)[Cpos{i}/Nonqw{i}], here Nonqw{i} is the number of non-query words in i-th extent.

Cpos{i}=len{i}/SUM(k=1,len{i})[l{k}/C{k}] or, we could use CposApprox{i} = lenq/SUM(k=1,lenq)[l{k}/C{k}]

For homogeneous both extents and document (all C{k}=Cdoc)

W = Cdoc/SUM(i=1,Next)[len{i} - lenq + 1]

For one-term query we have W1q=Cdoc*Next

It's clear, that we should apply document length normalization, because Next ~ DocLength (see below).

Document length normalization

We should apply some normalization technique to take into account that longer document usually have more extents than short documents, i.e., more repeated extents (frequency) and more different extents. Simple technique is to divide document weight by ratio (or logarithm of ration) of document length to average length of a document in the collection. Document length could be counted in terms of words (all words or unique words) or in bytes. Unfortunately, we don't know global statistics, for example, an average length of documents in the collection, so we could use only local (document wide) statistics.

W = W/(1+log(NUW)).

NUW - is the number of unique words in a document and easily obtained from tsvector.

Taking into account extents density

Under all equal conditions we prefer documents which distribution of extents is more concentrated.

Density distribution could be approximated by Dmean - harmonical mean of all distances between extents.

coordinates: 1 2 3 4 5 6 500
distances: 1,1,1,1,1,494
mean=83.1666666666667
geom=2.81160619455686
harm=1.19951436665318
median=1

Cden = 1/(1+log(Dmean)) - correction for density is a global parameter, so it can be applied to the final document weight.

W = W*Cden.

rank_cd interface

Current interface of rank_cd:

rank_cd('{0.1, 0.2, 0.4, 1.0}',tsvector,tsquery,normalization)

normalization is used to select normalization method(s) and is OR-ed value of following methods:

1 - 1+log (document length)
2 - document length
4 - mean harmonic distance between extents
8 - the number of unique words in document
16 - 1+log (the number of unique words in document)

There is no normalization by default !

Example - normalize document rank by logarithm of document length and take into account extents density.

rank_cd('{0.1, 0.2, 0.4, 1.0}',tsvector,tsquery,1|4)

References

"Relevance Ranking for One to Three Term Queries", 2000, Charles Clarke, Gordon Cormack, Elizabeth Tudhope ( full-text )
"Shortest-Substring Retrieval and Ranking", 2000, Charles Clarke, Gordon Cormack, (full-text )

lyrical digression

What hammer is better ?

  A                         B
    EEEEEEEEEEEEEE            EEEEE
          D                     D
          D                     D
                                D
                                D

Explanation:

EE…EE - denotes extents based rank
DD…DD - rank based on other feature (density or distance from the beginning of document)

NewExtentsBasedRanking

Projects

Links