Äîêóìåíò âçÿò èç êýøà ïîèñêîâîé ìàøèíû. Àäðåñ îðèãèíàëüíîãî äîêóìåíòà : http://old.philol.msu.ru/~lex/khmelev/published/jql/khmelev.doc
Äàòà èçìåíåíèÿ: Thu Oct 17 00:00:00 2002
Äàòà èíäåêñèðîâàíèÿ: Mon Oct 1 22:11:14 2012
Êîäèðîâêà:

Disputed Authorship Resolution

through Using Relative Empirical Entropy

for Markov Chains of Letters

in Human Language Texts


D.V. Khmelev

Summary
The problem of disputed authorship resolution is solved here by the formal
analysis of texts. The method of the analysis is based on the Markov Model
for the sequence of letters in text. We assume that the frequencies of
letter pairs are very specific for an author. This assumption is checked in
the large statistical experiment which was carried out for 386 text samples
(stories, novels, and their combination) over stories and novels of 82
Russian fiction writers.


1. Introduction

First attempts in the search of specific quantitative structures for
individual author style took place in the beginning of the twentieth
century (Morozov, 1915). A famous mathematician Markov (1916) criticized
the method offered by N.A. Morozov. A.A. Markov argued that characteristics
of writer's style (individual word frequencies, e.g., the distribution of
negative word «ne») offered by Morozov (1915) are unstable. A.A. Markov had
given an example of good statistical approach (Markov, 1913). In this work
A.A. Markov studied the distribution of vowels and consonants among initial
20000 letters of «Evgenij Onegin». This paper (Markov, 1913) is the first
application of «events which are linked to the chain». In the last century
this object received the name «Markov chains». In linguistic studies this
object is said to be the Markov Model. Markov (1913) presents not only a
historical basis for the present article, but also a practical one.
Modern methods of disputed authorship resolution are reviewed by Milov
(1994) in Russia and Holmes (1998) in West. All methods described there has
a common disadvantage although there is the amazing variety of them. None
of these methods have been tested by use of large number of authors.
Moreover, the testing of most of these methods is impossible because they
have informal steps which need a human participation (for example, some of
them require manual separation of words into various classes). Clearly,
such an approach prevents checking of a large number of authors and,
indeed, all of the methods are used for analysis of a small number of
authors and for a limited number of texts.
Fomenko et al. (1996) have another approach. They select a simple
parameter of author's style: the fraction of functional words. They check
the stability of this fraction over large number of Russian writers. So
this fraction is specific enough for each author and is called by Fomenko
et al. (1996) an author's invariant.
Another method of disputed authorship resolution is offered in the
present work. This method is totally independent of all methods described
by Fomenko (1996), Holmes (1998) and Milov (1994). Here we shall not apply
this method to resolution of well-known problems of disputed authorship for
specific texts. The present article is mainly theoretical and is aimed on
testing of the offered method. The method is based on Markov Model for the
sequence of letters.
Assume we have wide range of large texts in Russian of known
authorship. A text under question is known to be written by one of these
writers but we do not know who the true author is. Our goal is to find the
author who is closest to the text under question according to the proposed
measure. Note also that for control purposes we do know a real author for
each of "doubtful" texts.
Consider a sequence of letters of text as a Markov chain. The matrices
of transition frequencies of letter pairs are calculated over all texts of
each author. Therefore we know (approximately) the probability of
transition from one letter to another for each author. The true author of
an anonymous text is calculated with the principle of maximal likelihood,
i.e., for each matrix we calculate the probability of the anonymous text,
we choose the author with the maximal corresponding probability and the
chosen author is supposed to be the true author. Suppose we take the
logarithm of each probability, change the sign, and divide it by the length
of anonymous text; then each of the obtained numbers is called the relative
empirical entropy. Relative empirical entropy is more convenient for
computing. The true author has the minimal relative entropy.
This method is amazingly precise. We discuss here the features of
methods application and compare it to another method which is based on
isolated letter frequencies in the text. The method is checked on 386 text
samples of 82 writers. There are a lot of well-known Russian writers of the
nineteenth and twentieth centuries among them.

2. Mathematical Foundations

2.1 Markov Model. Suppose A is a set of letters. Let Ak be a set of words
of length k over alphabet A and A*=(k>0 Ak. By | f | denote the length of
f(A* .
Consider the following formalization of the problem of true writer
recognition. Take n sets Ci, where i=0,...,n-1. The set Ci contains
sequences fi,j(A*, where j=1,...,mi, i.e., Ci={ fi,j | j=1,...,mi}. We
shall assign to x(A* the set Ci.
Let fi,j be a sample from Markov chain with transition matrix (i. Now
we shall define the estimate Pi for (i. Let hi,j,kl be the number of letter
pairs (transitions) k(l in fi,j. We have hi,kl=(jhi,j,kl and hi,k=(lhi,kl.
By definition, put Pikl = hi,kl / hi,k. Let Zi be a set of pairs (k,l) such
that Pikl>0.
Also, let x be a sample from a Markov chain with transition matrix P(,
where ( is unknown number in {1,...,n}.
Let (kl denote the number of transitions k(l in x. By definition, put
(k=(l (kl. Consider
(i(x) = -((k,l) (kl(ln((kl /(Pikl((k)),
where the sum is calculated over (k,l)(Zi . Let us remark that if Zi
contains all possible pairs (k,l), then (i(x) is the minus logarithm of the
probability of the sequence x when the sequence x is a Markov chain with
transition matrix Pi. The normalized number (i(x) / | x | is called the
relative empirical entropy with respect to the set Ci. Define the maximum
likelihood estimate t(x) of the unknown ( by the rule
t(x)=argmini=0,...,n-1 (i(x) / | x |. (2.1)
We shall not discuss or prove any mathematical properties of estimate (2.1)
although these properties are interesting for statistics (Ivchenko and
Medvedev, 1992, p. 224). But we shall prove that estimate (2.1) is
amazingly effective in true author recognition.

2.2 Computational procedure. Take A={small cirillic letters}({space
symbol}. Suppose we have quite a large texts of n authors in Russian. Let
gi,j denote the j-th text of the author i. The text gi,j is a sequence of
symbols of extendent alphabet B, where the alphabet B contains punctuation
symbols, capital letters, Latin letters etc (set B is extended ASCII).
To each fragment gi,j(B* assign the sequence F(gi,j)(A*. Let F be the
map from B* to A* such that all capital letters convert to lower letters,
all hyphens are attached, all punctuation and extra space symbols drop, a
delimiter space symbol between words stays, and a space symbol is inserted
at the beginning and at the end of the fragment if absent.
Besides, consider the map G: B*( A*. The map G has the same structure
as F, but the map G drops all words with capital letter in text gi,j . In
particular, if
y=«Krome togo, my budem rassmatrivat' funktsiju G,»,
then F( y)=« krome togo my budem rassmatrivat' funktsiju »,
and G( y)=« togo my budem rassmatrivat' funktsiju ».
Now suppose an text y(B* is written by one of n authors, but the true
author is unknown. How do we find the true author of y? Using
estimate (2.1) with sequences x=F( y) and x=G( y) we obtain two ways for
determining the author:
1) the true author is t(F( y)),
2) the true author is t(G( y)).
Note that estimates t(F( y)) and t(G( y)) use only information on
letter pairs. Estimates t(F( y)) and t(G( y)) are independent of word order
because all words are delimited by space symbols. Perhaps, t(F( y)) and
t(G( y)) are based on mediated information on morphemic units (and their
combinations within word forms), and surely use no syntactical and
phraseological information (methods described by Milov (1994) are based on
syntactical information).
Formally any natural language text is not a Markov chain, i.e., this
hypothesis is rejected by decision rules of statistics. But one can
formally calculate and apply estimate (2.1) to human language texts taking
into account stability of estimates obtained by the principle of maximal
likelihood. In the present paper we show that estimate (2.1) gives the true
author in most cases.

2.3 Frequency analysis for isolated letters. Assume that the sequences fi,j
and x are sample from i.i.d.r.v. taking values in A. By the previous
statement, estimate (2.1) takes the following form
e(x) = argmini (i(x) / | x |, (2.2)
where
(i(x) = -(k (kln(((k(hi)/(hi,k(()),
and the sum is taken over k such that (k >0 and (=(k (k, hi=(k hi,k . Note
that estimate (2.2) is usually called frequency analysis. Our calculation
show that e(x) is significantly worse than t(x).

3. An example of research

Consider an example of our approach for checking the quality of the
method for disputed authorship resolution. We shall check the estimate
t(F( y)) on texts of K. Bulychev, A. Volkov, N.V. Gogol' and V. Nabokov.
We offer the following way of checking. Choose a control text y i for
each author (i = 0,1,2,3), calculate estimates P i for matrices (i with the
other texts fi,j, and calculate t(F( y i)). If the estimate is good then
for each author t(F( y i))=i.
0) K. Bulychev: Umenie kidat' mjach ( y0); Beloe plat'e zolushki
(g0,1); Velikij dukh i begletsy (g0,2); Glubokouvazhaemyj mikrob (g0,3);
Zakon dlja drakona (g0,4); Ljubimets [Sponsory] (g0,5); Marsianskoe zel'e
(g0,6); Miniatjury (g0,7); «Mozhno poprosit' Ninu?» (g0,8); Na dnjakh
zemletrjasenie v Ligone (g0,9); Pereval (g0,10); Pokazanija Oli N. (g0,11);
Pominal'nik XX veka (g0,12); Raskopki kurganov v doline Repedelkinok
(g0,13); Trinadtsat' let puti (g0,14); Smert' etazhom nizhe (g0,15);
1) A. Volkov: Sem' podzemnykh korolej ( y1); Volshebnik izumrudnogo
goroda (g1,1); Urfin Dzhjus i ego derevjannye soldaty (g1,2); Ognennyj bog
Marranov (g1,3); Genial'nyj pen' (g1,4); Na vojne, kak na vojne (g1,5); O
chem molchali gazety... (g1,6); Prestuplenie i nakazanie (g1,7); Epilog
(g1,8); Zheltyj Tuman (g1,9); Tajna zabroshennogo zamka (g1,10);
2) N.V. Gogol': Rasskazy i povesti ( y2, nazvanija povestej: «Povest'
o tom, kak possorilsja Ivan Ivanovich s Ivanom Nikiforovichem»,
«Starosvetskie pomeschiki», «Vij», «Zapiski sumasshedshego»); Revizor
(g2,1); Taras Bul'ba (g2,2); Vechera na khutore bliz Dikan'ki (g2,3);
3) V. Nabokov: Drugie berega ( y3); Korol', dama, valet (g3,1); Lolita
(g3,2); Mashen'ka (g3,3); Rasskazy (g3,4); Nezavershennyj roman (g3,5).
For example, A. Volkov has a control text y1, i.e., «Sem' podzemnykh
korolej.» The other texts of A. Volkov are used in computing of Pi.
The results of calculations are presented in the following table 1.

| | | |Table 1 | | |
|N |Author |c1 |c2 |c3 |c4 |
|0 |K. Bulychev |0 |15 |2345689 |75161 |
|1 |A. Volkov |0 |8 |1733165 |233418 |
|2 |N.V. Gogol' |0 |3 |723812 |243767 |
|3 |V. Nabokov |0 |5 |1658626 |367179 |

The column c2 contains the total number of files with texts of each
author. Note that the number of files and the number of texts are different
in the following two cases. Firstly, some texts of the author can be
collected in one file (this is happen in A. Volkov case: three stories
«Zheltyj Tuman», «Tajna zabroshennogo zamka» and «Ognennyj bog Marranov»
are collected in one file). Secondly, one large text can be cut into
smaller parts (recall this when you study the table 2).
The column c3 contains the total number of symbols (cyrillic letters
and spaces) in F(gi,j): c3=(j |F(gi,j)|. The column c4 contains the total
number of symbols in F( y i), i.e., c4=|F( y i)|. For example, the total
volume (j F( g0,j ) of texts for K. Bulychev is 2'345'689. The volume of
F( y1), i.e., the number of symbols of alphabet A in control story «Umenie
kidat' mjach» is 75'161.
The column c1 in line j contains the (j(F( y j)) rank. The ranking of
numbers {(i(F( y j)) | i = 0,1,2,3}. The rank of (j(F( y j)) is its number
(counting from 0) in set of numbers {(i(F( y j)) | i = 0,1,2,3} lined up
from smallest to biggest. For example, if j=1 and (i lined up in the
following order (0((3((2((1then (1 has a rank of 3. If j=0 and (i lined up
in the same order (0((3((2((1 then (0 has a rank of 0. The rank of
(j(F( y j)) in set of numbers {(i(F( y j) | i=0,1,2,3} equals the rank of
(j(F( y j))/|F( y j)| among numbers {(i(F( y j))/|F( y j)| | i=0,1,2,3}.
Put in lines j=0,1,2,3 of the following matrix the numbers
(i(F( y j))/|F( y j)|, i=0,1,2,3:

| | |2.484569|2.508425|2.504301|2.493778|
| |( = |2.501061|2.473907|2.516797|2.492874|
| | |2.499033|2.504508|2.480202|2.483829|
| | |2.541367|2.538101|2.548842|2.520018|


Rank (i in each line.
| | |0|3|2|1 |
| |R =|2|0|3|1 |
| | |2|3|0|1 |
| | |2|1|3|0 |


The required numbers of column c1 are the diagonal in matrix R.
Using (2.1), we obtain that t(F( y j))=j iff the number
(j(F( y j))/|F( y j)| in the set {(i(F( y j))/|F( y j)| | i=0,1,2,3} has a
rank of 0. Therefore, suppose we find 0 in column c1 of table 1; then the
authorship of the control text is established correctly. Note from the data
of table 1 that the authorship is established absolutely correctly.
Before we discuss this result we shall explain why we define column c1
in such a complicated way. Suppose the authorship is not established
correctly, i.e., t(F( y j)) ( j; then we are interested in how close to the
correct author we are. If (j(F( y j))/|F( y j)| has a rank of 1 in the set
{(i(F( y j))/|F( y j)| | i=0,1,2,3}, then the error is just in one author.
Clearly this case is much better than another case where
(j(F( y j))/|F( y j)| is of rank 3. (This note is useful in study of Table
2).
Besides, the matrix R has some interesting interpretations. For
example, it follows from the first line that the list of author-candidates
for the control text «Umenie kidat' mjach» of K. Bulychev is K. Bulychev,
V. Nabokov, N. Gogol' and A. Volkov. In following two lines V. Nabokov is
also the second candidate to control texts of A. Volkov and N. Gogol'. Is
it the corollary of historical position of V. Nabokov between N. Gogol' and
A. Volkov with K. Bulychev? If it is true then our method is sensitive to
historical epoch of the text. This hypothesis is supported by the data of
the last line of matrix R. The list of candidates to the control text of
V. Nabokov is V. Nabokov, A. Volkov, K. Bulychev, and N. Gogol'. If the
pair of A. Volkov and K. Bulychev is broken by N. Gogol' then it
contradicts the hypothesis. Nevertheless, there are other possible
interpretations of matrix R and the author of present paper does not insist
on the one above.
It is interesting to know how the matrix R depend on
a) the number and volume of training samples,
b) the homogeneity in genre,
c) the homogeneity in theme,
d) the size of control text,
e) the unit of analysis (the level of letters, words or sentences)
and many other parameters. Here we give some information on point a)
and d). Our method works well (i.e. the diagonal of matrix R is zero) when
the total size of training samples is more than 100 thousand ASCII symbols
and the size of control text is more than 100 thousand ASCII symbols.
Returning to discussion on table 1. The authorship of all control
texts is established absolutely correctly. The result is completely
unexpected because we use such primitive information as the frequencies of
letter pairs. Moreover, a simple computer study (we omit this results here)
showed that if there are a small number of author-candidates (less than 6)
even estimate (2.2) gives good results. Note that estimate (2.2) is based
just on the isolated letter frequencies! The full research is described in
the following section. The full study shows that estimate (2.1) is stable
even for large number of authors.


4. Results of extended research

Consider n=82 authors of the nineteenth and twentieth centuries given
in the electronic library at http://www.lib.ru. Various authors have from 1
to 30 different text samples e.g. Arkadij and Boris Strugatskij have 30
texts. One author, Boris Strugatskij has just one story. In this case the
story is cut into two separate texts, one of which acts as a control. The
total size of texts for each of chosen authors exceeds 100000 ASCII
symbols. The total number of novels, stories and short stories exceeds
1000. These texts are combined in 386 text samples. The total size of data
is 128(106 ASCII symbols.
For each author we make a list gi,j of texts for estimates Pi and for
each author we keep a control "doubtful" text y i. The control text y i is
not used in each of calculations of Pi. The total number of control text
samples is 82 and the total number of text samples in the training set is
304=386-82. Following research design presented in previous section, we
studied the quality of estimates t(F(()), t(G(()), e(F(()), e(G(()) for all
82 writers. For brevity we shall give the big table for t(G(()) only. This
table is filled out in a similar way with table 1. Again, for brevity we
omit corresponding tables ( and R.

Table 2

|N |Author |c1 |c2 |c3 |c4 |
|0 |K. Bulychev |0 |15 |2007724 |64741 |
|1 |O. Avramenko |0 |6 |1733113 |223718 |
|2 |A. Bol'nykh |0 |6 |1294721 |373611 |
|3 |A. Volkov |0 |8 |1478932 |202495 |
|4 |G. Glazov |0 |5 |1398323 |184593 |
|5 |M. and S. Djachenko |0 |5 |1754213 |197039 |
|6 |A. Etoev |0 |5 |267096 |80358 |
|7 |A. Kabakov |0 |4 |905502 |222278 |
|8 |V. Kaplan |0 |6 |515029 |129608 |
|9 |S. Kazmenko |3 |4 |1846161 |156768 |
|10 |V. Klimov |0 |3 |250231 |179903 |
|11 |I. Krashevskij |0 |2 |1183722 |481795 |
|12 |I. Kublitskaja |0 |1 |282377 |170469 |
|13 |L. Kudrjavtsev |1 |3 |583239 |179093 |
|14 |A. Kurkov |0 |6 |628041 |218726 |
|15 |Ju. Latynina |10 |2 |2628781 |283565 |
|16 |A. Lazarevich |46 |3 |310553 |94629 |
|17 |A. Lazarchuk |0 |5 |2395669 |210151 |
|18 |S. Lem |0 |7 |1568013 |343519 |
|19 |N. Leonov |0 |2 |568854 |279377 |
|20 |S. Loginov |14 |13 |1998543 |159247 |
|21 |E. Lukin |0 |4 |602216 |125694 |
|22 |V. Chernjak |0 |2 |920056 |201636 |
|23 |A.P. Chekhov |0 |2 |662801 |343694 |
|24 |I. Khmelevskaja |0 |4 |1524905 |203684 |
|25 |L. and E. Lukin |0 |3 |837198 |122999 |
|26 |S. Luk'janenko |0 |14 |3682298 |483503 |
|27 |N. Markina |0 |1 |266297 |93647 |
|28 |M. Naumova |0 |3 |306514 |337821 |
|29 |S. Pavlov |0 |2 |751836 |453448 |
|30 |B. Rajnov |0 |4 |1405994 |420256 |
|31 |N. Rerikh |0 |3 |1011285 |211047 |
|32 |N. Romanetskij |2 |6 |1305096 |117147 |
|33 |A. Romashov |0 |1 |88434 |87744 |
|34 |V. Rybakov |0 |6 |715406 |121497 |
|35 |K. Serafimov |0 |1 |186424 |75276 |
|36 |I. Sergievskaja |0 |1 |109118 |50786 |
|37 |S. Scheglov |10 |2 |253732 |55188 |
|38 |A. Schegolev |0 |2 |848730 |105577 |
|39 |V. Shinkarev |29 |2 |156667 |80405 |
|40 |K. Sitnikov |0 |7 |419872 |109116 |
|41 |S. Snegov |0 |2 |824423 |408984 |
|42 |A. Stepanov |0 |5 |1223980 |93707 |
|43 |A. Stoljarov |11 |1 |350053 |137135 |
|44 |R. Svetlov |0 |2 |454638 |268472 |
|45 |A. Sviridov |63 |3 |660413 |235439 |
|46 |E. Til'man |0 |2 |705352 |464685 |
|47 |D. Truskinovskaja |0 |8 |2005238 |118351 |
|48 |A. Tjurin |0 |18 |4109050 |110237 |
|49 |V. Jugov |0 |5 |829209 |66657 |
|50 |A. Molchanov |0 |1 |398487 |206541 |
|51 |F.M. Dostoevskij |1 |3 |613825 |88582 |
|52 |N.V. Gogol' |0 |3 |638339 |215540 |
|53 |D. Kharms |0 |2 |199449 |114889 |
|54 |A. Zhitinskij |0 |2 |2137325 |543037 |
|55 |E. Khaetskaja |2 |2 |723167 |204091 |
|56 |V. Khlumov |0 |3 |788562 |183358 |
|57 |V. Kunin |0 |3 |1335918 |296463 |
|58 |A. Melikhov |0 |1 |615548 |458086 |
|59 |V. Nabokov |0 |5 |1522633 |342774 |
|60 |Ju. Nikitin |0 |2 |1342176 |702383 |
|61 |V. Segal' |0 |2 |320218 |75917 |
|62 |V. Jan |0 |1 |507502 |600636 |
|63 |A. Tolstoj |0 |1 |129664 |97842 |
|64 |I. Efremov |0 |1 |536604 |256521 |
|65 |E. Fedorov |0 |1 |1120665 |221388 |
|66 |O. Grinevskij |0 |1 |158762 |96085 |
|67 |N. Gumilev |0 |1 |70181 |71042 |
|68 |L.N. Tolstoj |0 |1 |1225242 |199903 |
|69 |V. Mikhajlov |0 |1 |254464 |84135 |
|70 |Ju. Nesterenko |0 |1 |352988 |71075 |
|71 |A.S. Pushkin |0 |1 |170380 |57143 |
|72 |L. Reznik |0 |1 |115925 |79628 |
|73 |M.E. |0 |1 |239289 |101845 |
| |Saltykov-Schedrin | | | | |
|74 |V. Shukshin |0 |1 |309524 |66756 |
|75 |S. M. Solov'ev |0 |1 |2345807 |160002 |
|76 |A. Kats |0 |1 |841898 |81830 |
|77 |E. Kozlovskij |1 |1 |849038 |889560 |
|78 |S. Esenin |0 |1 |219208 |44855 |
|79 |A. Strugatskij |0 |1 |151246 |51930 |
|80 |A. and B. |0 |29 |6571689 |345582 |
| |Strugatskij | | | | |
|81 |B. Strugatskij |0 |1 |298832 |261206 |

First note that the number of right answers (zeroes in column c1) is
very high: 69. The true author is second in the candidate list (a 1 in
column c1) in 3 cases: L. Kudrjavtsev, F.M. Dostoevskij and E. Kozlovskij.
The true author is third (c1=2) in 2 cases: N. Romanetskij and E.
Khaetskaja. Only one author (S. Kazmenko) has the fourth position in list
of candidates (c1=3). The error of recognition is very high in case of the
other 7 authors (Ju. Latynina, A. Lazarevich, S. Loginov, S. Scheglov,
V. Shinkarev, A. Stoljarov, A. Sviridov). They are not in the list of top
ten of candidates.
The average rank is a sum of numbers in column c1 divided by the total
number of writers 82. The average rank is a measure for error of estimate
t(G(â)). Here the average rank is equal to
2.35((3(1+2´2+1´3+2´10+1´11+1´14+1´29+1´46+1´63) / 82
All these numbers are given in table 3 in column for t(G(â)). Suppose we
drop 7 hardly recognizable authors; then the average rank is
0.13(2/15=(3´1+2´2+1´3) / 75.
Now note that the method is applicable to poetry (A.S. Pushkin,
S. Esenin and N. Gumilev). Further note that Polish writers (S. Lem and
I. Khmelevskaja) are recognizable although their stories are translated
from Polish to Russian. Finally note that hardly recognizable authors are
not well-known classics.
The following table 3 contains results of similiar research with
estimates t(F(x)), e(F(x)), e(G(x)) on the same texts and authors.

Table 3
| | |c1 |t(F(â)|t(G(â)|e(F(â|e(G(â)|
| | | |) |) |)) |) |
| | |0 |57 |69 |1 |2 |
| | |1 |4 |3 |8 |8 |
| | |2 |4 |2 |7 |13 |
| | |3 |4 |1 |2 |2 |
| | |4 |0 |0 |3 |7 |
| | |(5 |13 |7 |61 |50 |
| | |Average|3.50 |2.35 |13.95|12.37 |

Note that the frequency analysis for isolated letters works badly (we
have at most 2 exact answers in the best case). Nevertheless, it gives some
information on author because if the true author is chosen at random then
the average result in columns e(F(â)) and e(G(â)) should be about 40. Note
also that the dropping of words with a capital letter makes the results
better (even in case of frequency analysis). Obviously, columns with
function G(â) perform better than columns with function F(â).


5. Conclusion

Note from the data of table 3 that estimate (2.1) is better than (2.2)
and estimate (2.1) gives right author in greater number of cases (84%
against 3%). Note that estimate (2.1) uses information on pairs of letters,
but estimate (2.2) uses only information on isolated letters frequency
distribution. Therefore the superiority of estimate (2.1) is expected. We
stress that the precision of estimate (2.1) is interesting. For instance,
the method of author's invariant (Fomenko et al., 1996) can not distinguish
more than 10 writers (here we consider more than 80 writers). Undoubtedly
this precision should attract the greater attention to the present method.
Note that the quality of author's recognition increases significantly
when we drop words with capital letter. This phenomenon needs an
explanation.
As we mentioned above A.A. Markov was interested in problem of
disputed author resolution. It is remarkable that his own idea on "events
which are linked to a chain" leads to new approaches to this problem today.
The author is grateful to M.I. Grinchuk for fruitful discussions. Also
the author thanks A.T. Fomenko and G.V. Nosovskij for attention to this
work and discussions on results. Also the author is grateful to
A.A. Polikarpov for discussions which influenced the final shape of this
paper (including its translation into English). Author thanks F.J. Tweedie
for some language corrections and references.


References


Fomenko, V.P. and Fomenko, T.G. (1996). Avtorskij invariant russkikh
literaturnykh tekstov. [Predislovie A.T. Fomenko.] (Author's Quantative
Invariant For Russian Literary Texts [Commentary by Academician
A.T.Fomenko]). In: Fomenko, A.T. Novaja khronologija Gretsii: Antichnost' v
srednevekov'e. T. 2. M.: Izd-vo MGU, p.768-820. (in Russian).
Holmes, D.I. (1998). The Evolution of Stylometry in Humanities
Scholarship. In: Literary and Linguistic Computing, Vol. 13, No. 3.
Ivchenko, G.I. and Medvedev, Yu.I. Matematicheskaja statistika.
(Mathematical Statistics) 2nd edition. M.: Vysshaja shkola, 1992. (in
Russian).
Markov, A.A. (1913). Primer statisticheskogo issledovanija nad tekstom
«Evgenija Onegina» illjustrirujuschij svjaz' ispytanij v tsep'. (An example
of statistical study on text of «Eugeny Onegin» illustrating the linking of
events to a chain). In: Izvestija Imp.Akad.nauk, serija VI, T.X, N3, c.153.
(in Russian).
Markov, A.A. (1916). Ob odnom primenenii statisticheskogo metoda (On
some application of statistical method). In: Izvestia Akademii Nauk.
(Russia). Ser.6, vol.X, N4, p.239 (in Russian).
Milov, L.V. (editor) (1994). Ot Nestora do Fonvizina. Novye metody
opredelenija avtorstva. (From Nestor to Fonvizin. New methods of
determining of authorship) M.: Progress publishing. (in Russian).
Morozov, N.A. (1915). Lingvisticheskie spektry (Linguistic
spectrums). In: Izvestia Akademii Nauk (Russia), (Section of Russian
Language), Books 1-4, vol.XX, (in Russian).