Lexical Resources for Personal Data Recognition in Texts

Nina N. Leontyeva, S. Semenova

Abstract

The POLITEXT system (Leontyeva 1995) was aimed at automatic understanding of text corpora about Russian political life. Starting from a corpus of Natural-Language Texts (NLT), such as newspaper reports, news agency materials, governmental decrees and statements, the system is to extract the most essential or some ordered parts of knowledge from the corpus and to include them in existing databases (DBs) or in new, more "intelligent" ones. POLITEXT was later developed into the information system RUSSIA, which currently runs very successfully at Moscow State University, mainly on the terminological level (Dobrov et al. 1998; Loukashevitch et al. 2000). We dwell on the linguistic aspects and Lexical Resources of the POLITEXT branch, having in mind an intermediate semantic representation of a whole text as the basis for differentiated Information Extraction. We focus on the recognition of personal data (in the social sphere) in texts; see (Leontyeva et al. 2001).

1. Outline of the POLITEXT Lingware

The lingware scheme we created for the whole system consists of several large modules: Graphematical Analysis (GraphAn) of the text material, followed by the processing of set phrases; Morphological Analysis (MorphAn) and Lemmatization of word forms; Syntactic Analysis (SynAn) of sentences; Terminological Analysis (TermAn); Thematic (or Topic) Analysis (ThemAn); Semantic Analysis (SemAn) of paragraphs and of the whole text, including construction of the Base of Textual Facts (BTF). The TermAn and ThemAn modules form an independent subsystem of automatic indexing and thematization that is a working part of the system RUSSIA (Dobrov et al. 1998; Loukashevitch et al. 2000).

The central component of the POLITEXT system is the semantic one, which ensures the transition from linguistic units to content units familiar to users. To accomplish this task the system builds a primary semantic representation (Semantic Space) that is the basis for extracting various TASK-oriented or USER-oriented substructures. These are frames of persons (for example, of VIPs; Leontyeva et al. 2001), organizations, events accompanied by their own frames, temporal data, places of events, parametric information, etc. Thus we can obtain several "dimensions" of a text, regarding this set of specific databases (DBs) as a multidimensional semantic (or simply multisemantic) structure.

The final multisemantic structure would serve as a basis, first of all, for differentiated search processes and then for combining the retrieved information into a compressed semantic representation of the corpus, that is, the BTF structure.

2. Stages of understanding in ILM and POLITEXT

We treat the term "understanding" as nearly synonymous with information analysis, i.e., as a sequence of operations that generate information from any given NLT (and for any given reader) with the desired degree of completeness. This is a central point of the Information-Linguistic Model (ILM), which defines the so-called "soft understanding" in our approach.

So, the task of the lingware in the ILM is to ensure automatic acquisition of information from NLT-sources, preferably from short political messages in mass-media corpora.

According to the proposed information model, the understanding process passes through three stages. The first stage (local understanding) provides a complete linguistic analysis of all NLT sentences that includes all traditional levels of parsing and produces a sequence of structures for a given NLT - graphematic, morphological, syntactic and local semantic representations: GraphR, MorphR, SynR and LocSemR.

The second stage (global understanding) constructs global SemR as an internal description of the whole text in terms of text constituents, with the required degree of compression.

The third stage (relative understanding) produces an external SemR, or InfoR (information representation), whose units are informative for a reader or any other perceiving intellectual system (for example, the current domain may be regarded as an external perceiving system or an external "point of view"). To satisfy this criterion, the units recognized or generated at the second stage must be verified at the third stage as compatible with the entries of a database, a knowledge base, or simply the current domain lists.

A fourth stage is also possible: the translation of the BTF into English or some other language (we think that for this purpose it would be better to use an existing text generation system).

It is not in one or two steps that the text structure becomes the information or knowledge structure. Below (Table 1) is the generalized scheme of understanding stages, of structures and of analysis instruments in the system POLITEXT:

Types of understanding	Stages of analysis	Representations	Dictionaries involved
Local understanding within one sentence	GraphAn, MorphAn, SynAn, SemAn	GraphR, MorphR, SynR, Local SemR	Grammatical Dicts, Special linguistic Dicts, SemDict RUSLAN
Global understanding (inter-sentence analysis)	SemAn (for text or its paragraphs), Situational Analysis (SitAn)	Semantic Space, SitR, Global SemR	RUSLAN, Ling. Frames, Text Schemes
Relative (user-oriented or domain-oriented) understanding	Interpretation of SemR in terms of domain and user request	Special Representation, Textual Facts, BTF	Thesaurus, data bases, Special Frames
Understanding in terms of another language	Translation	English variants of BTF	Rus-Eng. Dictionaries, Eng. Dicts / Grammar for Text Generation

Table 1. Levels of text processing in the POLITEXT system
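The staged scheme of Table 1 can be illustrated by a small sketch in which each stage adds one named representation to a shared store. This is only an illustration under stated assumptions: the function names (graph_an, morph_an, run_stages) are hypothetical, and the stage bodies are trivial placeholders rather than the actual POLITEXT analyzers.

```python
from typing import Callable, Dict, List, Tuple

# Each stage receives the raw text plus all representations built so far
# and returns its own representation (names follow Table 1).
Representations = Dict[str, object]
Stage = Tuple[str, Callable[[str, Representations], object]]


def graph_an(text: str, reps: Representations) -> List[str]:
    # Placeholder for graphematic analysis: whitespace tokenization only.
    return text.split()


def morph_an(text: str, reps: Representations) -> List[Tuple[str, str]]:
    # Placeholder for morphological analysis: a lower-cased token with
    # trailing punctuation removed stands in for the lemma.
    return [(tok, tok.lower().strip(".,")) for tok in reps["GraphR"]]


# Only the first two local stages are sketched; SynAn, SemAn and the
# global and relative stages would be appended in the same way.
LOCAL_STAGES: List[Stage] = [("GraphR", graph_an), ("MorphR", morph_an)]


def run_stages(text: str, stages: List[Stage]) -> Representations:
    """Run the stages in order, storing each result under its name."""
    reps: Representations = {}
    for name, stage in stages:
        reps[name] = stage(text, reps)
    return reps
```

Keeping every intermediate representation, rather than only the last one, mirrors the point that later stages may consult the results of any earlier stage.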

3. Linguistic vs. Information analysis

The stages of NLT analysis within the system POLITEXT are regarded as combining two types of processing:

a. linguistic analysis (LingAn),

b. information analysis (InfAn).

LingAn is a succession of bottom-up transformations which tend to build units maximally informative for a given text from a linguistic point of view.

InfAn is a succession of non-equivalent, top-down transformations of linguistic structures tending to extract or to reconstruct units significant for a given domain and a user. InfAn builds a set of structures in a task-specific mode.

The building of Textual Fact (TF) units begins within linguistic analysis and ends with the generation of new information.

The analysis is applicable to any Russian NLT, given a thesaurus that is adequate for this model and tuned to the domain. All information processes proper must be carried out in the ILM under linguistic control (that is, they are "linguistically motivated"). We believe that very complex, large-scale information processes may be modeled on a rather compact linguistic structure such as the Semantic Space (SemSpace).

4. The instruments of the system POLITEXT lingware

Automatic text processing is usually based on: 1) algorithms, grammars and programs of NLT analysis; 2) a set (network) of dictionaries with linguistic and special domain knowledge, supporting all the stages of understanding. In our case the network includes:

a. a set of Russian dictionaries of single words supplied with morphological and syntactic information,

b. a dictionary of canned phrases and collocations,

c. the Russian Semantic Dictionary RUSLAN with English equivalents,

d. a bilingual (Russian-to-English) Thesaurus on the modern life of Russia and special databases (separate DBs or lists and their combinations).

Below we shall dwell upon the RUSLAN dictionary and its possibilities for the SemAn and Information Extraction (IE) tasks.

5. RUSLAN as a dictionary for Semantic and IE Analysis

The Russian general semantic dictionary (Russian abbreviation ROSS, later RUSLAN) is the main instrument for the intra-sentence, inter-sentence and whole-text interpretation of the initial units; see (Leontyeva 2003; Semenova 2000). Its entries receive detailed descriptions of their syntactic and semantic properties. RUSLAN has a DB form, and its entries are entered on an IBM PC by means of a special interactive scenario (man-machine dialogue). The dictionary is written in a special formalized language and contains information on the semantic characteristics (taxonomy categories) of the lexeme described, its valencies, typical contexts, thesaurus relatives, derivational capabilities, collocations, English translations, domain of usage and some other data.
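To make the kinds of information listed above more concrete, one entry can be pictured as a record with one field per zone. The sketch below is a hypothetical in-memory view only: the class name RuslanEntry and all field names are assumptions made for illustration, not the formalized language or DB schema actually used in RUSLAN.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RuslanEntry:
    """Hypothetical view of one entry; field names are illustrative assumptions."""
    lexeme: str                                                   # headword being described
    categories: List[str] = field(default_factory=list)           # taxonomy categories
    valencies: Dict[str, List[str]] = field(default_factory=dict) # relation -> expected filler classes
    contexts: List[str] = field(default_factory=list)             # typical contexts
    thesaurus_relatives: List[str] = field(default_factory=list)  # synonyms, antonyms, etc.
    derivatives: List[str] = field(default_factory=list)          # derivational capabilities
    collocations: List[str] = field(default_factory=list)
    english: List[str] = field(default_factory=list)              # English translations
    domain: str = ""                                              # domain of usage


# An English gloss stands in for the Russian headword in this example.
entry = RuslanEntry(lexeme="painter", english=["painter"], domain="culture")
```

Every zone defaults to an empty value, so a partially described entry remains a valid record that can be filled in incrementally.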

The criterion for entering a word into the RUSLAN DB is based mostly on the word's importance and frequency of occurrence in the texts under consideration. The most informative or frequent nouns and verbs, which form meaningful semantic groups, tend to describe a situation, and that is why they are welcome in the dictionary. Meaningful adjectives and adverbs are represented there as well.

The dictionary RUSLAN was designed, first of all, for automatic construction of BTF for a given collection of Russian texts. As the main instrument of semantic analysis of coherent texts this dictionary bears the important information on how to build BTF-units starting from lexemes of SemDict entries. Recognizing and extracting data on “persons” and building corresponding semantic nodes are the initial steps along this path. Some data stored in this dictionary (e.g. on the word's syntactic valencies) are used at the pre-semantic stages of NLP.

The procedure proposed for RUSLAN-based semantic analysis should permit a user to form an individual BTF according to his/her request, specifying the desired degree of compression of the initial content. Some zones of the dictionary (both linguistic and informational) would provide such possibilities.

To meet the task of recognizing information on persons, a special lexicon was added to the dictionary (Semenova 2001). The RUSLAN semantic dictionary DB currently comprises more than 11,000 entries. The lexicon is described in some detail below.

6. Personal lexicon in RUSLAN

It is obvious that IE of the kind discussed in (Grishman 1999) and many other works should rely upon rather broad lexical data that are stored in a dictionary and reveal social characteristics of persons: professions, titles, positions, ranks, traits, prosperity, etc. For this purpose we have represented in the RUSLAN dictionary, in particular, about 3000 Russian animate nouns denoting persons in their social activities.

Let us now consider some peculiarities of the description of animate nouns. The field of semantic characteristics of the noun (regarded as the principal field of the dictionary) contains a logical expression over the set of standard dictionary descriptors, such as POS (the position of a person), SPEC (the [narrow] specialty), SOCIAL (for something associated with society), SOC-REL (social relation, e.g. used for enemy, son, friend, etc.), INTEL (for something intellectual, e.g. academician), POWER (for executive power bodies), ESTIM (for ethical and other evaluations), and so on.

In addition, names of typical VIPs (president, monarch, etc.) as well as names of socially marginal persons (criminal, scoundrel, etc.) are marked with an increased empirical information weight. The domain of the noun is also made explicit: DOMAIN (painter) = "culture". There are some conventions on the use of descriptors for particular lexical subclasses.

The typical valencies (denoted by binary relations) of most of these nouns are a valency for a proper name, NAME (A,B) (the famous Russian poet => Alexander Pushkin), and a valency for the so-called sphere of activities, SPHERE (A,B), which is filled by different semantic classes and has different surface forms for different lexemes (a researcher => in the field of genetics, the president => of a firm, a driver => of a bus). The corresponding zone of the dictionary enumerates the predictable semantic characteristics of the semantic actants, or "thematic roles" (J. Sowa), of the noun under consideration and their possible grammatical forms.
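The binary relations NAME (A,B) and SPHERE (A,B) can be sketched as labelled pairs that are collected into the slots of a frame for a given noun. The Relation type and the person_frame helper below are hypothetical names introduced for illustration; the fillers are the examples given in the text.

```python
from typing import Dict, List, NamedTuple


class Relation(NamedTuple):
    name: str  # relation label, e.g. "NAME" or "SPHERE"
    a: str     # the noun that opens the valency
    b: str     # the filler of the valency


def person_frame(head: str, relations: List[Relation]) -> Dict[str, str]:
    """Collect the fillers attached to `head` into a flat slot -> value frame."""
    frame = {"HEAD": head}
    for rel in relations:
        if rel.a == head:
            frame[rel.name] = rel.b
    return frame


relations = [
    Relation("NAME", "poet", "Alexander Pushkin"),
    Relation("SPHERE", "researcher", "genetics"),
    Relation("SPHERE", "president", "firm"),
]

poet_frame = person_frame("poet", relations)
# poet_frame == {"HEAD": "poet", "NAME": "Alexander Pushkin"}
```

A flat dictionary keeps the sketch short; a fuller version would also record the expected semantic class and grammatical form of each filler, as the valency zone of the dictionary does.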

The lexical fields of the dictionary entry contain regular derivatives of the animate noun, for example, the female noun and the noun denoting the joint community (both are essential for Russian language processing), as well as lexical functions in terms of Mel'chuk's Meaning<=>Text Theory (e.g. Bon (commander) = vigorous), word combinations, synonyms, antonyms, free associations, and some encyclopedic references (they show, for example, in what situations this word meaning is used).

The dictionary reflects some ways of meaning derivation that are significant for the animate lexicon, and some regular and occasional polysemy (although the division of a word into lexemes is rather coarse, because meanings must be recognized automatically). At the same time, different lexemes of the same word (if they are declared) are usually supplied with sharply distinct contexts.

Those dictionary descriptions that represent components of the semantic structure of the lexeme correspond to some extent to typical slots of the personal information frame, which is why they contribute to the extraction of such data.

Besides the animate nouns, the RUSLAN lexicon includes some other data essential for personal information extraction, namely, verbs of social activities and adjectives and adverbs of personal evaluation. These classes are supplied with their own "local" rules of description.

7. Not only RUSLAN

Other lexical resources needed for the task are lists (e.g. of canned expressions and collocations) and some special DBs or lists: proper names, names of famous firms and institutes, toponymic data, etc. Some of them are combined, e.g. VIPs and their positions (Anna Akhmatova - Russian poet of the XX century, etc.). RUSLAN gives synonyms and derivatives (president - head, poet - poetess, Russia - state, ...). For every class of units, individual features of description have been elaborated, using the metalanguage of the system POLITEXT and based on the general conception of the RUSLAN dictionary.

Conclusion

Extraction of special information (about persons in our case) may proceed in the course of, or after, every step of the text analysis mentioned above. Thus, GraphAn marks some words as possible names. Some simple context rules form the most probable chains of complex names, of the kind A.B.Caplansky, A. and B. Caplan, etc. MorphAn adds more exact features for animate nouns. SynAn builds syntactic relations that serve as additional filters for building proper noun phrases (NPs). The LocSemAn module interprets syntactic nodes and relations and proposes a frame of a person, filling in the semantic valency slots.
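A simple context rule of the kind mentioned for name chains can be approximated by a regular expression. The sketch below is an assumption made for illustration, not the actual GraphAn rule: it uses Latin letters to match the transliterated examples (Russian text would need Cyrillic character classes), and it only proposes candidates, which later stages would have to filter.

```python
import re

INITIAL = r"[A-Z]\."       # a single initial such as "A."
SURNAME = r"[A-Z][a-z]+"   # a capitalized word taken as a surname candidate

# One or two initials, optionally followed by "and" plus another initial,
# then the surname: covers "A.B.Caplansky" and "A. and B. Caplan".
NAME_CHAIN = re.compile(
    rf"(?:{INITIAL}\s?){{1,2}}(?:and\s{INITIAL}\s?)?{SURNAME}"
)


def candidate_names(text: str) -> list:
    """Return all substrings that look like initials followed by a surname."""
    return [m.group(0) for m in NAME_CHAIN.finditer(text)]


found = candidate_names("A.B.Caplansky met A. and B. Caplan yesterday.")
# found == ["A.B.Caplansky", "A. and B. Caplan"]
```

Such a pattern over-generates (any initial followed by a capitalized word is accepted), which is consistent with treating its output as "possible names" to be confirmed by MorphAn, SynAn and the semantic stages.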

The next set of procedures is global in character. Each new person-like NP or frame must be compared with those already built, either to confirm an existing frame or to open (correct, split, reject and so on) a new one.
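The comparison step can be sketched as follows, under a deliberately simple assumption: two frames are taken to describe the same person when they carry the same NAME and none of their shared slots conflict. The slot names and the functions compatible and integrate are hypothetical; correcting, splitting and rejecting frames are not modeled.

```python
from typing import Dict, List

Frame = Dict[str, str]


def compatible(old: Frame, new: Frame) -> bool:
    """True if both frames have the same NAME and no shared slot conflicts."""
    if "NAME" not in old or old["NAME"] != new.get("NAME"):
        return False
    shared = set(old) & set(new)
    return all(old[slot] == new[slot] for slot in shared)


def integrate(frames: List[Frame], new: Frame) -> List[Frame]:
    """Merge `new` into the first compatible frame, or open a new frame."""
    for frame in frames:
        if compatible(frame, new):
            frame.update(new)      # confirm the existing frame and enrich it
            return frames
    frames.append(dict(new))       # no match: open a new frame
    return frames


frames: List[Frame] = []
integrate(frames, {"NAME": "A. Caplan", "POS": "president"})
integrate(frames, {"NAME": "A. Caplan", "SPHERE": "firm"})
integrate(frames, {"NAME": "B. Caplan"})
```

After these three calls the list holds two frames: an enriched frame for A. Caplan and a separate frame for B. Caplan.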

One more verification may be done in the course of the so-called relative analysis. If complex information is available in the system (e.g. that this person held a certain position during a certain span of time), then a comparison against such a DB would support the hypothetical textual frame in its different variations.
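A check of this kind can be sketched as a lookup against records of who held which position and when. The record layout, the function name supported and the sample data below are all hypothetical and serve only to illustrate the comparison.

```python
from datetime import date
from typing import Dict, List, NamedTuple


class PositionRecord(NamedTuple):
    person: str
    position: str
    start: date
    end: date


def supported(frame: Dict[str, str], text_date: date,
              records: List[PositionRecord]) -> bool:
    """True if some record confirms the frame's person and position at text_date."""
    return any(
        rec.person == frame.get("NAME")
        and rec.position == frame.get("POS")
        and rec.start <= text_date <= rec.end
        for rec in records
    )


# Hypothetical DB content, invented for the example.
records = [
    PositionRecord("A. Caplan", "president", date(1995, 1, 1), date(1998, 12, 31)),
]
frame = {"NAME": "A. Caplan", "POS": "president"}
confirmed = supported(frame, date(1996, 6, 1), records)
```

A frame that is not confirmed is not necessarily wrong; in this sketch it simply receives no additional support from the DB.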

The most reliable "personal frames" may be included in the intermediate semantic structure preceding BTF or in the BTF itself.

A similar approach was applied to detect information on organizations, situations, parameters, and references to other documents. The work stopped when the new commercial era began.

Acknowledgements

We are grateful to our small team, which participated in building our experimental system POLITEXT: A.Sokirko, J.Anoshkina, N.Soushchanskaya, M.Shatalova, and students of the Russian State Humanitarian University. This work has been supported by several foundations (MacArthur Foundation, Soros Foundation, Russian Foundation of Fundamental Research). The investigations on the dictionary RUSLAN-1 (an independent lexical resource) are supported at the moment by the Russian State Humanitarian Foundation (01-04-16252a).

References

(Dobrov et al. 1998) Dobrov B. V., Loukashevitch N. V., Yudina T. N., Conceptual Indexing Using Thematic Representation of Texts, In: Information Technology: The Sixth Text Retrieval Conference (TREC6) // Ed. E. M. Voorhees, D. K. Harman – NIST SP 500-240, 1998 — pp.403-413.

(Grishman 1999) Ralph Grishman, Information Extraction: Techniques and Challenges // pdf-file, 1999.

(Leontyeva 1995) Leontyeva N. N. System POLITEXT: Information analysis of Political Texts // NTI, vol. 2, N 5, 1995. pp. 4-17.

(Leontyeva et al. 2001) Leontyeva N. N., Semenova S. Yu. Instruments for building PERSON-frame // NTI, vol. 2, N 8, 2001. – pp. 9-20.

(Leontyeva 2003) Leontyeva Nina N. RUSLAN as a Semantic Dictionary for Information Extraction. // RANLP materials 2003.

(Loukashevitch et al. 2000) Loukashevitch N., Dobrov B. Thesaurus-Based Structural Thematic Summary in Multilingual Information Systems // Machine Translation Review, Issue No. 11, December 2000, pp. 10-20 (http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-11/mtr-11-8.htm).

(Semenova 2000) Semenova S. Yu. Word Taxonomy Fields of one Russian General-Purpose Semantic Dictionary: Descriptor Selection, Analysis of Representation Possibilities. // DIALOG-2000. – V.2. – pp. 308-316.

(Semenova 2001) Semenova S. Yu. On a lexical approach to social person portrait representation. // DIALOG-2001. – V.1. – pp. 232-241.

Nina N. Leontyeva, S. Semenova, Lexical Resources for Personal Data Recognition in Texts // Workshop on Information Extraction for Slavonic and other Central and Eastern European Languages (IESL-2003). Borovets, Bulgaria, 2003.