Abstract

The semantic dictionary RUSLAN-1, whose latest version covers the Russian-to-English direction, is a tool for the semantic and informational analysis of any coherent Russian text. The rich semantic information contained in the dictionary makes possible local (within one phrase) semantic interpretation as well as semantic analysis of coherent texts. Some zones of the dictionary support logico-situational analysis of a text and provide links to different domains. RUSLAN implies a clear-cut distinction between the syntactic and semantic levels of text structure. Moreover, semantics itself is split into several sublevels, which makes possible very important information processes in a text understanding system, such as tuning to the required domain, shifting the focus of interest, compressing the initial text content, etc. At the moment the dictionary contains about 10,000 entries. Semantic descriptions entered in dBase form are also obtained as a textual file, and vice versa. The dictionary may be called from any program dealing with the syntactic, semantic and pragmatic analysis of Russian texts, so it has the status of a reusable dictionary resource.

 

Introduction

No serious systematic text processing is possible without fully employing the Lexicon as a whole. When computer systems faced the severe necessity of describing the semantic component, traditional linguistics fell short of suitable semantic theories. The most vulnerable area proved to be the interaction between linguistic structures and Data or Knowledge Bases. This problem has usually been left to computer scientists, although scholars working on detailed linguistic analysis have made successful attempts to "break through" to Knowledge Representation (Apresjan et al. 1992; Boguslavskij, Tsinman 1992).

A thought-provoking picture of a computational semantic dictionary was proposed in (Pustejovsky 1992), the central idea being that "word meaning is highly structured, and not simply a set of semantic features." Two tasks – "a fully compositional semantics for natural language and its interpretation into a knowledge representation model" and "the mapping from the lexicon to syntax" – are considered the most important. We see some correlation between these ideas and our approach to the syntactic/semantic levels of description. As for correlations with other semantic dictionaries, we would mention the Explanatory Combinatorial Dictionary of Modern Russian (Mel'c'uk, Zholkovsky 1984) and its variations used in machine translation systems (Shalyapina 1974; Leontyeva et al. 1979; the dictionaries of the ETAP MT systems, see (Apresjan et al. 1992)), as well as dictionaries for Knowledge-Based Machine Translation – KBMT (Nirenburg 1989; Papagaaij 1986, and some others).

 

1. The purpose of RUSLAN

The present version of RUSLAN is designed, first, to construct a Base of Textual Facts (BTF) for a given collection of Russian texts (Leontyeva 1992). The procedure proposed for BTF construction enables any user to form an individual BTF by tuning some dictionary fields (WEIGHT and EVENT, see below) to his informational interests and to the desired degree of compression of the initial text content. Second, the dictionary is to support multi-language, in particular Russian-to-English, KBMT. Though similar in many respects to the rich semantic dictionaries belonging to some MT systems, RUSLAN differs from them essentially. RUSLAN, like the main semantic dictionary of the FRAP system (French-to-Russian automatic translation implemented by 1985, see (Machine Translation 1987), described also in (Hutchins 1986)), implies a clear-cut distinction between the syntactic and semantic levels of text representation. Moreover, semantics itself is split into several sublevels, which makes possible very important information processes in a text understanding system, such as tuning to the required domain, shifting the focus of interest, compressing the initial text content, etc. The nature of the semantic information in RUSLAN will become clearer if we outline the stages of understanding of natural language texts (NLT), that is, the linguistic model underlying both text analysis and RUSLAN as its instrument.

 

2. Stages of the NLT understanding

Text understanding is very closely related to information analysis (Leontyeva 2000). The text understanding process is a sequence of operations that generate information with the desired degree of completeness from any given NLT for any given reader. We distinguish three major steps in this process:

1) Linguistic analysis of all NLT sentences, which includes all traditional levels of parsing and produces a sequence of representations for a given NLT – graphematic, morphological, syntactic and local semantic: GraphR, MorphR, SynR and local SemR; the last has the form of an initial semantic space (SemSpace) with gaps and other kinds of semantic "invalidity" explicitly expressed (see the sketch after this list). This step (or stage) may be characterised as "local understanding" of a text.

2) Construction of an internal textual representation in terms of situational (SIT) structures or textual fact (TF) constituents. The construction of SIT or TF units is a kind of "strategic" component of text generation systems in the sense of (McKeown 1985; Iordanskaja, Polguere 1988; Hovy 1988) et al. In our case it begins within analysis and is accompanied by the deletion of some redundant or unimportant parts of SemSpace. This second step of text analysis is "global understanding": one SIT may be gathered from the lexical material of the whole text and represented in the global SIT structure with the required degree of compression.

3) Transfer of the SIT structure into an information structure in terms of the given domain and/or according to the user's request or information interest. The global SIT structure has to be matched against the dictionary of the objects of the chosen domain and "rewritten" in terms familiar to the given addressee of information (that is, "matched" against and tuned to the lexicon of the user). It may also be focused around a given list of notions. This third stage produces TF units that are valid and informative for the reader or any other perceiving intellectual system. We therefore call it "relative understanding".
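To make the output of the first step concrete, the following minimal sketch (in Python; the class and field names are our illustrative assumptions, not the system's data structures) shows a local SemSpace as a set of binary relations R(A,B) with unfilled valencies recorded as explicit gaps.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SemRel:
    """One binary semantic relation R(A, B)."""
    name: str   # e.g. "agent", "cause", "loc"
    a: str      # first argument (A)
    b: str      # second argument (B)

@dataclass
class SemSpace:
    """Local SemR of step 1: relations plus explicitly marked gaps."""
    relations: List[SemRel] = field(default_factory=list)
    gaps: List[str] = field(default_factory=list)   # unfilled valencies etc.

# a fragment for "the flood ruined ..." with the second actant still missing
space = SemSpace(relations=[SemRel("agent", "flood", "ruin")],
                 gaps=["passive_actant of 'ruin'"])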

We do not consider the process of translating a text into another NLT to be one more level of text understanding. The final representation of each of the above levels could be translated into another NLT, another dBase, or even into units of another semiotic system (e.g. the "language of actions"); the results of translation rather show what kind of understanding has occurred: local, global or specialized (relative).

Building the TF structure, or the preceding structure – the Situational representation (SitR) – would make it possible to translate them into another natural language in the KBMT mode. We suppose that translating this way is more achievable than any "full translation", not to mention the fact that the user needs not the finest details of the text, but only its essence (cf. Schank's approach).

The dictionary under consideration aims at supporting these and similar intentions.

 

3. The Structure of RUSLAN

The structure of RUSLAN is hierarchical: the lower level comprises fields that assume concrete values. The higher level is represented by zones, i.e. names of groups of fields. The present version of the dictionary contains 10 zones, comprising over 50 fields. The list of zones is given below.

GEN: General information on the word (C) to be described;

MORPH: Morphological characteristics, including the word-forming potential;

SYN: Syntactic properties of the word C (syntactic class, specific syntactic structures, grammatical forms of complements);

LEX: Lexical combinability, cliches, idioms, lexical functions;

SEM: Semantic characteristics of C, semantic valencies, hypothetical realisation of valencies within one phrase and through the whole text; lexical functions for expressing actants; corrections and other semantic operations on the C-containing primary semantic representation;

THES: Thesaurus links of C as a concept within a certain domain, C-containing terms, explicit definitions of concepts, encyclopedic functions;

SIT: Structure of situations related to C and relevant to the given domain, relationships among situations, with particular emphasis on relations of temporal precedence and consequence;

PRAGM: Pragmatics of the situation in the domain: events, inferences, presuppositions, evaluation (weight) of the event or its parts;

EQUIV: Equivalents of the initial lexeme (and of terms within the entry) in other languages;

COMM: Comments of the lexicologist (+ name), in free form.

We would have to introduce some notation – English equivalents of the formal apparatus, which is in Russian; here we limit ourselves to the minimal necessary explanations, followed by an illustrative sketch of an entry.
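The sketch below represents one entry as a simple zone-to-fields mapping in Python. The layout and most values are our assumptions made for illustration; only the VAL, ESit and EVAL data for "export" are quoted from the examples in section 5.

ruslan_entry = {
    "GEN":   {"ENTRY": "export"},
    "MORPH": {"POS": "noun"},
    "SYN":   {"SYNT_CLASS": "N"},
    "LEX":   {"IDIOMS": []},
    "SEM":   {"CAT": "LBL/SIT",
              "VAL": ["agent,A1,C", "ob,A2,C", "end_point,A3,C"]},
    "THES":  {"DOMAIN": "economy"},
    "SIT":   {"ESit": ["belongs_to,A2,A1", "loc,A1,A2",
                       "belongs_to,A2,A3", "loc,A3,A2"]},
    "PRAGM": {"WEIGHT": 4, "EVENT": "C", "EVAL": "?"},
    "EQUIV": {"en": "export"},
    "COMM":  {"NOTE": "free-form comment of the lexicologist"},
}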

 

4. Categorization of lexemes

The upper level of semantic classification in the Russian general semantic dictionary RUSLAN is the semantic category (CAT) of a lexeme, related to the way it is represented in the semantic notation. This field can take the following main values (summarised in a small sketch after the list):

LBL – word-labels take up position A or B in the semantic relation R(A,B). The words of this category have the maximal initial informational weight. This is the largest, open and mobile class, which must cover the lexical nucleus of the chosen domain. LBLs form two subcategories: SITuation (war, discussion) and OBJect (book, President, bill). Other word-labels receive the simple LBL symbol (problem). Further semantic differentiation is derived from the fields SEMF (semantic features) and VAL (valencies), see below.

REL – relations (semantic relations, SemRels) occupy position R in the formula R(A,B). These words have a lower initial informational weight than LBLs, since they denote links (relations) between units of other categories. In the majority of cases REL is ascribed to auxiliary parts of speech (prepositions, conjunctions, punctuation marks). Nouns that coincide with the names of some semantic relations (cause, part) fall into the category of aspect words; they become relations only at the next step of analysis.

ASP – aspect words: the lexeme gives the name of the semantic relation and fills its first place in the formula R(A,B): ASPECT(size,B). Aspect words are inhomogeneous both grammatically (nouns, adjectives, verbs, adverbs) and semantically. In the SemF and VAL fields of the SEM zone they can acquire more specific relations: PARAMETER(size,B), PART_OF(member,B), as well as STAGE(A,B), MODALITY(A,B), IDENTIFIER(A,B), FUNCTION(A,B), NAME(A,B), etc. They have a lower informational weight than LBLs.

We do not describe here the so-called "word-operators". This category comprises pronouns (he, they, etc.), particles (even, only), adverbs (respectively, particularly, contrary), quantifiers (every, all), etc. These units operate on the SemSpace structure at the stage of proper (global) semantic analysis, when there already exist units that can become arguments of the semantic relations being introduced. The informational weight of these lexemes given in the dictionary is minimal, but the transformations they trigger may change the informational weight of term B in the semantic formula.

ELSE – a quasi-category for unclear cases.
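The sketch below (Python) summarises these CAT values; the concrete numeric weights are an assumption made only for illustration, since the text fixes just their ordering (LBL highest, operator-like and unclear words lowest).

from enum import Enum

class Cat(Enum):
    LBL  = "word-label"    # occupies position A or B in R(A,B); maximal weight
    REL  = "relation"      # occupies position R in R(A,B); lower weight
    ASP  = "aspect word"   # names the relation and fills its first slot
    ELSE = "unclear"       # quasi-category for unclear cases

# hypothetical initial informational weights (only the ordering follows
# from the text)
INITIAL_WEIGHT = {Cat.LBL: 5, Cat.ASP: 3, Cat.REL: 2, Cat.ELSE: 1}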

The most important categories are those that form the nuclei of semantic nodes and those that become semantic relations between them. We emphasize the very important role of word-relations in texts. They constitute the structural frame (in the form of binary semantic relations – SemRels) of any text semantic representation. They originate from different auxiliary words – prepositions, conjunctions – from some grammatical categories (of the verb, of surface Russian cases), as well as from meaningful words (precede, belong, cause, part). SemRels may be general (common to every text) or domain-specific.

It is not always easy to ascribe a category to a lexeme, especially to verbs. Normally verbs are interpreted as names of actions or situations, thus being nodes of SemR. But very often they pass into another category: they become relations, or aspects of a node, or make up only part of a semantic node, as words of the LF type do.

 

5. Some semantic zones and fields

SemF – semantic features (e.g. "thing, animate, process, intellectual, bad", etc.; lexical functions – LF – are used in this field as well). Ex.: ENTRY = bank; SemF = ORG, FINance. ENTRY = okazyvat' ('to render'); SemF = LF Oper.

VAL – the set of semantic valencies of the word; candidates for filling the valency slots are introduced by the symbol Ai (i = 1, 2, ..., 7). The form of notation in the VAL field is R(Ai,C) or R(C,Ai); the notation R,Ai,C or R,C,Ai is also possible. Ex.: ENTRY = message; VAL = agent,A1,C; addressee,A2,C; topic,A3,C; content,A4,C.

Further on, each valency is described separately in the form "SemF1", "SemF2", etc. (abbreviations of "SemF of A1", "SemF of A2", etc.; the same applies to grammatical characteristics: GramF1, GramF2, etc.).

ADD – additional semantic relations among the actants. Ex.: ENTRY = compensation; VAL = agent,A1,C; addressee,A2,C; cause,A3,C; value,A4,C; ADD = 1. patient,A2,A3; 2. belongs_to,A4,A2. Ex.: compensation to NN (A2) for the damage (A3); the value of the compensation belongs to NN.
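In code, the VAL and ADD triples R,A,B of the "compensation" entry could be held as follows (a sketch; the dictionary keys are our assumption, while the triples themselves are quoted from the example above).

compensation = {
    "ENTRY": "compensation",
    "VAL": [                      # semantic valencies: (relation, arg1, arg2)
        ("agent",     "A1", "C"),
        ("addressee", "A2", "C"),
        ("cause",     "A3", "C"),
        ("value",     "A4", "C"),
    ],
    "ADD": [                      # additional relations among the actants
        ("patient",    "A2", "A3"),   # A2 suffers the damage A3
        ("belongs_to", "A4", "A2"),   # the value A4 belongs to A2
    ],
}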

CORR – rules for correcting the valency structure, written in the form: initial SemRel, => (transition symbol), resulting SemRel / if ... (condition). Ex.: ENTRY = ruin; VAL = agent,A1,C; passive_actant,A2,C. CORR = agent,A1,C, =>, cause,A1,C / SemF1 = non-animated; e.g. the flood (A1) has ruined the village (A2). By this rule the flood is declared the CAUSE, instead of the AGENT, of the situation 'ruin'.
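A minimal sketch of how such a CORR rule could be applied, assuming the relation triples and feature sets introduced above (the function name and its interface are hypothetical, not the system's API):

def apply_corr(relations, semf1):
    """relations: list of (rel, a, b) triples; semf1: SemF features of A1."""
    corrected = []
    for rel, a, b in relations:
        if rel == "agent" and a == "A1" and "non-animated" in semf1:
            corrected.append(("cause", a, b))    # agent => cause
        else:
            corrected.append((rel, a, b))
    return corrected

# "the flood (A1) has ruined the village (A2)"
print(apply_corr([("agent", "A1", "C"), ("passive_actant", "A2", "C")],
                 {"non-animated"}))
# -> [('cause', 'A1', 'C'), ('passive_actant', 'A2', 'C')]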

RESTOR – rules for reconstructing a member of the valency structure, in particular the agent of an action when it is expressed by an infinitive.

In the zone SIT the most important fields are:

ESit – description of elementary situations in the form of a set of semantic relations R,A,B. Ex.: ENTRY = export; VAL = agent,A1,C; ob,A2,C; end_point,A3,C; ESit = 1. belongs_to,A2,A1 / SEMF1 = organisation; 2. loc,A1,A2 / SEMF1 = space, state; 3. belongs_to,A2,A3; 4. loc,A3,A2. Further on, elementary situations are referred to as ES1, ES2, etc.

PRECED – the elementary situation preceding the main Sit (C); POST – the elementary situation following the main Sit. Ex.: ENTRY = export; PRECED = ES1 or ES2; POST = ES3 or ES4.
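The SIT zone of "export" quoted above could thus be stored as follows (a sketch; the layout is an assumption, while the relation triples and the ES ordering come from the example):

export_sit = {
    "ESit": {
        "ES1": ("belongs_to", "A2", "A1"),   # if SemF1 = organisation
        "ES2": ("loc",        "A1", "A2"),   # if SemF1 = space, state
        "ES3": ("belongs_to", "A2", "A3"),
        "ES4": ("loc",        "A3", "A2"),
    },
    "PRECED": ["ES1", "ES2"],   # ES1 or ES2 precedes the main situation C
    "POST":   ["ES3", "ES4"],   # ES3 or ES4 follows it
}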

In the zone PRAGM the important fields are: WEIGHT – the initial semantic importance of the lexeme. Ex.: war 5, start 4, nice 3, etc.

EVENT – the event (the main situation C denoted by the word and/or one of its actants with the greatest informational weight, which may become the nucleus of some event in the indicated domain), e.g. EVENT = A3 (actant A3 of C is declared the centre of the TF to be built). Ex.: ENTRY = to help; VAL = agent,A1,C; addressee,A2,C; content,A3,C; reason,A4,C; EVENT = A3.
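A sketch of how EVENT could steer the construction of a textual fact: the actant named in EVENT becomes the nucleus of the TF. The WEIGHT value for "to help" and the selection function below are hypothetical; only the VAL and EVENT data are quoted from the entry above.

help_entry = {
    "ENTRY": "to help",
    "VAL": [("agent", "A1", "C"), ("addressee", "A2", "C"),
            ("content", "A3", "C"), ("reason", "A4", "C")],
    "WEIGHT": 4,      # assumed value (cf. war 5, start 4, nice 3)
    "EVENT": "A3",    # the content of the help is the centre of the TF
}

def tf_nucleus(entry, fillers):
    """Return the word chosen as nucleus of the TF for this entry."""
    return fillers.get(entry["EVENT"], entry["ENTRY"])

print(tf_nucleus(help_entry, {"A1": "the UN", "A3": "food supplies"}))
# -> 'food supplies'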

INFER – a standard inference in the form of a production rule: if (SIT1), then (SIT2); if (SIT2), then NON (SIT3);

PRESUP – presupposition (names of situations already introduced in some field, or formulated by the lexicologist, that are indispensable for C to be true);

EVAL – evaluation ("+", "-", "0" or "?", the latter signifying that the evaluation depends upon certain conditions), e.g. ENTRY = export; EVAL = ? / (A2) (the evaluation of C is inherited from A2): EVAL(to export drugs) = "-" (bad).

LOG – more complicated situations that characterise the logic of the event and are formulated in terms of SIT and production rules.
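The following sketch shows how the INFER and EVAL fields could be interpreted as simple production rules (a single forward-chaining pass over positive conclusions only; negative conclusions NON(SIT) are omitted for brevity, and the functions are our illustrative assumptions):

def infer(known_sits, rules):
    """rules: list of (condition, conclusion) pairs over SIT names."""
    derived = set(known_sits)
    for cond, concl in rules:
        if cond in derived:
            derived.add(concl)
    return derived

def eval_entry(entry_eval, actant_eval):
    # EVAL = "?" means the evaluation is inherited from the indicated actant:
    # EVAL(to export drugs) = "-" because drugs (A2) are evaluated as "-".
    return actant_eval if entry_eval == "?" else entry_eval

print(infer({"SIT1"}, [("SIT1", "SIT2")]))   # -> {'SIT1', 'SIT2'}
print(eval_entry("?", "-"))                  # -> '-'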

 

6. Implementations

The first version of the semantic dictionary (French-to-Russian) was part of the semantic analysis in the MT system FRAP; then the Russian-to-English version was developed and included in the text understanding system "POLIText". The structure of RUSLAN reflects the philosophy and the levels of representation adopted in the information-linguistic model (ILM) described in (Leontyeva 2000, 2001). Some modifications were made to include a part of that dictionary (the Russian and English parts separately) in a Russian-to-English MT system (Sokirko 2001). A similar Russian part of the system was used as the main instrument of semantic interpretation of texts in the Internet version (www.aot.ru). The Bulgarian Academy of Sciences has begun work on a simplified Bulgarian version of the dictionary. At the moment RUSLAN-1 is being implemented and populated at the Research Computing Center of Moscow State University; it contains about 10,000 semantic entries. It may be called from any program dealing with the syntactic, semantic and pragmatic analysis of Russian texts, so it has the status of a reusable dictionary resource.

 

Acknowledgements

Our thanks go to everybody who participated in advancing the dictionary. Describing the Russian semantic dictionary and entering it into the database is rather labour-intensive work. It is being done by a team of 2-5 lexicologists from two humanities universities and two academic institutes. From time to time the work has been supported by various organisations and foundations (among them the MacArthur Foundation, the Soros Foundation, the Center of Information Research, and two Russian foundations). At the moment the dictionary work is supported by the Russian State Humanitarian Foundation (RGNF: 01-04-16252a).

 

References

Apresjan Yu.D., Boguslavsky I.M., Iomdin L.L. et al. (1992)  Lingvistichesky processor dl'a slozhnyh informacionnyh sistem.- M.: Nauka.

Boguslavskij I.M., Tsinman L.L. (1991) Semantics in a linguistic processor. Computers and Artificial Intelligence, No. 3.

Hovy E.F. (1988)  On the study of text planning and realization  AAAI workshop on text planning. St.Paul.

Hutchins W.J. (1986)  Machine Translation: Past, Present, Future. - England.

Iordanskaja L.N., Polguere A. (1988) Semantic processing for text generation. Proc. of the International Computer Science Conference, Hong Kong.

Leontyeva N.N., Kudryashova I.M., Sokolova E.G. (1979) Semanticheskaja slovarnaja statya v sisteme FRAP.  PGPAEPL. In-t russkogo jazyka AN SSSR. Vyp. 121. M.

Leontyeva N. N. (1992) Textual Facts as Units of Coherent Text Semantic Analysis  International Workshop on the Meaning-Text Theory. Ed. Karen Haenelt and Leo Wanner. Darmstadt. July.

Leontyeva N.N. (2000, 2001) K teorii avtomaticheskogo ponimanija jestestvennyh textov. Part 1 (2000): Informacionno-Lingvisticheskaja Model'. Part 2 (2001): Semanticheskije slovari: sostav, struktura, metodika sozdanija. Izd. MGU, M.

Machine Translation and Applied Linguistics (1987). Problems related to the development of automatic translation systems. Issue 271. Moscow Maurice Thorez State Institute of Foreign Languages, M.

McKeown Kathleen R. (1985) Discourse Strategies for Generating Natural-Language Text.  Artificial Intelligence 27.

Mel’c’uk Igor and Zholkovsky Alexander (1984)  Explanatory Combinatorial Dictionary of Modern Russian  Wiener Slawistischer Almanach, Vienna.

Nirenburg S. (1989) Knowledge-Based Machine Translation. Machine Translation, Vol. 4, No. 1.

Papagaaij B.C. (1989) Word Expert Semantics. An Interlingual Knowledge-Based Approach. Distributed Language Translation, ed. Toon Witkam. Dordrecht: Reverton.

Pustejovsky James (1992) The Generative Lexicon. Computational Linguistics, Vol. 17, No. 4.

Shalyapina Z.M. (1974) Anglo-russky mnogoaspektny avtomatichesky slovar' (ARMAS). Mashinny perevod i prikladnaja lingvistika, Vyp. 17. M.: MGPIIJA.

Sokirko A.V. (2001) Semanticheskije slovari v avtomaticheskoj obrabotke texta. Avtoref. kand. dissertation. Moskva.