Original document: http://www.adass.org/adass/proceedings/adass98/leej/ (last modified Sat Jul 17 00:46:34 1999)
Controlled indexing vocabularies overcome the variation in authors' natural language by providing standardized labels for concepts. They also permit searchers to limit their queries to the most important ideas or topics in a document. The mixture of different descriptor vocabularies in ADS defeats the standardization goal, and the merging of the abstract and keyword indexes limits the search precision function of the subject indexing. Descriptors representing identical concepts (or their closest counterparts) can stand in several different relationships to each other. For example, counterpart terms may be synonyms (e.g., Andromeda Galaxy and Galaxies: Individual Messier Number: M31). A descriptor in one vocabulary may also be a pre-coordinated combination of terms in the other vocabulary (e.g., Microwave Background Radiation vs. Background Radiation and Microwaves).
An ongoing project at the University of Illinois investigates sources of evidence to support the automatic and/or computer-assisted reconciliation of the heterogeneous indexing in ADS. Two sources of evidence have been investigated: lexical resemblance between descriptors and consistent assignment of descriptors from different vocabularies to the same documents. Use of lexical resemblance evidence is discussed in an earlier paper (Dubin 1998).
We developed a spreading activation model, similar to those employed for modeling human associative memory (Collins & Loftus 1975). In our model, the network is composed of three layers: an input term layer, document layer, and output term layer. The activation of terms in the input layer is spread through the network to the connected documents and from there to the output terms. As a result, this model produces a list of terms with their activation levels representing the degree of relatedness to the input term(s) (Lee 1998).
Calculation of the activation level employs a simple weighting rule. The weight assigned to the link between a document and a term is determined by the number of connections and the direction of activation. We apply the principle of conservation of activation, which guarantees that the sum of the input activation equals the sum of the output activation. The activation received by a document from input terms is computed as follows:
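The equations themselves are not preserved in this copy. A form consistent with the symbol definitions and with the conservation principle stated above would be the following; it is a reconstruction under that assumption, not necessarily the authors' exact notation, and the normalization constraints on the weights are inferred rather than quoted:

```latex
D_j = \sum_i w_{ij}\, S_i ,
\qquad
T_k = \sum_j w_{jk}\, D_j ,
\qquad \text{with} \quad
\sum_j w_{ij} = 1 , \quad \sum_k w_{jk} = 1 .
```

Under these constraints the total activation is unchanged at each step, that is, $\sum_k T_k = \sum_j D_j = \sum_i S_i$.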
In these formulas, $S_i$ is the activation of input term $i$; $D_j$ is the activation of document $j$; $T_k$ is the activation of output term $k$; $w_{ij}$ is the weight between input term $i$ and document $j$; and $w_{jk}$ is the weight between document $j$ and output term $k$.
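The two-step propagation can be illustrated with a minimal Python sketch. It assumes uniform weights: each input term divides its activation equally among its documents, and each document divides its activation equally among its output terms. This is one simple rule that satisfies conservation of activation, not necessarily the rule used in the study. The function name, the dictionary layout, and the toy data are hypothetical.

```python
from collections import defaultdict


def spread_activation(input_activation, docs_by_input_term, output_terms_by_doc):
    """Propagate activation: input terms -> documents -> output terms.

    Assumed weighting (not taken from the paper): w_ij = 1 / (number of
    documents linked to input term i) and w_jk = 1 / (number of output
    terms linked to document j).
    """
    doc_activation = defaultdict(float)
    for term, s_i in input_activation.items():
        docs = docs_by_input_term.get(term, [])
        for doc in docs:
            doc_activation[doc] += s_i / len(docs)

    term_activation = defaultdict(float)
    for doc, d_j in doc_activation.items():
        targets = output_terms_by_doc.get(doc, [])
        # A document with no output terms would lose its activation here;
        # in a co-indexed collection every document has at least one.
        for target in targets:
            term_activation[target] += d_j / len(targets)

    return dict(doc_activation), dict(term_activation)


# Toy data: one input term attached to two hypothetical documents.
docs_by_input_term = {"Andromeda Galaxy": ["doc1", "doc2"]}
output_terms_by_doc = {
    "doc1": ["Galaxies: Individual Messier Number: M31"],
    "doc2": ["Galaxies: Individual Messier Number: M31", "Galaxies: Spiral"],
}
documents, terms = spread_activation(
    {"Andromeda Galaxy": 1.0}, docs_by_input_term, output_terms_by_doc
)
```

In this toy case the output activations sum to the input activation of 1.0, and the counterpart term shared by both documents receives the larger share.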
Experimental materials included two sets of documents, one indexed by 10,200 ApJ terms, and the other by 3,335 STI terms. The ApJ set contained 39,366 documents, and the STI set 22,139 documents. Among these, 14,956 documents were identified as co-indexed. The merged network representation included 4,120 STI term nodes in one term layer, 14,956 document nodes in the middle layer, and 2,305 ApJ term nodes in the other term layer. Two spreading activation models were constructed according to the direction of activation (from source term to target term): ApJ → STI and STI → ApJ.
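As a rough illustration of how such a three-layer network could be assembled, the sketch below keeps only the documents present in both indexes and builds the links for one direction of activation. The function name, the document identifiers, and the term labels are hypothetical, and each index is assumed to be a mapping from document identifier to a list of descriptors.

```python
def build_merged_network(source_index, target_index):
    """Build the three layers for one direction (source terms -> target terms).

    Both arguments map a document identifier to its list of descriptors.
    Only documents indexed in both vocabularies are kept.
    """
    co_indexed = sorted(set(source_index) & set(target_index))

    # Source term layer -> document layer.
    docs_by_source_term = {}
    for doc in co_indexed:
        for term in source_index[doc]:
            docs_by_source_term.setdefault(term, []).append(doc)

    # Document layer -> target term layer.
    target_terms_by_doc = {doc: list(target_index[doc]) for doc in co_indexed}

    return co_indexed, docs_by_source_term, target_terms_by_doc


# Hypothetical toy indexes; only doc2 and doc3 are co-indexed.
sti_index = {"doc1": ["a"], "doc2": ["a", "b"], "doc3": ["c"]}
apj_index = {"doc2": ["x"], "doc3": ["x", "y"], "doc4": ["z"]}
co_indexed, sti_to_docs, doc_to_apj = build_merged_network(sti_index, apj_index)
```

Swapping the two arguments yields the network for the opposite direction of activation.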
We adapted the so-called "Mexican-Hat" function for our cutoff criterion. The Mexican-Hat is the second derivative of the Gaussian curve, and has been successfully used for vision processing, especially edge detection (Charniak & McDermott 1987). The set of output activation levels was convolved with the second derivative of the Gaussian in order to find the point where the slope of the activation value distribution drops most dramatically. Only terms above this cutoff point were selected as mapping terms. The average number of terms above the cutoff was 1.9 out of 135 activated terms for STI → ApJ, and 1.6 out of 354 activated terms for ApJ → STI.
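A minimal sketch of a cutoff of this kind follows. Several details are assumptions rather than facts from the paper: the activation levels are ranked in descending order, the kernel width `sigma` is 1.0, the ends of the ranked list are padded by repeating the edge values, and the cutoff is placed at the largest upward jump in the filter response, which is where the response crosses zero at the sharpest drop. The function name is hypothetical.

```python
import math


def mexican_hat_cutoff(activations, sigma=1.0):
    """Return how many top-ranked activations lie above the sharpest drop."""
    ranked = sorted(activations, reverse=True)
    n = len(ranked)
    if n < 2:
        return n

    # Second derivative of a Gaussian, up to a positive scale factor.
    radius = math.ceil(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [
        (k * k / sigma**2 - 1.0) * math.exp(-k * k / (2.0 * sigma**2))
        for k in offsets
    ]
    # Force the truncated kernel to sum to zero so flat runs give no response.
    mean = sum(kernel) / len(kernel)
    kernel = [w - mean for w in kernel]

    def value(i):
        # Repeat the first and last values beyond the ends of the list.
        return ranked[min(max(i, 0), n - 1)]

    # The kernel is symmetric, so this correlation equals a convolution.
    response = [
        sum(w * value(i + k) for w, k in zip(kernel, offsets)) for i in range(n)
    ]

    # At a sharp drop the response swings from negative to positive.
    jumps = [response[i + 1] - response[i] for i in range(n - 1)]
    return jumps.index(max(jumps)) + 1


levels = [0.9, 0.85, 0.1, 0.08, 0.07, 0.05, 0.04, 0.03, 0.02, 0.01]
selected = mexican_hat_cutoff(levels)  # two terms stand clear of the rest
```

With real activation lists the choice of `sigma` trades sensitivity to small gaps against robustness to noise, so it would need tuning against expert judgments.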
Term mappings identified by the spreading activation model include a variety of term-to-term relationships. They range from simple spelling variants to complicated semantic factoring. We were encouraged to observe that the model identified many of the same connections uncovered in our lexical resemblance study, although the current model employs no lexical evidence. Examples of relationships apprehended with the network included:
We are evaluating the usefulness of the spreading activation output as evidence for vocabulary merging. We plan a user study in which expert astronomers evaluate the connections suggested by the model. We will follow up with a user evaluation test, assessing the impact of the merged vocabularies on search precision. We are also evaluating the scalability of the model with larger databases.
We are developing a visualization tool to help users understand the complex nature of term relationships. It will show the term mapping structure among the participating vocabularies through a graphical interface. A directed graph will be used to represent the direction of each term relationship.
Charniak, E. & McDermott, D. V. 1987, Introduction to Artificial Intelligence (Reading: Addison-Wesley), 101
Collins, A. M. & Loftus, E. F. 1975, Psych. Rev., 82, 407
Dubin, D. S. 1998, in ASP Conf. Ser., Vol. 153, Library and Information Services in Astronomy III, ed. U. Grothkopf, H. Andernach, S. Stevens-Rayburn, & M. Gomez (San Francisco: ASP), 77
Eichhorn, G., Accomazzi, A., Grant, C. S., Kurtz, M. J., & Murray, S. S. 1998, in ASP Conf. Ser., Vol. 145, Astronomical Data Analysis Software and Systems VII, ed. R. Albrecht, R. N. Hook, & H. A. Bushouse (San Francisco: ASP), 378
Lee, J. 1998, A Theory of Spreading Activation for Vocabulary Merging (Grad. Sch. of Library and Information Sci., Univ. of Illinois, unpublished report) (Champaign: Univ. of Illinois)