A statistical approach to term extraction
Abstract
This paper argues in favor of a statistical approach to terminology extraction that is general to all languages but uses language-specific parameters. In contrast to many application-oriented terminology studies, which focus on a particular language and domain, this paper adopts some general principles about the statistical properties of terms and a method to obtain the corresponding language-specific parameters. This method is used for the automatic identification of terminology and is quantitatively evaluated in an empirical study of English medical terms. The proposal is theoretically and computationally simple and does not rely on resources such as linguistic or ontological knowledge. The algorithm learns to identify terms during a training phase in which it is shown examples of both terminological and non-terminological units. With these examples, the algorithm creates a model of the terminology that accounts for the frequency of lexical, morphological and syntactic elements of the terms in relation to the non-terminological vocabulary. The model is then used to identify new terminology in previously unseen text. The comparative evaluation shows that its performance is significantly higher than that of other well-known systems.
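The training idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the features or parameters, so the sketch assumes character trigrams as a crude stand-in for morphological elements and scores a candidate by the smoothed log-ratio of trigram frequencies in term examples versus non-term examples. The function names (`char_ngrams`, `train`, `score`) and the toy word lists are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(unit, n=3):
    """Character n-grams of a lowercased unit, with boundary markers."""
    padded = f"^{unit.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(terms, non_terms, n=3):
    """Return a log-ratio weight per n-gram (assumed model, add-one smoothing).

    Positive weights mark n-grams relatively more frequent in the term
    examples; negative weights mark n-grams more frequent in the non-terms.
    """
    term_counts = Counter(g for t in terms for g in char_ngrams(t, n))
    other_counts = Counter(g for t in non_terms for g in char_ngrams(t, n))
    vocab = set(term_counts) | set(other_counts)
    term_total = sum(term_counts.values()) + len(vocab)
    other_total = sum(other_counts.values()) + len(vocab)
    return {
        g: math.log((term_counts[g] + 1) / term_total)
        - math.log((other_counts[g] + 1) / other_total)
        for g in vocab
    }


def score(unit, model, n=3):
    """Mean n-gram weight of a candidate; n-grams unseen in training count as 0."""
    grams = char_ngrams(unit, n)
    if not grams:
        return 0.0
    return sum(model.get(g, 0.0) for g in grams) / len(grams)


# Toy training examples (illustrative only, not data from the paper).
terms = ["gastritis", "hepatitis", "nephritis", "dermatitis"]
non_terms = ["table", "window", "garden", "running"]
model = train(terms, non_terms)

# Unseen candidates: a score above 0 leans toward "term".
for candidate in ["arthritis", "warden"]:
    print(candidate, round(score(candidate, model), 3))
```

On this toy data, the unseen word "arthritis" receives a positive score because it shares the ending "-itis" with the term examples, while "warden" receives a negative score because it overlaps with "garden". A realistic system would need far more training data and, as the abstract indicates, lexical and syntactic evidence as well.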
The works published in this journal are subject to the following terms:
1. The Publications Services at the University of Murcia (the publisher) retains the property rights (copyright) of published works, and encourages and enables the reuse of the same under the license specified in item 2.
2. The works are published in the electronic edition of the journal under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license.
3. Conditions of self-archiving. Authors are encouraged to disseminate pre-print versions (drafts prior to assessment) and/or post-print versions (those reviewed and accepted for publication) of their papers before publication, because this promotes earlier distribution and may thus increase citations and circulation among the academic community.
RoMEO color: green