Improving the Performance of a Tagger Generator in an Information Extraction Application

José Troyano; Fernando Enríquez; Fermín Cruz; José Cañete-Valdeón; F. Ortega

doi:10.3217/jucs-013-09-1287

JUCS - Journal of Universal Computer Science 13(9): 1287-1299, doi: 10.3217/jucs-013-09-1287

José A. Troyano^‡, Fernando Enríquez^‡, Fermín Cruz^§, José M. Cañete-Valdeón^§, F. Javier Ortega^§

‡ University of Seville, Spain

§ University of Seville, Seville, Spain

Corresponding author: José Troyano (troyano@us.es)

This article is freely available under the J.UCS Open Content License.

Citation: Troyano JA, Enríquez F, Cruz F, Cañete-Valdeón JM, Ortega FJ (2007) Improving the Performance of a Tagger Generator in an Information Extraction Application. JUCS - Journal of Universal Computer Science 13(9): 1287-1299. https://doi.org/10.3217/jucs-013-09-1287

Abstract

In this paper we describe our experience extracting named entities from Spanish texts using stacking. Named Entity Extraction (NEE) is a subtask of Information Extraction that involves identifying the groups of words that make up the name of an entity and classifying these names into a set of predefined categories. Our approach is corpus-based: we use a re-trainable tagger generator to obtain a named entity extractor from a set of tagged examples. The main contribution of our work is that we obtain the systems needed in a stacking scheme without using any additional training material or tagger generators. Instead, we generate the variability that stacking requires by applying corpus transformations to the original training corpus. Once we have several versions of the training corpus, we generate several extractors and combine them by means of a machine learning algorithm. Experiments show that the combination of corpus transformation and stacking improves the performance of the tagger generator in this kind of natural language processing application. The best of our experiments achieves an improvement of more than six percentage points over the predefined baseline.
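The scheme summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the setup used in the paper: the toy corpus is invented, the two tag transformations (`to_io`, `to_boundary`) are hypothetical examples of a corpus transformation, `train_tagger` is a most-frequent-tag lookup standing in for a real tagger generator, and the meta-level combiner is a simple lookup table rather than a full machine learning algorithm.

```python
from collections import Counter, defaultdict

# Toy (word, tag) corpus with BIO tags. The data is invented for illustration.
TRAIN = [("Juan", "B-PER"), ("Pérez", "I-PER"), ("vive", "O"), ("en", "O"),
         ("Sevilla", "B-LOC"), ("y", "O"), ("trabaja", "O"), ("en", "O"),
         ("Madrid", "B-LOC")]
HELD_OUT = [("Juan", "B-PER"), ("vive", "O"), ("en", "O"), ("Madrid", "B-LOC")]


def to_io(tag):
    """Hypothetical transformation: replace the B-/I- prefix with I-."""
    return tag if tag == "O" else "I-" + tag[2:]


def to_boundary(tag):
    """Hypothetical transformation: keep the B/I/O position, drop the category."""
    return tag[0]


# The identity keeps the original corpus as one of the versions.
TRANSFORMS = [lambda tag: tag, to_io, to_boundary]


def train_tagger(pairs):
    """Stand-in tagger generator: most frequent tag per word, 'O' if unseen."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: table.get(word, "O")


def base_predictions(taggers, word):
    """Collect the output of every base tagger as one feature tuple."""
    return tuple(tagger(word) for tagger in taggers)


# One base tagger per transformed version of the same training corpus.
taggers = [train_tagger([(w, f(t)) for w, t in TRAIN]) for f in TRANSFORMS]

# Meta level: map each tuple of base outputs to its most frequent gold tag,
# estimated on held-out data that the base taggers were not trained on.
meta_counts = defaultdict(Counter)
for word, gold in HELD_OUT:
    meta_counts[base_predictions(taggers, word)][gold] += 1
meta = {key: c.most_common(1)[0][0] for key, c in meta_counts.items()}


def stacked_tag(word):
    """Combine the base taggers; fall back to the untransformed tagger."""
    preds = base_predictions(taggers, word)
    return meta.get(preds, preds[0])
```

The point of the sketch is that all three base taggers come from the same training material and the same generator. Only the version of the corpus differs, and the meta level learns how to reconcile their outputs.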

Keywords

named entity extraction, corpus transformation, system combination, stacking