Exploring Information Extraction Resilience

Dawn Gregg

doi:10.3217/jucs-014-11-1911

JUCS - Journal of Universal Computer Science 14(11): 1911-1920, doi: 10.3217/jucs-014-11-1911

Dawn G. Gregg^‡

‡ University of Colorado, Denver, United States of America

Corresponding author: Dawn Gregg (dawn.gregg@cudenver.edu)

This article is freely available under the J.UCS Open Content License.

Citation: Gregg DG (2008) Exploring Information Extraction Resilience. JUCS - Journal of Universal Computer Science 14(11): 1911-1920. https://doi.org/10.3217/jucs-014-11-1911

Abstract

Developers face many challenges when attempting to reliably extract data from the Web. One of these challenges is the resilience of the extraction system to changes in the web pages from which information is being extracted. This article compares the resilience of information extraction systems that use position-based extraction with an ontology-based extraction system and with a system that combines position-based and ontology-based extraction. The findings demonstrate the advantages of a system that combines multiple extraction techniques, especially in environments where web sites change frequently and where data collection is conducted over an extended period of time.

Keywords

information extraction, semi-structured data, ontologies