JUCS - Journal of Universal Computer Science 18(5): 623-649, doi: 10.3217/jucs-018-05-0623

Improving the Extraction of Text in PDFs by Simulating the Human Reading Order

Ismael Hasan^‡, Javier Parapar^§, Álvaro Barreiro^|

‡ University of A Coruña, A Coruña, Spain
§ University of A Coruña, A Coruña, Spain
| University of A Coruña, A Coruña, Spain

Corresponding author: Ismael Hasan (ihasan@udc.es)

This article is freely available under the J.UCS Open Content License.

Citation: Hasan I, Parapar J, Barreiro Á (2012) Improving the Extraction of Text in PDFs by Simulating the Human Reading Order. JUCS - Journal of Universal Computer Science 18(5): 623-649. https://doi.org/10.3217/jucs-018-05-0623

Abstract

Text preprocessing and segmentation are critical tasks in search and text mining applications. Because a huge number of documents are available exclusively in PDF format, most Data Mining (DM) and Information Retrieval (IR) systems must extract content from PDF files. This is sometimes a difficult task: the result of the extraction process is plain text, which should be returned in the same order in which a human would read the original PDF file. However, current PDF text extraction tools fail to meet this objective when working with complex, multi-column documents, such as official government bulletins containing legal information. In that setting, the extractor must return text that is both correct and correctly ordered: a legal article often refers to a previous article, so the articles should be presented in the right sequential order. To overcome these difficulties, we have designed a new method for extracting text from PDFs that simulates the human reading order. We evaluated our method and compared it against other PDF extraction tools and algorithms. The evaluation shows that our approach significantly outperforms the existing tools and algorithms.

Keywords

PDF, text preprocessing, text extraction, ordered text extraction, text mining, evaluation