JUCS - Journal of Universal Computer Science 19(3): 383-405, doi: 10.3217/jucs-019-03-0383

Text Representation for Efficient Document Annotation

Christin Seifert^‡, Eva Ulbrich^§, Roman Kern^§, Michael Granitzer

‡ Passau University, Passau, Germany
§ Know-Center Graz, Graz, Austria

Corresponding author: Christin Seifert (christin.seifert@uni-passau.de)

This article is freely available under the J.UCS Open Content License.

Citation: Seifert C, Ulbrich E, Kern R, Granitzer M (2013) Text Representation for Efficient Document Annotation. JUCS - Journal of Universal Computer Science 19(3): 383-405. https://doi.org/10.3217/jucs-019-03-0383

Abstract

In text classification, the amount and quality of training data are crucial for classifier performance. Training data is generated by human labellers, a tedious and time-consuming task. To reduce the labelling time per document, we propose using condensed representations of text documents instead of the full text. These condensed representations are key sentences and key phrases, and they can be generated in a fully unsupervised way. We extended and evaluated the TextRank algorithm to automatically extract key sentences and key phrases. For representing key phrases, we propose a layout similar to a tag cloud. In a user study with 37 participants, we evaluated whether human labellers can label documents faster and equally accurately with these condensed representations. Our evaluation shows that users labelled tag clouds twice as fast as full-text documents and just as accurately. While further investigation of different classification tasks is necessary, this insight could potentially reduce the cost of labelling text documents.
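To illustrate the kind of unsupervised key-sentence extraction the abstract refers to, the following is a minimal sketch of plain TextRank-style sentence ranking. It is not the extended algorithm evaluated in the article. The function names, the naive sentence splitter, the set-based word overlap, and the parameter values (damping of 0.85, 50 iterations) are all assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on ., ! and ? (assumption; not the article's preprocessing)."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def tokenize(sentence):
    """Lower-cased set of alphabetic words (assumption: no stemming or stop words)."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Shared words, normalised by the log of both sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def rank_sentences(text, damping=0.85, iterations=50):
    """Return (sentence, score) pairs, highest score first."""
    sentences = split_sentences(text)
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Weighted, undirected sentence graph stored as a dense matrix.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update: each neighbour passes on a share of its score.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)


def key_sentences(text, k=2):
    """Top-k sentences as a condensed representation of the document."""
    return [sentence for sentence, _ in rank_sentences(text)[:k]]
```

Calling `key_sentences(document_text, k=3)` would return the three highest-ranked sentences. A sentence that shares no words with any other receives only the baseline score of `1 - damping`, so it ranks last.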

Keywords

document labelling, tag clouds, word clouds, text summarisation, data mining, supervised learning, TextRank