JUCS - Journal of Universal Computer Science 25(4): 334-360, doi: 10.3217/jucs-025-04-0334

Data-driven Feature Selection Methods for Text Classification: an Empirical Evaluation

Rogerio C. P. Fragoso^‡, Roberto H. W. Pinheiro^§, George Cavalcanti

‡ Universidade Federal de Pernambuco, Recife, Brazil
§ Universidade Federal do Cariri, Juazeiro do Norte, Brazil

Corresponding author: Rogerio C. P. Fragoso (rcpf@cin.ufpe.br)

This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY-ND 4.0). This license allows reusers to copy and distribute the material in any medium or format in unadapted form only, and only so long as attribution is given to the creator. The license allows for commercial use.

Citation: Fragoso RCP, Pinheiro RHW, Cavalcanti G (2019) Data-driven Feature Selection Methods for Text Classification: an Empirical Evaluation. JUCS - Journal of Universal Computer Science 25(4): 334-360. https://doi.org/10.3217/jucs-025-04-0334

Abstract

Dimensionality reduction is a crucial task in text classification. The most widely adopted strategy is feature selection using filter methods, but with this approach it is difficult to determine the best size for the final feature vector. At Least One FeaTure (ALOFT), Maximum f Features per Document (MFD), Maximum f Features per Document-Reduced (MFDR) and Class-dependent Maximum f Features per Document-Reduced (cMFDR) are feature selection methods that automatically define the number of features per corpus. However, MFD, MFDR, and cMFDR require a parameter that defines the number of features to be selected per document. Automatic Feature Subsets Analyzer (AFSA) is an auxiliary method that automates this configuration. In this paper, we evaluate the dimensionality reduction, classification performance and execution time of this family of methods: ALOFT, MFD, MFDR, cMFDR and AFSA. The experiments are conducted using three feature evaluation functions and twenty databases. MFD obtained the best results among the feature selection methods. In addition, the experiments showed that using AFSA does not significantly affect the classification performance or the dimensionality reduction rates of the feature selection methods, but considerably reduces their execution times.
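
The per-document selection idea described above can be illustrated with a minimal sketch. It assumes, going only by the method names and the description in this abstract, that each document contributes its f best-scoring terms to the final feature vector, so that the vector size follows from the corpus rather than from a user-chosen global cutoff. The function name, the toy scores (standing in for a feature evaluation function) and the tie-breaking rule are hypothetical and are not the authors' implementation.

```python
from typing import Dict, List, Set


def select_per_document(docs: List[Set[str]],
                        scores: Dict[str, float],
                        f: int = 1) -> List[str]:
    """Return the union of the f best-scoring terms of every document.

    Assumption: `scores` holds one precomputed value per term, produced by
    some feature evaluation function; unknown terms score 0.0.
    """
    selected: Set[str] = set()
    for terms in docs:
        # Rank this document's terms by descending score; ties are broken
        # alphabetically so the result is deterministic (an arbitrary choice).
        ranked = sorted(terms, key=lambda t: (-scores.get(t, 0.0), t))
        selected.update(ranked[:f])
    return sorted(selected)


# Toy corpus: each document is represented by the set of terms it contains.
docs = [{"goal", "match", "team"}, {"vote", "team", "party"}, {"match", "vote"}]
# Hypothetical term scores.
scores = {"goal": 0.9, "match": 0.5, "team": 0.1, "vote": 0.8, "party": 0.7}

print(select_per_document(docs, scores, f=1))
print(select_per_document(docs, scores, f=2))
```

With f = 1 every non-empty document keeps at least one of its terms, and the number of selected features is never larger than the number of documents. Raising f enlarges the selected set, which is why a per-document parameter still has to be chosen in this sketch.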

Keywords

text classification, feature selection