urn:lsid:arphahub.com:pub:3dc5f44e-8666-58db-bc76-a455210e8891

JUCS - Journal of Universal Computer Science

ISSN: 0948-695X (print), 0948-6968 (online)

DOI: 10.3217/jucs-022-06-0760

Research Article

Categories: E.1 (Data Structures); H.0 (General); H.4 (Information Systems Applications); M.1 (Knowledge Engineering Methodologies)

A Proposal for Recommendation of Feature Selection Algorithm based on Data Set Characteristics

Saptarsi Goswami (1), Amlan Chakrabarti (2), Basabi Chakraborty (3)

(1) Institute of Engineering and Management, Kolkata, India
(2) Calcutta University, Kolkata, India
(3) Iwate Prefectural University, Takizawa, Japan

Corresponding author: Saptarsi Goswami (saptarsi007@gmail.com).

Volume 22, Issue 6, pages 760-781

Submitted: 30 November 2015; Accepted: 28 May 2016; Published: 1 June 2016

Saptarsi Goswami, Amlan Chakrabarti, Basabi Chakraborty

This article is freely available under the J.UCS Open Content License.

Abstract

Feature selection is an important prerequisite for any pattern recognition, machine learning, or data mining problem. Many feature subset selection algorithms have been developed to reduce the dimensionality of a data set and so achieve high recognition accuracy at low computational cost. However, an algorithm that works well on some data sets may perform poorly on others, and for any particular data set it is difficult to identify the most suitable algorithm without trial and error. It seems that the characteristics of a data set may affect the performance of a feature selection algorithm. In this work, data set characteristics are studied in order to recommend an appropriate feature selection algorithm for a particular data set. A new proposal based on intra-attribute relationships is introduced, together with a measure, MVS (multivariate score), that quantifies the correlation structure of a data set and groups data sets into several categories. The measure is used to group 63 publicly available benchmark data sets according to their characteristics. The performance of different feature selection algorithms on the different groups is then studied through simulation experiments to verify the relationship between data set characteristics and the feature selection algorithm. The effect of some other data set characteristics is also studied. Finally, a framework for recommending a suitable feature selection algorithm is outlined.