JUCS - Journal of Universal Computer Science 22(6): 760-781, doi: 10.3217/jucs-022-06-0760
A Proposal for Recommendation of Feature Selection Algorithm based on Data Set Characteristics
Saptarsi Goswami‡, Amlan Chakrabarti§, Basabi Chakraborty|
‡ Institute of Engineering and Management, Kolkata, India
§ Calcutta University, Kolkata, India
| Iwate Prefectural University, Takizawa, Japan
Open Access
Feature selection is an important prerequisite of any pattern recognition, machine learning or data mining problem. Many feature subset selection algorithms have been developed to reduce the dimensionality of a data set in order to achieve high recognition accuracy at low computational cost. However, a given algorithm may work well on some data sets and perform poorly on others. For any particular data set, it is difficult to find the most suitable algorithm without some trial and error. It seems that the characteristics of the data set might affect the performance of a feature selection algorithm. In this work, data set characteristics are studied in order to recommend an appropriate feature selection algorithm for a particular data set. A new proposal in terms of intra-attribute relationship, together with a measure called MVS (multivariate score), is introduced to quantify the correlation structure of data sets and group them into several categories. The measure is used to group 63 publicly available benchmark data sets according to their characteristics. The performance of different feature selection algorithms on the different groups of data sets is then studied by simulation experiments to verify the relationship between data set characteristics and feature selection algorithm. The effect of some other data set characteristics has also been studied. Finally, a framework for recommending a suitable feature selection algorithm is outlined.
feature selection algorithm, data set characteristics, correlation structure, multivariate score