JUCS - Journal of Universal Computer Science 8(2): 153-172, doi: 10.3217/jucs-008-02-0153
Finding Median Partitions Using Information-Theoretical-Based Genetic Algorithms
expand article infoDana Cristofor, Dan Simovici§
‡ University of Massachusetts at Boston, Department of Mathematics and Computer Science, Boston, Massachusetts, United States of America§ University of Massachusetts Boston, Boston, United States of America
Open Access
Abstract
In a database with categorical attributes, each attribute defines a partition whose classes can be regarded as natural clusters of rows. In this paper we focus on finding a partition of the rows of a given database, that is as close as possible to the partitions associated to each attribute. We evaluate the closeness of two partitions by using a generalization of the classical conditional entropy. From this perspective, we wish to construct a partition (referred to as the median partition) such that the sum of the dissimilarities between this partition and all the partitions determined by the attributes of the database is minimal. Then, the problem of finding the median partition is an optimization problem, over the space of all partitions of the rows of the database, for which we give an approximative solution. To search more e#ciently the large space of possible partitions we use a genetic algorithm where the partitions are represented by chromosomes. Our genetic algorithm obtains better clustering results than the classical k-means algorithm. 1.) C. S. Calude, K. Salomaa, S. Yu (eds.). Advances and Trends in Automata and Formal Languages. A Collection of Papers in Honour of the 60th Birthday of Helmut Jürgensen.
Keywords
categorical attributes, clustering, genetic algorithm, median partition, partitioning, Shannon entropy, Gini index