JUCS - Journal of Universal Computer Science 8(2): 153-172, doi: 10.3217/jucs-008-02-0153

Finding Median Partitions Using Information-Theoretical-Based Genetic Algorithms

‡ University of Massachusetts at Boston, Department of Mathematics and Computer Science, Boston, Massachusetts, United States of America§ University of Massachusetts Boston, Boston, United States of America

Corresponding author: Dana Cristofor ( dana@cs.umb.edu ) © Dana Cristofor, Dan Simovici. This article is freely available under the J.UCS Open Content License. Citation:
Cristofor D, Simovici D (2002) Finding Median Partitions Using Information-Theoretical-Based Genetic Algorithms. JUCS - Journal of Universal Computer Science 8(2): 153-172. https://doi.org/10.3217/jucs-008-02-0153 |

Abstract

In a database with categorical attributes, each attribute defines a partition whose classes can be regarded as natural clusters of rows. In this paper we focus on finding a partition of the rows of a given database, that is as close as possible to the partitions associated to each attribute. We evaluate the closeness of two partitions by using a generalization of the classical conditional entropy. From this perspective, we wish to construct a partition (referred to as the median partition) such that the sum of the dissimilarities between this partition and all the partitions determined by the attributes of the database is minimal. Then, the problem of finding the median partition is an optimization problem, over the space of all partitions of the rows of the database, for which we give an approximative solution. To search more e#ciently the large space of possible partitions we use a genetic algorithm where the partitions are represented by chromosomes. Our genetic algorithm obtains better clustering results than the classical k-means algorithm. 1.) C. S. Calude, K. Salomaa, S. Yu (eds.). Advances and Trends in Automata and Formal Languages. A Collection of Papers in Honour of the 60th Birthday of Helmut Jürgensen.

Keywords

categorical attributes, clustering, genetic algorithm, median partition, partitioning, Shannon entropy, Gini index