

Data Clustering: Theory, Algorithms, and Applications, Second Edition
D**T
Useful Overview of Algorithms, But Needs a More Systemic Framework
This is a useful compendium of clustering methods for a variety of data types, with numerous measures of similarity and many examples of algorithms. The ultimate emphasis is on the algorithms, down to implementations in MATLAB or C++.
However, this book is short on useful theoretical frameworks, reflecting more the efforts of practitioners from various fields than those of applied mathematicians, despite the SIAM imprint. At times it seems that the authors were just writing quick overviews of different papers, sometimes changing notation, with typographical errors or other lapses, without trying to unify the different methods by developing a common framework.
I developed a new clustering algorithm for a particular application to proportional representation in voting and was surprised that none of the algorithms described in this book encompassed my method. This is despite the fact that most of my concepts have been used previously by others in other contexts. For example, I use fuzzy sets, learned directly from Bezdek many years ago, but not as described by the k-means or c-means formulas presented in chapter 8, even though the k-means centroiding idea is easily generalized to encompass my method.
A good general framework for clustering algorithms would be that of optimization theory, with both discrete and continuous aspects. Yet the authors do not appear to be well versed in this theory. For example, instead of using the phrase “objective function” they use “validity indices” and survey only a few specific formulas, rather than describing a framework that would give practitioners better guidance for constructing an objective function well suited to the problem at hand.
In my case, I wanted clusters (= voting blocs) that are not too small, reasonably compact, with some but not too much overlap permitted, and which include the bulk of the population, permitting some voters to be outside any cluster or at least without full membership in the clusters. I also wanted an objective function with a fairly smooth dependence on the data and on parameters of the algorithm, without sudden jumps. I was able to do this, but this book would not have helped.
Another important practical aspect of clustering is not covered at all: how to handle very large data sets. In scientific computing a common paradigm is to start with “discretization”: you take continuous or nearly continuous data and first organize it into bundles based on similarity. This is itself a problem of partition-type clustering. But at this stage the emphasis is on speed of processing, not optimality, seeking a radical reduction in problem size with minimal loss of information. How best to do it is very problem dependent, but an obvious place to start is k-means partitioning for a value of k that is fairly large yet still much smaller than the original number of data points. Then apply a more optimal procedure to these k data points. With categorical data, sorting procedures on the categories may be used to get even better discretizations. In my case the data points are ballots where voters rank candidates, so I can sort ballots by the highest ranked candidates, using k-means to represent the lower rankings by a centroid. Note that the resulting data points have different weights (they represent different numbers of voters), meaning that the follow-on algorithm will be for a graph with weighted vertices as well as weighted edges (= similarity, such as correlation). However, this framework is not described in the book.
Another puzzling omission is the lack of discussion of good methods for initializing iterative algorithms, despite the acknowledgement on p. 164 that k-means is very dependent on the initialization of the center vectors. In my case I simply use a variety of crude techniques to initialize, then build into my iterative solver a procedure that merges strongly overlapping clusters and deletes very small clusters. Then my objective function allows me to rank the results. Thus “quick and dirty” clustering schemes do have an important role, even in the context of more optimal algorithms.
In conclusion, this book provides fairly readable snapshots of clustering techniques in action, but would greatly benefit from a more systemic approach.
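For readers who want to see the two-stage reduction mentioned above in concrete form (a coarse k-means pass with a fairly large k, followed by a second pass on the weighted prototypes), here is a minimal sketch. It is not the reviewer's algorithm: it assumes plain numeric vectors with squared Euclidean distance, uses weighted k-means as a stand-in for the "more optimal procedure", and the names `weighted_kmeans` and `reduce_then_cluster` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Lloyd's algorithm where each point carries a weight (e.g. a voter count)."""
    rng = random.Random(seed)
    # Crude initialization (an assumption of this sketch): k sampled data points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the weighted mean of its members.
        for j in range(k):
            members = [(p, w) for p, w, lab in zip(points, weights, labels)
                       if lab == j]
            total = sum(w for _, w in members)
            if total > 0:  # an empty cluster keeps its previous center
                dim = len(centers[j])
                centers[j] = [sum(w * p[d] for p, w in members) / total
                              for d in range(dim)]
    return centers, labels


def reduce_then_cluster(points, k_coarse, k_final, seed=0):
    """Stage 1: coarse partition into prototypes. Stage 2: cluster the prototypes."""
    unit = [1.0] * len(points)
    protos, labels = weighted_kmeans(points, unit, k_coarse, seed=seed)
    # Each prototype's weight is the number of original points it represents.
    counts = [0.0] * k_coarse
    for lab in labels:
        counts[lab] += 1.0
    # Drop prototypes that ended up representing no points.
    kept = [(c, n) for c, n in zip(protos, counts) if n > 0]
    protos = [c for c, _ in kept]
    counts = [n for _, n in kept]
    # The second stage sees only the weighted prototypes, not the raw data.
    centers, _ = weighted_kmeans(protos, counts, k_final, seed=seed)
    return protos, counts, centers
```

Calling `reduce_then_cluster(points, k_coarse, k_final)` with `k_coarse` much smaller than `len(points)` returns the prototypes, their weights, and the final centers. Because each prototype is the mean of the points it represents, the weighted prototypes preserve the overall mean of the data. The second stage could be replaced by any method that accepts weighted points.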
A**R
Five Stars
Like this one. Great for students!
A**R
Five Stars
Great books for students!