Pictures copyrighted

J. Bezdek - c-Means Clustering, with applications to image processing, very large data, and medical computing

  • Fuzzy Models 101. This talk begins with a discussion of uncertainty in the real world, and how it is modeled in science. I discuss Zadeh's 1965 paper using the set-theoretic description of fuzzy models. The mathematical and philosophical differences between fuzzy and probabilistic models are discussed. Membership functions are on display, and are seen to be the atomic units of fuzzy models. They are easy to get – in pattern recognition.
  • Cluster Analysis 101. This section covers material on label vectors, partition sets, and measures of dissimilarity that have been particularly successful in real applications. I characterize the three canonical problems of clustering: tendency assessment (does the data have cluster substructure?); clustering (how do we find partitions of the data?); and validation (are the partitions we find accurate and/or useful?). Then I identify the four types of models used in clustering: partition only (single linkage); prototype only (self-organizing maps); partition and prototypes (hard, fuzzy and possibilistic c-means); and (partition, prototype, other parameter) models (EM algorithm for Gaussian mixture decomposition). I will spend a few minutes discussing classifier design, which is often confused with clustering in the open literature for many reasons, not the least of which is that numerical taxonomists have always referred to clustering as “classification”.
  • The c-Means and EM clustering models. I will give a detailed account of the basic models and algorithms for hard, fuzzy, possibilistic c-means; and likewise for Gaussian mixture decomposition with MLE by the EM Algorithm. Several examples highlight similarities and differences of these four models, and I will discuss the use of these models for elementary segmentation of medical images. This section also includes a brief history of the c-means and GMD-EM clustering models.
  • Approximations to FCM. There have been many ideas for acceleration of FCM advanced over the years. This section discusses eight methods that can be used to speed up clustering with FCM when the entire data set is loadable. Included are AFCM, mrFCM, brFCM and a method for acceleration that depends on mounting the algorithm on a GPU instead of a conventional computer. Reported improvements in speed range from 2:1 to 100:1. Many of these tricks are applicable to EM clustering.
  • Clustering with FCM and EM in VL Data. This last section covers two types of approaches for (approximate) FCM clustering in VL data. First I will discuss two methods that are based on incremental, distributed clustering (spFCM and oFCM). oFCM is used for streaming data and on-line clustering. The second set of methods considers three approaches to approximating FCM clusters for VL data by a much different method – viz., sampling followed by non-iterative extension. This is a very general technique that can be used for EM and many other algorithms that are not discussed in this talk. The three algorithms are eFFCM (efficient fast FCM for image data); geFFCM (for feature vector data); and eNERF (for relational data). We will most likely be out of time by now, so geFFCM will likely get the short straw. I have some nice medical images to show for the distributed methods, but only Gaussian mixtures to exemplify the extension techniques.

Remark: I don’t expect to cover all of this in two hours, so some of the material may not be covered as advertised. But I will try !