Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data

Autoři: Osbert C. Zalay aff001
Působiště autorů: Division of Radiation Oncology, Department of Oncology, Queen’s University, Kingston, Ontario, Canada aff001
Vyšlo v časopise: PLoS ONE 15(1)
Kategorie: Research Article
doi: 10.1371/journal.pone.0227788


Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.

Klíčová slova:

Clustering algorithms – Data mining – Imaging techniques – Linkage analysis – Online encyclopedias – Statistical data – Statistical distributions – Optical comparators


1. Caliński T, Harabasz J. A dendrite method for cluster analysis. Commun Stat. 1974;3: 1–27. doi: 10.1080/03610927408827101

2. Davies DL, Bouldin DW. A Cluster Separation Measure. IEEE Trans Pattern Anal Mach Intell. 1979;PAMI-1: 224–227. doi: 10.1109/TPAMI.1979.4766909

3. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20: 53–65. doi: 10.1016/0377-0427(87)90125-7

4. Sugar CA, James GM. Finding the Number of Clusters in a Dataset. J Am Stat Assoc. 2003;98: 750–763. doi: 10.1198/016214503000000666

5. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B Stat Methodol. 2001;63: 411–423.

6. Flexa C, Santos R, Gomes W, Sales C, Costa JCWA. Mutual equidistant-scattering criterion: A new index for crisp clustering. Expert Syst Appl. 2019;128: 225–245. doi: 10.1016/j.eswa.2019.03.027

7. Frey BJ, Dueck D. Clustering by Passing Messages Between Data Points. Science. 2007;315: 972–976. doi: 10.1126/science.1136800 17218491

8. Yu X, Yu G, Wang J. Clustering cancer gene expression data by projective clustering ensemble. PLOS ONE. 2017;12: e0171429. doi: 10.1371/journal.pone.0171429 28234920

9. Ankerst M, Breunig MM, Kriegel H-P, Sander J. OPTICS: Ordering Points to Identify the Clustering Structure. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM; 1999. pp. 49–60.

10. Ester M, Kriegel H-P, Xu X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD’96 Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Portland, Oregon; 1996. pp. 226–231.

11. Fränti P. Efficiency of random swap clustering. J Big Data. 2018;5: 13. doi: 10.1186/s40537-018-0122-y

12. Rossignol M, Lagrange M, Cont A. Efficient similarity-based data clustering by optimal object to cluster reallocation. PLOS ONE. 2018;13: e0197450. doi: 10.1371/journal.pone.0197450 29856755

13. Murtagh F, Contreras P. Algorithms for hierarchical clustering: an overview. Wiley Interdiscip Rev Data Min Knowl Discov. 2012;2: 86–97. doi: 10.1002/widm.53

14. Hayman E, Caputo B, Fritz M, Eklundh J-O. On the Significance of Real-World Conditions for Material Classification. In: Pajdla T, Matas J, editors. Computer Vision—ECCV 2004. Springer Berlin Heidelberg; 2004. pp. 253–266.

15. Lazebnik S, Schmid C, Ponce J. A sparse texture representation using local affine regions. IEEE Trans Pattern Anal Mach Intell. 2005;27: 1265–1278. doi: 10.1109/TPAMI.2005.151 16119265

16. Dana KJ, van Ginneken B, Nayar SK, Koenderink JJ. Reflectance and Texture of Real-world Surfaces. ACM Trans Graph. 1999;18: 1–34. doi: 10.1145/300776.300778

17. Mikolov T, Chen K, Corrado G, Dean J. Efficient Estimation of Word Representations in Vector Space. ArXiv13013781 Cs. 2013 [cited 7 Jul 2019].

18. Bezdek JC, Pal NR. Some new indexes of cluster validity. IEEE Trans Syst Man Cybern Part B Cybern. 1998;28: 301–315. doi: 10.1109/3477.678624 18255949

19. Forina M, Leardi R, Armanino C, Lanteri S. PARVUS: An extendable package of programs for data exploration, classification and correlation. J Chemom. 1990;4: 191–193. doi: 10.1002/cem.1180040210

20. Dua D, Graff C. UCI machine learning repository. [cited 16 Nov 2019].

21. Hopkins M, Reeber E, Forman G, Suermondt J. Spambase data set. Hewlett-Packard Labs;

22. Aggarwal C, Hinneburg A, Keim DA. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In: Van den Bussche J, Vianu V, editors. Database Theory—ICDT 2001. Springer Berlin Heidelberg;

23. Blashfield RK. Mixture model tests of cluster analysis: Accuracy of four agglomerative hierarchical methods. Psychol Bull. 1976;83: 377–388. doi: 10.1037/0033-2909.83.3.377

24. Møller MF. A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw. 1993;6: 525–533. doi: 10.1016/S0893-6080(05)80056-5

25. Zalay O. Hierarchical linkage regression—Github repository. [cited 21 Nov 2019].

26. Haralick RM. Statistical and structural approaches to texture. Proc IEEE. 1979;67: 786–804. doi: 10.1109/PROC.1979.11328

27. Ojala T, Pietikäinen M, Mäenpää T. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans Pattern Anal Mach Intell. 2002;24: 971–987. doi: 10.1109/TPAMI.2002.1017623

28. vsmlib—Python library for vector space models—vsmlib documentation. [cited 7 Jul 2019].

29. Gui T, Zhang Q, Huang H, Peng M, Huang X. Part-of-Speech Tagging for Twitter with Adversarial Neural Networks. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics; 2017. pp. 2411–2420.

30. Levandowsky M, Winter D. Distance between Sets. Nature. 1971;234: 34–35. doi: 10.1038/234034a0

31. van Rijsbergen CJ. Information Retrieval. 2nd ed. London: Butterworths; 1979.

32. Bewick V, Cheek L, Ball J. Statistics review 10: Further nonparametric methods. Crit Care. 2004;8: 196–199. doi: 10.1186/cc2857 15153238

33. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. ArXiv14126980 Cs. 2017 [cited 20 Nov 2019].

34. DiCiccio TJ, Efron B. Bootstrap Confidence Intervals. Stat Sci. 1996;11: 189–212.

35. Liu Y, Li Z, Xiong H, Gao X, Wu J. Understanding of Internal Clustering Validation Measures. 2010 IEEE International Conference on Data Mining. 2010. pp. 911–916.

36. Müllner D. Modern hierarchical, agglomerative clustering algorithms. ArXiv11092378 Cs Stat. 2011 [cited 7 Jul 2019].

37. Chang D, Kantardzic M, Ouyang M. Hierarchical clustering with CUDA/GPU. in ISCA PDCCS. Online].; 2009. pp. 7–12.

38. Schenker N. Qualms about Bootstrap Confidence Intervals. J Am Stat Assoc. 1985;80: 360–361. doi: 10.1080/01621459.1985.10478123

Článek vyšel v časopise


2020 Číslo 1