Why Cohen’s Kappa should be avoided as performance measure in classification


Authors: Rosario Delgado aff001; Xavier-Andoni Tibau aff002
Author affiliations: Department of Mathematics, Universitat Autònoma de Barcelona, Campus de la UAB, Cerdanyola del Vallès, Spain aff001; Advanced Stochastic Modelling research group, Universitat Autònoma de Barcelona, Campus de la UAB, Cerdanyola del Vallès, Spain aff002
Published in: PLoS ONE 14(9)
Category: Research Article
doi: https://doi.org/10.1371/journal.pone.0222916

Abstract

We show that Cohen’s Kappa and the Matthews Correlation Coefficient (MCC), both widely used and well-tested measures of performance in multi-class classification, are correlated in most situations, although they can differ in others. Indeed, although the two coincide in the symmetric case, we consider several unbalanced situations in which Kappa exhibits undesired behaviour, i.e. a worse classifier gets a higher Kappa score, differing qualitatively from that of MCC. The debate about the incoherence in the behaviour of Kappa revolves around whether or not it is appropriate to use a relative metric, which makes its values difficult to interpret. We extend these concerns by showing that its pitfalls can go even further. Through experimentation, we present a novel approach to this topic. We carry out a comprehensive study that identifies a scenario in which the contradictory behaviour between MCC and Kappa emerges. Specifically, we find that when the entropy of the off-diagonal elements of the confusion matrix associated with a classifier decreases to zero, the discrepancy between Kappa and MCC rises, pointing to anomalous behaviour of the former. We believe that this finding disqualifies Kappa from being used in general as a performance measure to compare classifiers.
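
To make the quantities in the abstract concrete, the following is a minimal sketch, not the authors' code. It computes Kappa and MCC from a confusion matrix using the standard multi-class definitions (rows are true classes, columns are predicted classes). The function names, the convention of returning 0.0 when a denominator vanishes, and the choice of base-2 Shannon entropy over the normalized off-diagonal counts are all assumptions made for illustration; the abstract does not spell out the exact entropy definition used in the paper.

```python
import math


def _marginals(cm):
    """Return (total, correct, true_counts, pred_counts) for a square matrix."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]                        # row sums
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]   # column sums
    return total, correct, true_counts, pred_counts


def cohen_kappa(cm):
    """Cohen's Kappa: agreement corrected by the product of the marginals."""
    n, c, t, p = _marginals(cm)
    chance = sum(tk * pk for tk, pk in zip(t, p))
    denom = n * n - chance
    # Assumption: return 0.0 when the denominator is zero.
    return (n * c - chance) / denom if denom else 0.0


def mcc(cm):
    """Multi-class Matthews Correlation Coefficient."""
    n, c, t, p = _marginals(cm)
    cov = n * c - sum(tk * pk for tk, pk in zip(t, p))
    denom = math.sqrt((n * n - sum(pk * pk for pk in p))
                      * (n * n - sum(tk * tk for tk in t)))
    # Assumption: return 0.0 when the denominator is zero.
    return cov / denom if denom else 0.0


def off_diagonal_entropy(cm):
    """Shannon entropy (bits) of the normalized off-diagonal counts.

    Assumption: this is one plausible reading of "entropy of the
    elements out of the diagonal"; the paper may define it differently.
    """
    k = len(cm)
    errors = [cm[i][j] for i in range(k) for j in range(k)
              if i != j and cm[i][j] > 0]
    total = sum(errors)
    if total == 0:
        return 0.0
    return -sum((e / total) * math.log2(e / total) for e in errors)


# Illustrative matrices (made up for this sketch, not taken from the paper).
symmetric = [[5, 2], [2, 5]]     # errors spread evenly, matrix symmetric
concentrated = [[8, 2], [0, 4]]  # all errors fall in a single cell

for name, cm in [("symmetric", symmetric), ("concentrated", concentrated)]:
    print(name, cohen_kappa(cm), mcc(cm), off_diagonal_entropy(cm))
```

In the symmetric matrix the row and column sums are equal, so the two denominators coincide and Kappa equals MCC. In the second matrix all errors sit in one cell, the off-diagonal entropy is zero under the definition assumed here, and the two scores no longer agree.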

Keywords:

Entropy – Machine learning – Medicine and health sciences – Probability distribution – Protein structure prediction – Psychology – Statistical distributions – Covariance


Article published in

PLOS One


2019, Issue 9