Factoring a 2 × 2 contingency table

Authors: Stanley Luck
Author affiliation: Science, Technology and Research Institute of Delaware, Wilmington, DE, United States of America
Published in: PLoS ONE 14(10)
Category: Research Article
doi: 10.1371/journal.pone.0224460


We show that a two-component proportional representation provides the necessary framework to account for the properties of a 2 × 2 contingency table. This corresponds to the factorization of the table as a product of a proportion matrix and a diagonal matrix of row or column sums. Row- and column-sum-invariant measures of proportional variation are obtained. Geometrically, these correspond to displacements of two point vectors in the standard one-simplex, which are reduced to a center-of-mass coordinate representation, (δ, μ) ∈ ℝ². Effect size measures, such as the odds ratio and relative risk, then correspond to different perspective functions for the mapping of (δ, μ) to ℝ¹. Furthermore, variations in δ and μ will be associated with different cost-benefit trade-offs for a given application. Therefore, pure mathematics alone does not specify a general form for the perspective function. This implies that the question of the merits of the odds ratio versus relative risk cannot be resolved in a general way. Expressions are obtained for the marginal sum dependence and the relations between various effect size measures, including the simple matching coefficient, odds ratio, relative risk, Yule’s Q, ϕ, and Goodman and Kruskal’s τ_{c|r}. We also show that Gini information gain (IG_G) is equivalent to ϕ² in the classification and regression tree (CART) algorithm. Consequently, IG_G can yield misleading results because of its dependence on marginal sums. Monte Carlo methods facilitate the detailed specification of stochastic effects in the data acquisition process and provide a practical way to estimate the confidence interval for an effect size.
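The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: the function names, the cell layout (rows are the two groups, columns are the two outcomes), the textbook Gini impurity 2p(1 − p), and the resampling model (independent binomial draws per row with fixed row sums) are all assumptions made here for illustration, and the article's own normalizations may differ.

```python
import random

def factor_by_rows(table):
    """Split a table into row sums and row proportions.

    table[i][j] == row_sums[i] * props[i][j], i.e. the table is the product
    of a diagonal row-sum matrix and a row-proportion matrix.
    """
    row_sums = [sum(row) for row in table]
    props = [[x / s for x in row] for row, s in zip(table, row_sums)]
    return row_sums, props

def effect_sizes(table):
    """Common effect sizes for [[a, b], [c, d]]; assumes no zero cells."""
    (a, b), (c, d) = table
    p1, p2 = a / (a + b), c / (c + d)
    cross = a * d - b * c
    return {
        "risk_difference": p1 - p2,
        "relative_risk": p1 / p2,
        "odds_ratio": (a * d) / (b * c),
        "yule_q": cross / (a * d + b * c),
        "phi": cross / ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5,
    }

def gini(p):
    """Gini impurity of a binary node with class proportion p."""
    return 2 * p * (1 - p)

def gini_gain(table):
    """Return (impurity decrease, parent impurity) for a split into the two rows."""
    (a, b), (c, d) = table
    n = a + b + c + d
    parent = gini((a + c) / n)
    children = (a + b) / n * gini(a / (a + b)) + (c + d) / n * gini(c / (c + d))
    return parent - children, parent

def risk_difference(table):
    """Difference of row proportions; defined whenever both row sums are positive."""
    (a, b), (c, d) = table
    return a / (a + b) - c / (c + d)

def monte_carlo_interval(table, stat, draws=1000, level=0.95, seed=0):
    """Percentile interval for stat under an assumed binomial model per row."""
    rng = random.Random(seed)
    row_sums, props = factor_by_rows(table)
    values = []
    for _draw in range(draws):
        resampled = []
        for n, row_props in zip(row_sums, props):
            # Binomial(n, p) draw as a sum of n Bernoulli trials.
            hits = sum(rng.random() < row_props[0] for _trial in range(n))
            resampled.append([hits, n - hits])
        values.append(stat(resampled))
    values.sort()
    tail = (1 - level) / 2
    lo = values[int(tail * draws)]
    hi = values[min(draws - 1, int((1 - tail) * draws))]
    return lo, hi

if __name__ == "__main__":
    example = [[30, 70], [10, 90]]  # made-up counts, not data from the article
    print(factor_by_rows(example))
    print(effect_sizes(example))
    gain, parent = gini_gain(example)
    print(gain, parent * effect_sizes(example)["phi"] ** 2)
    print(monte_carlo_interval(example, risk_difference))
```

Under these definitions the impurity decrease equals the parent impurity multiplied by ϕ², so within a single node, ranking candidate splits by Gini gain is the same as ranking them by ϕ². The parent impurity depends only on the marginal class proportions, which is one way to see the marginal-sum dependence mentioned in the abstract. The interval function is passed the risk difference here because, unlike the odds ratio, it stays defined when a resampled cell happens to be zero.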

Keywords:

Algorithms – Data acquisition – Decision trees – Linkage disequilibrium – Monte Carlo method – Normal distribution – Nursing homes – Contingency tables


The article was published in PLoS ONE, 2019, Issue 10