Predicting breast cancer risk using personal health data and machine learning models

Authors: Gigi F. Stark, Gregory R. Hart, Bradley J. Nartowt, Jun Deng
Author affiliations: Department of Therapeutic Radiology, Yale University, New Haven, CT, United States of America
Published in: PLoS ONE 14(12)
Category: Research Article
doi: 10.1371/journal.pone.0226765


Among women, breast cancer is a leading cause of death. Breast cancer risk predictions can inform screening and preventative actions. Previous work found that adding inputs to the widely used Gail model improved its ability to predict breast cancer risk. However, these models used simple statistical architectures, and the additional inputs were derived from costly and/or invasive procedures. By contrast, we developed machine learning models that used highly accessible personal health data to predict five-year breast cancer risk. We created two sets of machine learning models: one using only the Gail model inputs, and one using both the Gail model inputs and additional personal health data relevant to breast cancer risk. For both sets of inputs, six machine learning models were trained and evaluated on the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial data set. We quantified each model’s performance with the area under the receiver operating characteristic curve (AUC). Since this data set has a small percentage of positive breast cancer cases, we also reported sensitivity, specificity, and precision. We used DeLong tests (p < 0.05) to compare the testing data set performance of each machine learning model to that of the Breast Cancer Risk Prediction Tool (BCRAT), an implementation of the Gail model. None of the machine learning models with only BCRAT inputs were significantly stronger than the BCRAT. However, the logistic regression, linear discriminant analysis, and neural network models with the broader set of inputs were all significantly stronger than the BCRAT. These results suggest that relative to the BCRAT, additional easy-to-obtain personal health inputs can improve five-year breast cancer risk prediction.
Our models could be used as non-invasive and cost-effective risk stratification tools to increase early breast cancer detection and prevention, motivating both immediate actions like screening and long-term preventative measures such as hormone replacement therapy and chemoprevention.
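
The evaluation described in the abstract (AUC, plus sensitivity, specificity, and precision, plus a DeLong comparison of two models scored on the same test cases) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the choice of scikit-learn and SciPy is an assumption, and the helper names `summarize` and `delong_test`, the column split, and the prevalence-based threshold are all hypothetical.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


def summarize(y_true, scores, threshold):
    """AUC plus threshold-dependent sensitivity, specificity, and precision."""
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, scores),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
    }


def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for two correlated AUCs on the same cases."""
    is_pos = np.asarray(y_true).astype(bool)
    aucs, v10, v01 = [], [], []
    for scores in (scores_a, scores_b):
        scores = np.asarray(scores, dtype=float)
        diff = scores[is_pos][:, None] - scores[~is_pos][None, :]
        psi = (diff > 0) + 0.5 * (diff == 0)  # 1 if positive outranks negative, 0.5 on ties
        aucs.append(psi.mean())
        v10.append(psi.mean(axis=1))  # one structural component per positive case
        v01.append(psi.mean(axis=0))  # one structural component per negative case
    s10, s01 = np.cov(np.vstack(v10)), np.cov(np.vstack(v01))
    m, n = is_pos.sum(), (~is_pos).sum()
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n)
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs[0], aucs[1], 2 * norm.sf(abs(z))


# Synthetic, imbalanced stand-in data (roughly 5% positives); not the trial data.
X, y = make_classification(n_samples=4000, n_features=12, n_informative=6,
                           weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical split: the first 5 columns stand in for the narrower input set.
narrow = LogisticRegression(max_iter=1000).fit(X_train[:, :5], y_train)
broad = LinearDiscriminantAnalysis().fit(X_train, y_train)
p_narrow = narrow.predict_proba(X_test[:, :5])[:, 1]
p_broad = broad.predict_proba(X_test)[:, 1]

threshold = y_train.mean()  # assumed cut-off: training-set prevalence
for name, p in (("narrow", p_narrow), ("broad", p_broad)):
    print(name, summarize(y_test, p, threshold))

auc_broad, auc_narrow, p_value = delong_test(y_test, p_broad, p_narrow)
print(f"DeLong p-value: {p_value:.4f}")
```

AUC does not depend on a threshold, whereas sensitivity, specificity, and precision do. With few positive cases, reporting all four gives a fuller picture than AUC alone. The DeLong test accounts for the correlation that arises because both models are scored on the same test cases.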

Keywords:

Breast cancer – Cancer screening – Hispanic people – Linear discriminant analysis – Machine learning – Neural networks – Screening guidelines – Support vector machines


Article published in the journal issue:

2019, Issue 12