Gene expression based survival prediction for cancer patients—A topic modeling approach

Authors: Luke Kumar aff001; Russell Greiner aff001
Author affiliations: Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada aff001; Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada aff002
Published in: PLoS ONE 14(11)
Category: Research Article
doi: 10.1371/journal.pone.0224446


Cancer is one of the leading causes of death worldwide. Many believe that genomic data will enable us to better predict the survival time of cancer patients, which will lead to better, more personalized treatment options and patient care. Because standard survival prediction models have difficulty coping with the high dimensionality of gene expression data, many projects use dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from high-dimensional gene expression data. In topic modeling, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over words. Here, to accommodate the heterogeneity of a patient’s cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard Latent Dirichlet Allocation (LDA) model, for example to accommodate the real-valued expression values, leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using dLDA to learn these cancer-topics, we express each patient as a distribution over a small number of cancer-topics, and then use this low-dimensional “distribution vector” as input to a learning algorithm; here, we ran the recent survival prediction algorithm MTLR on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using r = 49,576 gene expression values from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than those of standard models, in terms of the standard Concordance measure.
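The Concordance measure used for this comparison can be illustrated with a minimal sketch. It assumes the common pairwise definition: among all pairs of patients whose ordering is known despite censoring, count the fraction for which the patient with the higher predicted risk died first. The function name `concordance_index`, the convention that a higher score means higher risk, and the choice to skip pairs with tied times are assumptions made for this illustration, not details taken from the article.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ordered correctly by risk.

    times:       observed time per patient (death or censoring time)
    events:      1 if the death was observed, 0 if the patient was censored
    risk_scores: model output; higher is assumed to mean shorter survival
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # The pair is comparable only if the earlier time is an observed
        # death; pairs with tied times are skipped in this simple sketch.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk died first: correct ordering
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: four patients, the third one is censored at time 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.8, 0.1]
toy_score = concordance_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ordering of the comparable pairs.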
We then validate this “dLDA+MTLR” approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values (here using the mRNAseq modality), and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent “D-calibrated” measure. These successes, in two different cancer types and expression modalities, demonstrate the generality and effectiveness of this approach. The dLDA+MTLR source code is available at
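The calibration check mentioned in this paragraph can be illustrated with a minimal sketch. It assumes the usual description of D-calibration: for each patient, take the model's predicted survival probability at that patient's observed time; for a calibrated model, these probabilities should be spread roughly uniformly over [0, 1]. The function names, the ten equal-width bins, and the rule that spreads a censored patient over the bins at or below its probability are assumptions made for illustration, not the authors' implementation.

```python
def d_calibration_bins(surv_at_time, events, n_bins=10):
    """Count (possibly fractional) patients per survival-probability bin.

    surv_at_time: predicted S_i(t_i), the survival probability the model
                  assigns to patient i at the patient's own observed time
    events:       1 if the death was observed, 0 if censored
    """
    width = 1.0 / n_bins
    counts = [0.0] * n_bins
    for s, observed in zip(surv_at_time, events):
        k = min(int(s * n_bins), n_bins - 1)  # index of the bin containing s
        if observed:
            counts[k] += 1.0
        elif s <= 0.0:
            counts[0] += 1.0
        else:
            # Censored: the probability at the unseen death time lies
            # somewhere in [0, s]; spread one unit of mass uniformly there.
            counts[k] += (s - k * width) / s
            for lower in range(k):
                counts[lower] += width / s
    return counts


def d_calibration_statistic(counts):
    """Pearson chi-squared statistic against a uniform spread over bins."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Toy example: ten uncensored patients, one per bin, so perfectly uniform.
uniform_probs = [0.05 + 0.1 * k for k in range(10)]
uniform_counts = d_calibration_bins(uniform_probs, [1] * 10)
uniform_stat = d_calibration_statistic(uniform_counts)
```

In practice the statistic would be compared against a chi-squared distribution with `n_bins - 1` degrees of freedom, and a model would be called D-calibrated when the test does not reject uniformity.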

Keywords:

Algorithms – Breast cancer – Gene expression – Machine learning algorithms – Microarrays – Principal component analysis – Subroutines






Article published in: PLoS ONE, 2019, Issue 11