A feature selection strategy for gene expression time series experiments with hidden Markov models

Authors: Roberto A. Cárdenas-Ovando (1); Edith A. Fernández-Figueroa (2); Héctor A. Rueda-Zárate (1); Julieta Noguez (1); Claudia Rangel-Escareño (2)
Affiliations: (1) School of Engineering and Sciences, Tecnológico de Monterrey, Mexico City, Mexico; (2) Computational Genomics Lab, Instituto Nacional de Medicina Genómica, Mexico City, Mexico
Published in: PLoS ONE 14(10)
Category: Research Article
doi: 10.1371/journal.pone.0223183


Studies conducted as time series can be far more informative than those that capture only a single moment in time. However, transcriptomic time courses typically have few time points, which creates a continuing need for methods capable of extracting information from experiments of this kind. We propose a feature selection algorithm embedded in a hidden Markov model and applied to gene expression time course data for either a single biological condition or multiple conditions. For multiple conditions, in a simple case-control study, features (genes) are selected under the assumption that control samples show no change over time, while the case group must show at least one change. The proposed model reduces the feature space according to a two-state hidden Markov model in which the two states represent change and no change in gene expression. Features are ranked according to three scores: the number of changes across time, the magnitude of those changes, and the quality of replicates, measured by how much they deviate from the mean. An important highlight is that this strategy overcomes the small-sample limitation common in transcriptome experiments through a process of data transformation and rearrangement. To test the method, we applied it to three publicly available data sets. The results show that the feature domain is reduced by up to 90%, leaving only a few relevant features, with findings consistent with those previously reported. Moreover, the strategy proved robust and stable, and it works on studies where sample size would otherwise be an issue. Hence, even with two biological replicates and/or three time points, the method works well.
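To make the idea more concrete, the following is a minimal Python sketch of how a two-state (change / no-change) hidden Markov model could be used to score and rank genes. It is not the authors' implementation: the abstract does not give the model parameters, the data transformation, or the exact score definitions, so the Gaussian emission parameters, the transition probability `stay`, the use of absolute differences between consecutive time-point means as observations, and all function names (`viterbi_two_state`, `score_gene`, `rank_genes`, `passes_case_control`) are assumptions made purely for illustration.

```python
import math

# NOTE: every numeric parameter here (emission means/sds, transition
# probability) is an illustrative assumption, not a value from the paper.

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi_two_state(obs, means=(0.0, 2.0), sds=(0.5, 1.0), stay=0.6):
    """Most likely state path for obs; state 0 = no change, state 1 = change."""
    log_trans = [[math.log(stay), math.log(1 - stay)],
                 [math.log(1 - stay), math.log(stay)]]
    # Uniform initial state distribution (assumed).
    score = [math.log(0.5) + gaussian_logpdf(obs[0], means[s], sds[s]) for s in (0, 1)]
    back = []
    for x in obs[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            prev = max((0, 1), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s]
                             + gaussian_logpdf(x, means[s], sds[s]))
        score = new_score
        back.append(pointers)
    # Backtrack from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def score_gene(replicates):
    """Return (number of changes, total magnitude of changes, replicate spread).

    replicates: list of replicate series, one expression value per time point.
    """
    n_rep, n_time = len(replicates), len(replicates[0])
    means = [sum(r[t] for r in replicates) / n_rep for t in range(n_time)]
    # Observations: absolute change of the mean between consecutive time points.
    diffs = [abs(means[t + 1] - means[t]) for t in range(n_time - 1)]
    path = viterbi_two_state(diffs)
    n_changes = sum(path)
    magnitude = sum(d for d, s in zip(diffs, path) if s == 1)
    # Replicate quality: mean absolute deviation from the per-time-point mean.
    spread = sum(abs(r[t] - means[t]) for r in replicates for t in range(n_time)) / (n_rep * n_time)
    return n_changes, magnitude, spread

def rank_genes(genes):
    """Rank by more changes, then larger magnitude, then tighter replicates."""
    scores = {name: score_gene(reps) for name, reps in genes.items()}
    return sorted(scores, key=lambda g: (-scores[g][0], -scores[g][1], scores[g][2]))

def passes_case_control(case_reps, control_reps):
    """Keep a gene only if the case changes at least once and the control never does."""
    return score_gene(case_reps)[0] >= 1 and score_gene(control_reps)[0] == 0

# Toy data: two replicates, four time points per gene (made-up values).
genes = {
    "flat": [[5.0, 5.1, 5.0, 4.9], [5.1, 5.0, 5.1, 5.0]],
    "rising": [[5.0, 7.5, 7.4, 10.0], [5.2, 7.3, 7.6, 9.8]],
}
ranking = rank_genes(genes)
print(ranking)
```

With these toy values the hypothetical gene `rising` ranks ahead of `flat`, because its decoded path contains change states while `flat` stays in the no-change state throughout. In practice the emission and transition parameters would be estimated from the data rather than fixed by hand as they are here.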

Keywords:

Algorithms – Computational pipelines – Diet – Gene expression – Hidden Markov models – Microarrays – Normal distribution – Time measurement


Article published in: PLoS ONE, 2019, Issue 10