A graph-based algorithm for RNA-seq data normalization

Authors: Diem-Trang Tran [1]; Aditya Bhaskara [1]; Balagurunathan Kuberan [2]; Matthew Might [4]
Author affiliations: [1] School of Computing, University of Utah, Salt Lake City, Utah, United States of America; [2] Department of Medicinal Chemistry, University of Utah, Salt Lake City, Utah, United States of America; [3] Department of Biology, University of Utah, Salt Lake City, Utah, United States of America; [4] Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, Alabama, United States of America
Published in: PLoS ONE 15(1)
Category: Research Article
doi: 10.1371/journal.pone.0227760


RNA sequencing has garnered much attention in recent years as a tool for characterizing and understanding biological systems. However, gaining insights from a large number of RNA-seq experiments collectively remains a major challenge because of the normalization problem. Normalization is difficult because of an inherent circularity: RNA-seq data must be normalized before any pattern of differential (or non-differential) expression can be ascertained, yet prior knowledge of non-differential transcripts is crucial to the normalization process. Some methods have overcome this problem by assuming that most transcripts are not differentially expressed. However, as RNA-seq profiles become more abundant and heterogeneous, this assumption fails to hold, leading to erroneous normalization. We present a normalization procedure that relies neither on this assumption nor on prior knowledge of the reference transcripts. The algorithm is based on a graph constructed from intrinsic correlations among RNA-seq transcripts and seeks to identify a set of densely connected vertices as references. Applied to our synthesized validation data, the algorithm recovered the reference transcripts with high precision, resulting in high-quality normalization. On a realistic data set from the ENCODE project, it gave good results and finished in a reasonable time. These preliminary results suggest that it may be possible to break the long-persisting circularity problem in RNA-seq normalization.
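The general idea in the abstract (build a graph from correlations among transcripts, pick a densely connected vertex set as references, then normalize against those references) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the thresholded Pearson correlation, the greedy peeling heuristic for finding a dense subgraph, and the total-count scaling step are all assumptions chosen for brevity.

```python
import numpy as np


def correlation_graph(counts, threshold=0.9):
    """Boolean adjacency matrix over transcripts (rows of `counts`).

    Assumption: two transcripts are linked when the Pearson correlation
    of their counts across samples exceeds `threshold`.
    """
    corr = np.corrcoef(counts)
    adj = corr > threshold
    np.fill_diagonal(adj, False)  # no self-loops
    return adj


def greedy_dense_subgraph(adj):
    """Return indices of a dense vertex set found by greedy peeling.

    Assumption: a stand-in heuristic, not the paper's procedure.
    Repeatedly remove a minimum-degree vertex and keep the vertex set
    with the highest density (edges divided by vertices) seen so far.
    """
    alive = np.ones(adj.shape[0], dtype=bool)
    degree = adj.sum(axis=1).astype(float)
    n_alive = int(alive.sum())
    best_density, best_set = -1.0, alive.copy()
    while n_alive > 0:
        density = degree[alive].sum() / (2.0 * n_alive)
        if density > best_density:
            best_density, best_set = density, alive.copy()
        candidates = np.where(alive)[0]
        v = candidates[np.argmin(degree[candidates])]
        alive[v] = False
        degree[adj[v] & alive] -= 1  # surviving neighbours lose one edge
        degree[v] = 0.0
        n_alive -= 1
    return np.where(best_set)[0]


def scaling_factors(counts, ref_idx):
    """Per-sample factors from the summed counts of the reference set.

    Assumption: references are taken to reflect sequencing depth only,
    so their column totals (rescaled to mean 1) serve as size factors.
    """
    ref_totals = counts[ref_idx].sum(axis=0)
    return ref_totals / ref_totals.mean()
```

For a hypothetical transcripts-by-samples matrix `counts`, the sketch would be used as `adj = correlation_graph(counts)`, `ref_idx = greedy_dense_subgraph(adj)`, and `normalized = counts / scaling_factors(counts, ref_idx)`. Transcripts whose counts vary only with sequencing depth are mutually correlated and so tend to form a dense cluster, which is why no differential-expression call is needed beforehand.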

Keywords:

Algorithms – Clustering algorithms – Gene expression – Gene pool – RNA sequencing – Sequence alignment – Signal transduction – Transcriptome analysis



Article published in: 2020, Issue 1