A graph-based algorithm for RNA-seq data normalization
Authors:
Diem-Trang Tran aff001; Aditya Bhaskara aff001; Balagurunathan Kuberan aff002; Matthew Might aff004
Author affiliations:
aff001: School of Computing, University of Utah, Salt Lake City, Utah, United States of America
aff002: Department of Medicinal Chemistry, University of Utah, Salt Lake City, Utah, United States of America
aff003: Department of Biology, University of Utah, Salt Lake City, Utah, United States of America
aff004: Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, Birmingham, Alabama, United States of America
Published in:
PLoS ONE 15(1)
Category:
Research Article
doi: 10.1371/journal.pone.0227760
Abstract
RNA sequencing (RNA-seq) has garnered much attention in recent years as a tool for characterizing and understanding various biological systems. However, gaining insights from a large number of RNA-seq experiments collectively remains a major challenge because of the normalization problem. Normalization is difficult because of an inherent circularity: RNA-seq data must be normalized before any pattern of differential (or non-differential) expression can be ascertained, yet prior knowledge of non-differential transcripts is crucial to the normalization process. Some methods have successfully overcome this problem by assuming that most transcripts are not differentially expressed. However, as RNA-seq profiles become more abundant and heterogeneous, this assumption fails to hold, leading to erroneous normalization. We present a normalization procedure that relies neither on this assumption nor on prior knowledge about the reference transcripts. The algorithm is based on a graph constructed from intrinsic correlations among RNA-seq transcripts and seeks to identify a set of densely connected vertices as references. Applied to our synthesized validation data, the algorithm recovered the reference transcripts with high precision, resulting in high-quality normalization. On a realistic data set from the ENCODE project, it gave good results and finished in a reasonable time. These preliminary results suggest that it may be possible to break the long-persisting circularity problem in RNA-seq normalization.
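The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea, not the authors' implementation. It assumes that edges are defined by thresholding the Pearson correlation of log abundances, that the dense vertex set is found by a simple greedy peeling heuristic, and that samples are scaled by the geometric mean of the selected references. The function names, the parameters `corr_threshold` and `min_degree_frac`, and the synthetic data are all hypothetical.

```python
import numpy as np


def find_reference_transcripts(counts, corr_threshold=0.9, min_degree_frac=0.8):
    """Return indices of a densely connected vertex set in a correlation graph.

    counts: array of shape (n_transcripts, n_samples), strictly positive.
    The threshold and the peeling rule are illustrative assumptions.
    """
    log_counts = np.log(counts)
    # Vertices are transcripts; an edge joins two transcripts whose log
    # abundances are strongly correlated across samples.
    corr = np.corrcoef(log_counts)
    adjacency = corr >= corr_threshold
    np.fill_diagonal(adjacency, False)

    active = np.ones(adjacency.shape[0], dtype=bool)
    # Greedy peeling: repeatedly drop the lowest-degree vertex until every
    # remaining vertex is adjacent to at least min_degree_frac of the others.
    while active.sum() > 1:
        idx = np.flatnonzero(active)
        degrees = adjacency[np.ix_(idx, idx)].sum(axis=1)
        if degrees.min() >= min_degree_frac * (len(idx) - 1):
            break
        active[idx[np.argmin(degrees)]] = False
    return np.flatnonzero(active)


def normalize_by_references(counts, reference_idx):
    """Scale each sample by the geometric mean of its reference transcripts."""
    size_factors = np.exp(np.log(counts[reference_idx]).mean(axis=0))
    return counts / size_factors


# Synthetic demo (hypothetical data): the first 30 transcripts are constant
# up to a per-sample technical factor; the other 70 vary independently.
rng = np.random.default_rng(0)
n_ref, n_other, n_samples = 30, 70, 40
log_scale = np.linspace(-1.5, 1.5, n_samples)  # per-sample technical factor
base = rng.uniform(3.0, 8.0, size=(n_ref + n_other, 1))
noise = rng.normal(0.0, 0.05, size=(n_ref + n_other, n_samples))
noise[n_ref:] = rng.normal(0.0, 2.0, size=(n_other, n_samples))
counts = np.exp(base + log_scale + noise)

refs = find_reference_transcripts(counts)
normalized = normalize_by_references(counts, refs)
```

In this toy setting the reference transcripts share only the per-sample technical factor, so they correlate strongly with one another and form the dense part of the graph. Real count data would also need handling for zeros (for example a pseudocount) before taking logarithms, which this sketch omits.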
Keywords:
Algorithms – Clustering algorithms – Gene expression – Gene pool – RNA sequencing – Sequence alignment – Signal transduction – Transcriptome analysis
Article published in the journal
PLOS One
2020, Issue 1