Double triage to identify poorly annotated genes in maize: The missing link in community curation


Authors: Marcela K. Tello-Ruiz aff001; Cristina F. Marco aff003; Fei-Man Hsu aff004; Rajdeep S. Khangura aff005; Pengfei Qiao aff006; Sirjan Sapkota aff007; Michelle C. Stitzer aff008; Rachael Wasikowski aff009; Hao Wu aff010; Junpeng Zhan aff011; Kapeel Chougule aff001; Lindsay C. Barone aff003; Cornel Ghiban aff003; Demitri Muna aff001; Andrew C. Olson aff001; Liya Wang aff001; Doreen Ware aff001; David A. Micklos aff003
Author affiliations: Plant Biology Program, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America aff001; Department of Biological Sciences, State University of New York at Old Westbury, Old Westbury, New York, United States of America aff002; DNA Learning Center, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America aff003; Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan aff004; Department of Biochemistry, Purdue University, West Lafayette, Indiana, United States of America aff005; Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York, United States of America aff006; Department of Plant and Environmental Sciences, Clemson University, Clemson, South Carolina, United States of America aff007; Department of Plant Sciences and Center for Population Biology, University of California Davis, Davis, California, United States of America aff008; Department of Biological Sciences, University of Toledo, Toledo, Ohio, United States of America aff009; Genetics, Development & Cell Biology Department, Iowa State University, Ames, Iowa, United States of America aff010; School of Plant Sciences, University of Arizona, Tucson, Arizona, United States of America aff011; Donald Danforth Plant Science Center, St. Louis, Missouri, United States of America aff012; USDA, Agricultural Research Service, Washington, D.C., United States of America aff013
Published in: PLoS ONE 14(10)
Category: Research Article
doi: 10.1371/journal.pone.0224086

Abstract

The sophistication of gene prediction algorithms and the abundance of RNA-based evidence for the maize genome may suggest that manual curation of gene models is no longer necessary. However, quality metrics generated by the MAKER-P gene annotation pipeline identified 17,225 of 130,330 (13%) protein-coding transcripts in the B73 Reference Genome V4 gene set with models of low concordance to available biological evidence. Working with eight graduate students, we used the Apollo annotation editor to curate 86 transcript models flagged by quality metrics and by a complementary method using the Gramene gene tree visualizer. All of the triaged models had significant errors, including missing or extra exons, non-canonical splice sites, and incorrect UTRs. A correct transcript model existed for about 60% of genes (or transcripts) flagged by quality metrics; we attribute this to the convention of elevating the transcript with the longest coding sequence (CDS) to the canonical, or first, position. The remaining 40% of flagged genes resulted in novel annotations and represent a manual curation space of about 10% of the maize genome (~4,000 protein-coding genes). MAKER-P metrics have a specificity of 100% and a sensitivity of 85%; the gene tree visualizer has a specificity of 100%. Together with the Apollo graphical editor, our double triage provides an infrastructure to support the community curation of eukaryotic genomes by scientists, students, and potentially even citizen scientists.
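As a reading aid for the percentages above, the following minimal Python sketch recomputes the flagged share from the two transcript counts given in the abstract and spells out the standard definitions of sensitivity and specificity. The function names and the confusion-matrix counts in the usage lines are hypothetical; the abstract reports only the resulting percentages, not the underlying counts.

```python
def fraction(part: int, whole: int) -> float:
    """Return part / whole, rejecting an empty denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly faulty models that the triage method flags."""
    return fraction(true_pos, true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of correct models that the triage method leaves unflagged."""
    return fraction(true_neg, true_neg + false_pos)


# Counts stated in the abstract: flagged transcripts vs. all protein-coding transcripts.
flagged_share = fraction(17_225, 130_330)
print(f"Flagged by quality metrics: {flagged_share:.1%}")

# Hypothetical counts, chosen only to show how values such as 85% and 100% arise.
print(f"Sensitivity: {sensitivity(true_pos=17, false_neg=3):.0%}")
print(f"Specificity: {specificity(true_neg=20, false_pos=0):.0%}")
```

Under these definitions, a specificity of 100% means that no correct model was flagged, while a sensitivity of 85% means that roughly one faulty model in seven would escape the metric-based triage, which is consistent with pairing it with a second, complementary method.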

Keywords:

Functional genomics – Genome annotation – Invertebrate genomics – Maize – Phylogenetic analysis – Plant genomics – Sequence alignment – Triage



Article published in the journal

PLOS One


2019, Issue 10