gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output


Authors: Juhana I. Kammonen ^aff001; Olli-Pekka Smolander ^aff001; Lars Paulin ^aff001; Pedro A. B. Pereira ^aff001; Pia Laine ^aff001; Patrik Koskinen ^aff001; Jukka Jernvall ^aff003; Petri Auvinen ^aff001
Author affiliations: DNA Sequencing and Genomics Laboratory, Institute of Biotechnology, University of Helsinki, Helsinki, Finland ^aff001; Department of Neurology, Helsinki University Hospital, Helsinki, Finland ^aff002; Evolutionary Phenomics Group, Institute of Biotechnology, University of Helsinki, Helsinki, Finland ^aff003
Published in: PLoS ONE 14(9)
Category: Research Article
doi: https://doi.org/10.1371/journal.pone.0216885

Abstract

Unknown sequences, or gaps, are present in many published genomes across public databases. Gap filling is an important finishing step in de novo genome assembly, especially for large genomes. The gap filling problem is nontrivial, and although many computational tools partially solve it, several have shortcomings in the reliability and correctness of their output, i.e. the gap-filled draft genome. SSPACE-LongRead is a scaffolding tool that uses long reads from multiple third-generation sequencing platforms to find links between contigs and combine them. The long reads potentially contain the sequence information needed to fill the gaps created during scaffolding, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher that processes SSPACE-LongRead output to fill gaps after scaffolding. gapFinisher is based on the controlled use of a previously published gap filling tool, FGAP, and works on all standard Linux/UNIX command lines. We compare the performance of gapFinisher against two other published gap filling tools, PBJelly and GMcloser. We conclude that gapFinisher can fill gaps in draft genomes quickly and reliably. In addition, the serial design of gapFinisher makes it scale well from prokaryote genomes to larger genomes with no increase in the computational footprint.
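To make the notion of a gap concrete, the following minimal sketch locates runs of unknown bases (N) in scaffold sequences read from FASTA-formatted text. It is an illustration only and is not part of gapFinisher, FGAP or SSPACE-LongRead; the helper names `parse_fasta` and `find_gaps` and the toy scaffolds are assumptions introduced here.

```python
import re
from typing import Dict, Iterable, List, Tuple

# A gap is assumed here to be any run of one or more N characters.
GAP_RUN = re.compile(r"[Nn]+")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA records into a {name: sequence} dictionary (hypothetical helper)."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # Use the first word of the header as the record name.
            name = header.split()[0] if header else f"record{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def find_gaps(sequence: str) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals of N runs."""
    return [match.span() for match in GAP_RUN.finditer(sequence)]


# Toy input: two made-up scaffolds, the first containing two gaps.
example = """>scaffold1
ACGTNNNNNACGT
TTNNAC
>scaffold2
ACGTACGT
""".splitlines()

scaffolds = parse_fasta(example)
gaps = {name: find_gaps(seq) for name, seq in scaffolds.items()}

for name, intervals in gaps.items():
    total = sum(end - start for start, end in intervals)
    print(f"{name}: {len(intervals)} gap(s), {total} unknown base(s)")
```

A gap filling tool aims to replace such intervals with real sequence, so a count like this, taken before and after filling, gives a simple view of how many gaps were closed.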

Keywords:

Research and analysis methods – Database and informatics methods – Bioinformatics – Sequence analysis – Sequence alignment – BLAST algorithm – Computational techniques – Computational pipelines – Biology and life sciences – Microbiology – Bacteriology – Bacterial genetics – Bacterial genomics – Microbial genomics – Genetics – Microbial genetics – Genomics – Genome analysis – Sequence assembly tools – Genomics statistics – Computational biology – Genomic libraries – Organisms – Eukaryota – Animals – Vertebrates – Amniotes – Mammals

