HaTSPiL: A modular pipeline for high-throughput sequencing data analysis


Authors: Edoardo Morandi aff001;  Matteo Cereda aff002;  Danny Incarnato aff001;  Caterina Parlato aff002;  Giulia Basile aff002;  Francesca Anselmi aff001;  Andrea Lauria aff001;  Lisa Marie Simon aff001;  Isabelle Laurence Polignano aff001;  Francesca Arruga aff002;  Silvia Deaglio aff002;  Elisa Tirtei aff004;  Franca Fagioli aff004;  Salvatore Oliviero aff001
Author affiliations: Department of Life Sciences and System Biology, University of Turin, Turin, Italy aff001;  Italian Institute for Genomic Medicine (IIGM), Turin, Italy aff002;  Department of Medical Sciences, University of Turin, Turin, Italy aff003;  Paediatric Onco-Haematology, Stem Cell Transplantation and Cellular Therapy Division, City of Science and Health of Turin, Regina Margherita Children’s Hospital, Turin, Italy aff004
Published in: PLoS ONE 14(10)
Category: Research Article
doi: 10.1371/journal.pone.0222512

Abstract

Background

Next-generation sequencing (NGS) methods are widely adopted for a broad range of scientific purposes, from pure research to health-related studies. Decreasing costs per analysis have led to large amounts of generated data and, in turn, to improved software for the respective analyses. As a consequence, many approaches have been developed to chain different software tools into reliable and reproducible workflows. However, the wide range of NGS applications makes it challenging to manage many different workflows without losing reliability.

Methods

Here we present HaTSPiL, a high-throughput sequencing pipeline: a Python-powered CLI tool designed to handle different approaches to data analysis with a high level of reliability. The software relies on barcoding filenames with a human-readable naming convention that encodes all the sample information the software needs to choose workflows and parameters automatically. HaTSPiL is highly modular and customisable, allowing users to extend its features for any specific need.
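The idea of driving workflow selection from a barcoded filename can be illustrated with a minimal sketch. The naming convention below (`<project>-<sample>-<tissue><analyte>.fastq`), the `Barcode` class, the functions `parse_barcode` and `choose_workflow`, and the workflow names are all hypothetical and chosen only for illustration; they are not HaTSPiL's actual barcode format or API.

```python
import re
from dataclasses import dataclass

# Hypothetical convention (NOT the real HaTSPiL barcode):
#   <project>-<sample>-<tissue><analyte>.fastq[.gz]
#   tissue:  T = tumour, N = normal
#   analyte: D = DNA,    R = RNA
BARCODE_RE = re.compile(
    r"^(?P<project>[A-Za-z0-9]+)-"
    r"(?P<sample>[A-Za-z0-9]+)-"
    r"(?P<tissue>[TN])(?P<analyte>[DR])"
    r"\.fastq(\.gz)?$"
)

# Hypothetical mapping from analyte code to a workflow name.
WORKFLOWS = {"D": "dna_variant_calling", "R": "rna_expression"}


@dataclass(frozen=True)
class Barcode:
    project: str
    sample: str
    tissue: str
    analyte: str


def parse_barcode(filename: str) -> Barcode:
    """Extract sample metadata from a filename, or raise ValueError."""
    match = BARCODE_RE.match(filename)
    if match is None:
        raise ValueError(f"filename does not follow the convention: {filename!r}")
    return Barcode(
        project=match["project"],
        sample=match["sample"],
        tissue=match["tissue"],
        analyte=match["analyte"],
    )


def choose_workflow(barcode: Barcode) -> str:
    """Pick a workflow name based only on information in the barcode."""
    return WORKFLOWS[barcode.analyte]
```

Under these assumptions, a file named `PRJ1-S001-TD.fastq.gz` would be parsed as a tumour DNA sample and routed to the DNA workflow, while a filename that does not match the convention is rejected up front instead of being processed with guessed parameters.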

Conclusions

HaTSPiL is licensed as Free Software under the MIT license and is available at https://github.com/dodomorandi/hatspil.

Keywords:

Computer software – Next-generation sequencing – Programming languages – Research validity – Software design – Software tools – User interfaces – Mutational analysis



Article published in

PLOS One


2019, Issue 10