SparkGA2: Production-quality memory-efficient Apache Spark based genome analysis framework

Authors: Hamid Mushtaq aff001; Nauman Ahmed aff001; Zaid Al-Ars aff001
Author affiliations: Quantum and Computer Engineering, Delft University of Technology, Delft, The Netherlands aff001
Published in: PLoS ONE 14(12)
Category: Research Article
doi: 10.1371/journal.pone.0224784


Due to the rapid decrease in the cost of next-generation sequencing (NGS), interest has increased in using NGS data to diagnose genetic diseases. However, the data generated by NGS technology is usually on the order of hundreds of gigabytes per experiment, so efficient and scalable programs are required to analyze it quickly. This paper presents SparkGA2, a memory-efficient, production-quality framework for high-performance DNA analysis in the cloud, which scales with the available computational resources as the number of nodes increases. Our framework uses Apache Spark’s ability to cache data in memory to speed up processing, while also allowing the user to run the framework on systems with less memory at the cost of slightly lower performance. To manage the memory footprint, we implement on-the-fly compression of intermediate data, which reduces memory requirements by up to 3x. Our framework also uses a streaming approach that gradually streams input data while processing is taking place. This makes our framework faster than other state-of-the-art approaches while at the same time allowing users to adapt it to run on clusters with less memory. Compared to the state of the art, SparkGA2 is up to 22% faster on a large big data cluster of 67 nodes and up to 9% faster on a smaller cluster of 6 nodes. Including the streaming solution, where data pre-processing is considered, SparkGA2 is 51% faster on a 6-node cluster. The source code of SparkGA2 is publicly available at
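The memory trade-off behind compressing intermediate data can be illustrated with a minimal sketch. It is not the SparkGA2 implementation: it assumes plain Python with the standard `zlib` module in place of Spark, and the helper names `compress_chunk` and `decompress_chunk` as well as the SAM-like records are hypothetical, made up for illustration only.

```python
import zlib


def compress_chunk(records):
    """Join text records and compress them into a single bytes object."""
    raw = "\n".join(records).encode("utf-8")
    # Level 1 favours speed over compression ratio (an assumption made
    # for this sketch, not a setting taken from the paper).
    return zlib.compress(raw, 1)


def decompress_chunk(blob):
    """Recover the original list of records from a compressed chunk."""
    return zlib.decompress(blob).decode("utf-8").split("\n")


# Hypothetical intermediate data: SAM-like alignment lines.
records = [
    f"read{i}\t0\tchr1\t{1000 + i}\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
    for i in range(1000)
]

raw_size = len("\n".join(records).encode("utf-8"))
blob = compress_chunk(records)       # what would be held in memory
restored = decompress_chunk(blob)    # what the next stage would read

print(f"raw: {raw_size} bytes, compressed: {len(blob)} bytes")
```

Holding only `blob` between stages trades some CPU time for a smaller memory footprint, which mirrors the kind of trade-off the abstract describes; the ratio achieved on real data depends on the data and the codec.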

Keywords:

Computational pipelines – Data compression – Data processing – Genome analysis – Chromosome mapping – Memory – Next-generation sequencing – DNA analysis


Article published in the journal


2019, Issue 12