LFastqC: A lossless non-reference-based FASTQ compressor


Authors: Sultan Al Yami aff001; Chun-Hsi Huang aff001
Author affiliations: Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, United States of America aff001; Computer Science and Information System, Najran University, Najran, Saudi Arabia aff002
Published in: PLoS ONE 14(11)
Category: Research Article
doi: 10.1371/journal.pone.0224806

Abstract

The cost-effectiveness of next-generation sequencing (NGS) has advanced genomic research, which now routinely generates large amounts of raw data that often require dedicated infrastructure, such as data centers, for storage and transmission. NGS data are highly redundant and must be compressed efficiently to reduce the cost of storage space and transmission bandwidth. To address these issues, we present LFastqC, a lossless, non-reference-based FASTQ compression algorithm that improves on the LFQC tool. We compare LFastqC with several state-of-the-art compressors, and the results indicate that it achieves better compression ratios on important datasets such as LS454, PacBio, and MinION. Moreover, LFastqC compresses and decompresses faster than LFQC, which was previously the top-performing compression algorithm for the LS454 dataset. LFastqC is freely available at https://github.uconn.edu/sya12005/LFastqC.
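
To make the terms "lossless", "non-reference-based", and "compression ratio" concrete, the following is a minimal sketch and not the LFastqC pipeline, which this abstract does not describe. It assumes a toy FASTQ sample with a bare `+` separator line, splits the records into identifier, sequence, and quality streams (an illustrative design choice), and compresses each stream with `bz2` from the Python standard library. All function names and the sample data are hypothetical.

```python
import bz2

# Hypothetical two-record FASTQ sample; real files hold millions of records.
# Assumption: the third line of every record is a bare "+".
FASTQ = (
    "@read1\nACGTACGTAC\n+\nIIIIIHHHHG\n"
    "@read2\nACGTTCGTAC\n+\nIIIIHHHHGG\n"
)

def split_streams(text):
    """Split 4-line FASTQ records into identifier, sequence, and quality streams."""
    lines = text.splitlines()
    if not lines or len(lines) % 4 != 0:
        raise ValueError("expected a non-empty input with 4 lines per record")
    ids, seqs, quals = lines[0::4], lines[1::4], lines[3::4]
    return ["\n".join(stream) for stream in (ids, seqs, quals)]

def join_streams(ids, seqs, quals):
    """Rebuild the FASTQ text from the three streams."""
    records = zip(ids.split("\n"), seqs.split("\n"), quals.split("\n"))
    return "".join(f"{i}\n{s}\n+\n{q}\n" for i, s, q in records)

def compress(text):
    """Compress each stream on its own; no reference genome is consulted."""
    return [bz2.compress(s.encode("ascii")) for s in split_streams(text)]

def decompress(blobs):
    """Invert compress(): decompress each stream and reassemble the records."""
    return join_streams(*(bz2.decompress(b).decode("ascii") for b in blobs))

def compression_ratio(text, blobs):
    """Original size divided by total compressed size (higher is better)."""
    return len(text.encode("ascii")) / sum(len(b) for b in blobs)

if __name__ == "__main__":
    blobs = compress(FASTQ)
    # Lossless means the round trip reproduces the input exactly.
    print("lossless:", decompress(blobs) == FASTQ)
    print("ratio:", compression_ratio(FASTQ, blobs))
```

On an input this small the ratio can fall below 1 because of the compressor's fixed header overhead; the measure only becomes meaningful on realistically sized files.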

Keywords:

Algorithms – Arithmetic – Compression – Data management – Next-generation sequencing – Nucleotide sequencing – Spring – Data compression



Published in the journal

PLOS One


2019, Issue 11