ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples

Authors: Ardi Tampuu aff001; Zurab Bzhalava aff002; Joakim Dillner aff002; Raul Vicente aff001
Author affiliations: Computational Neuroscience Lab, Institute of Computer Science, University of Tartu, Tartu, Estonia aff001; Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden aff002; Karolinska University Laboratory, Karolinska University Hospital, Stockholm, Sweden aff003
Published in: PLoS ONE 14(9)
Category: Research Article
doi: 10.1371/journal.pone.0222271


Despite its clinical importance, detecting highly divergent or as-yet-unknown viruses remains a major challenge. When human samples are sequenced, conventional alignment methods classify many assembled contigs as “unknown” because the sequences are not similar to known genomes. In this work, we developed ViraMiner, a deep learning-based method to identify viruses in various human biospecimens. ViraMiner contains two branches of Convolutional Neural Networks designed to detect both patterns and pattern frequencies in raw metagenomic contigs. The training dataset included sequences obtained from 19 metagenomic experiments, which were analyzed and labeled by BLAST. The model achieves significantly improved accuracy compared to other machine learning methods for viral genome classification. On 300 bp contigs, ViraMiner achieves an area under the ROC curve of 0.923. To our knowledge, this is the first machine learning methodology that can detect the presence of viral sequences among raw metagenomic contigs from diverse human samples. We suggest that the proposed model captures different types of information about genome composition and can be used as a recommendation system to further investigate sequences labeled as “unknown” by conventional alignment methods. Exploring these highly divergent viruses, in turn, can enhance our knowledge of infectious causes of disease.
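
The distinction between detecting a pattern and detecting a pattern frequency can be illustrated with a minimal, dependency-free Python sketch. It is not the ViraMiner implementation. It assumes, purely for illustration, that contigs are one-hot encoded, that one branch summarizes each convolutional filter by global max pooling (is the motif present anywhere?), and that the other uses global average pooling (how often does it occur?); the abstract itself does not specify these details. The function names, the bias value, and the hand-written `TAG` filter are hypothetical; in a trained network the filters would be learned from the labeled contigs.

```python
# Illustrative sketch only; names and the hand-written filter are assumptions.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(encoded, kernel, bias=0.0):
    """Slide one filter (k x 4 weights) along the sequence and apply ReLU.

    Returns one activation per window position (no padding).
    """
    k = len(kernel)
    activations = []
    for start in range(len(encoded) - k + 1):
        score = bias + sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(k)
            for j in range(4)
        )
        activations.append(max(0.0, score))
    return activations


def pattern_feature(activations):
    """Global max pooling: strongest response anywhere in the contig."""
    return max(activations)


def frequency_feature(activations):
    """Global average pooling: mean response, which grows with motif frequency."""
    return sum(activations) / len(activations)


# Hypothetical filter that fires only on an exact "TAG" match:
# a full match scores 3, and the bias of -2 suppresses partial matches.
tag_filter = one_hot("TAG")

contig = "ACTAGGTAGCA"
acts = conv1d(one_hot(contig), tag_filter, bias=-2.0)

print("pattern feature:  ", pattern_feature(acts))
print("frequency feature:", frequency_feature(acts))
```

In this toy contig the motif occurs twice among nine windows, so the max-pooled feature only records that the motif is present, while the average-pooled feature reflects how often it appears. A classifier that sees both kinds of features has access to both types of information.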

Biology and life sciences – Genetics – Genomics – Metagenomics – Viral genomics – Viral genome – Microbiology – Microbial genomics – Virology – Molecular biology – Molecular biology techniques – Molecular biology assays and analysis techniques – DNA filter assay – Neuroscience – Neural networks – Research and analysis methods – Database and informatics methods – Bioinformatics – Sequence analysis – Sequence alignment – BLAST algorithm – DNA sequence analysis – Computer and information sciences – Artificial intelligence – Machine learning


