Description |
Advances in sequencing technologies have made it possible to generate large amounts of microbiological sequence data without culture methods. The data generated pose a significant analysis challenge, especially in clinical diagnostics, where accurate and timely diagnoses are key. To enable infectious disease diagnostics, we created Taxonomer, a kmer-based metagenomics software tool that can rapidly process large amounts of sequence data with accuracy and precision similar to slower alignment-based approaches. A kmer is a nucleotide subsequence of length k. Kmer exact matching is performed in RAM, utilizing data structures with rapid query times, which makes kmer approaches orders of magnitude faster than alignment methods. Prior to Taxonomer, other kmer-based methods were subject to high false-positive rates. Taxonomer differs by 1) providing a workflow that reduces false positives, 2) including host-transcript profiling, and 3) providing a novel protein kmer tool to identify viruses, which are typically too divergent to identify reliably using nucleotide sequence. A web-based front-end was created with the D3-enabled iobio framework. Reference sets utilized in Taxonomer were obtained from the NCBI, Greengenes, UNITE, and UniProt databases. A wide range of simulated datasets and real clinical specimens were created or obtained to evaluate Taxonomer. Taxonomer was compared to previously published pipelines (SURPI), classifiers (Kraken, RDP classifier), and sequence alignment methods (BLAST, SNAP, RapSearch2, DIAMOND). Taxonomer was also compared to a commercially available respiratory virus panel and utilized on a large cohort of pneumonia-positive patients who had previously undergone extensive microbiological diagnostics. Taxonomer showed 98.7% agreement with SURPI when assigning reads at the phylum level. Taxonomer, RDP classifier, and Kraken correctly classified 59.5%, 61.7%, and 46.0% of simulated 16S rRNA reads at the species level, respectively.
Protein classification using reads derived from viruses showed sensitivity similar to that of the alignment-based methods RapSearch2 and DIAMOND, but with slightly shorter analysis times. Taxonomer provides an accurate workflow for processing samples in a diagnostic setting. It identifies bacteria, fungi, viruses, and human transcripts from clinical specimens with accuracy comparable to alignment methods. Its web-based front-end makes it accessible to laboratories without significant compute resources. |
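
The kmer exact-matching idea described in the abstract can be illustrated with a minimal sketch. This is not Taxonomer's implementation: it assumes a plain in-memory Python dictionary as the lookup structure, a simple majority vote over kmer hits, and toy reference sequences. The names `kmers`, `build_index`, `classify`, `taxon_A`, and `taxon_B` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping subsequence of length k from seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each kmer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k):
    """Return the label with the most exact kmer hits, or None if nothing matches."""
    hits = Counter()
    for kmer in kmers(read, k):
        # Exact match only: a kmer is either present in the index or ignored.
        for label in index.get(kmer, ()):
            hits[label] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Toy reference sequences (assumed for illustration, not real genomes).
references = {
    "taxon_A": "ACGTACGTTAGCCGATAGGCT",
    "taxon_B": "TTGACCATGGCATCCGATTAA",
}
index = build_index(references, k=5)
label = classify("CGTTAGCCGATA", index, k=5)
print(label)
```

Each lookup is a constant-time hash query on average, which is the property that lets kmer methods avoid the per-read cost of alignment. A real classifier would use far larger k, compact data structures, and a taxonomy-aware way of resolving kmers shared between references.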