Abstract

Population Analysis of Bacterial Samples for Individual Identification in Forensics Application

John P Jakupciak, Jeffrey M Wells, Jeffrey S Lin and Andrew B Feldman

Biodefense preparedness begins with the ability to detect and respond to bio-threats, which depends on accurate interpretation of genetic information with sophisticated yet easy-to-use bioinformatics tools. Microbial forensics further enables attribution of microbial pathogen samples back to a suspected source. Sample characterization and traceability back to source depend on genome identification of specific targets within samples, comprehensive analysis of the mixtures of populations present, detection of major and minor variations in the identified genomes, and comparison of the sample genetic profile against other samples. Commercial Next Generation Sequencing (NGS) platforms offer the promise of dramatically higher detection sensitivity and resolution for forensic DNA samples than is possible with methods in current use. Before applying these technologies to forensic analyses of bacterial samples, however, it is critical to fully elucidate the benefits, caveats and pitfalls of NGS for hypothesis testing in comparative analyses, as this will ultimately be required for NGS to be used both as an investigative tool and as a tool for attribution in courts of law. Methods: We developed and evaluated novel probabilistic algorithms that process metagenomic sequence data from direct sample sequencing to identify the genomes present in mixtures. Results: We present a pipeline for reference-free sample-to-sample comparisons that extends target characterization beyond a single microorganism to the comprehensive content of a sample. Our tools strengthen statistical confidence in tracing the ancestry of samples and attributing samples to a source, with probabilistic certainties on many targets instead of a single genome. Conclusion: This study developed a novel reference-free bioinformatics strategy to account for and identify genetic diversity in samples.
Sequence variants must be non-arbitrarily confirmed in both forward and reverse reads at a rate above the background noise level of sequencing-machine error. A similarity distance metric compares genomes within a range of near relationships. Using sequence data from bio-threat agents, we successfully grouped known related strains together and excluded a near relationship between known unrelated strains. The major strengths of this forensic method are its non-arbitrary determinations of data validation and relatedness metrics, as well as its ability to compare microbial genomes with or without a reference database of related genomes.
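
As an illustration only, the requirement that a variant be confirmed on both strands above the sequencer error background could be sketched as below. This is not the pipeline described in the abstract. The function names, the per-base `error_rate` of 0.01, the significance threshold `alpha`, and the use of a binomial tail test are all assumptions made for this example.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def variant_supported(fwd_alt: int, fwd_depth: int,
                      rev_alt: int, rev_depth: int,
                      error_rate: float = 0.01,   # assumed background error rate
                      alpha: float = 1e-3) -> bool:  # assumed significance cutoff
    """Accept a variant only if each strand, taken separately, shows more
    variant-supporting reads than sequencing error alone would plausibly give."""
    for alt, depth in ((fwd_alt, fwd_depth), (rev_alt, rev_depth)):
        if depth == 0 or alt == 0:
            return False  # no confirmation on this strand
        # Probability of seeing at least `alt` erroneous reads by chance.
        if binomial_tail(alt, depth, error_rate) >= alpha:
            return False  # indistinguishable from background noise
    return True


# Hypothetical read counts: (forward alt, forward depth, reverse alt, reverse depth)
print(variant_supported(12, 100, 10, 90))  # supported on both strands
print(variant_supported(15, 100, 0, 90))   # seen on the forward strand only
print(variant_supported(2, 100, 2, 100))   # within the assumed error background
```

Testing each strand separately, instead of pooling the counts, means that a strand-specific sequencing artifact cannot pass on the strength of one read direction alone.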