View Submission - COMPSTAT2018
Title: Ampliclust: A fully probabilistic model-based approach denoising Illumina amplicon data
Authors: Karin Dorman - Iowa State University (United States) [presenting]
Xiyu Peng - Iowa State University (United States)
Abstract: Next-generation amplicon sequencing is a powerful tool for understanding microbial communities. Downstream analysis is often based on the construction of Operational Taxonomic Units (OTUs) with a dissimilarity threshold of 3%. The arbitrary threshold and the reliance on OTU references can lead to low resolution, false positives, and misestimation of microbial diversity. We introduce Ampliclust, a reference-free method to resolve the number, abundance, and identity of distinct variants sequenced in Illumina amplicon data. Unlike existing methods, Ampliclust is a fully probabilistic model, allowing the data to drive the conclusions rather than an algorithm or an external database. We use a modified Bayesian information criterion to estimate the number of sequence variants, and obtain maximum likelihood estimates of the abundance and identity of variants. Our model matches the performance of existing methods on well-separated mock communities, and achieves better accuracy in simulated communities with more similar variants. The major challenge for using mixture models in this context is computational scalability to datasets consisting of millions or billions of observations in tens to thousands of clusters, which we begin to address through principled iterative schemes and improved initialization methods.
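To illustrate the general idea of choosing the number of variants with an information criterion in a mixture model, here is a minimal Python sketch. It is not the Ampliclust model: it assumes equal-length reads, a single shared substitution error rate, initialization from the most abundant distinct reads, and the plain (unmodified) BIC with each haplotype position counted as one free parameter. All function names (`fit_mixture`, `select_k`, and the helpers) are hypothetical.

```python
from collections import Counter
import numpy as np

BASES = "ACGT"

def encode(reads):
    """Map equal-length DNA strings to an (n, L) integer array."""
    return np.array([[BASES.index(c) for c in r] for r in reads])

def mismatches(x, hap):
    """Number of mismatches between every read and every haplotype, shape (n, k)."""
    return (x[:, None, :] != hap[None, :, :]).sum(axis=2)

def log_joint(x, hap, pi, eps):
    """log(pi_k * P(read_i | haplotype_k)) under one shared error rate eps."""
    mism = mismatches(x, hap)
    length = x.shape[1]
    return np.log(pi) + mism * np.log(eps / 3) + (length - mism) * np.log1p(-eps)

def row_logsumexp(a):
    """Numerically stable log(sum(exp(a))) along each row."""
    m = a.max(axis=1)
    return m + np.log(np.exp(a - m[:, None]).sum(axis=1))

def fit_mixture(reads, k, n_iter=50, eps=0.01):
    """EM for a k-component mixture; returns estimates and the plain BIC."""
    x = encode(reads)
    n, length = x.shape
    # Assumed initialization: the k most abundant distinct reads.
    top = [s for s, _ in Counter(reads).most_common(k)]
    if len(top) < k:
        raise ValueError("k exceeds the number of distinct reads")
    hap = encode(top)
    pi = np.full(k, 1.0 / k)
    onehot = np.eye(4)[x]  # (n, L, 4) indicator of the observed base
    for _ in range(n_iter):
        # E-step: posterior probability that read i came from variant k.
        logp = log_joint(x, hap, pi, eps)
        resp = np.exp(logp - row_logsumexp(logp)[:, None])
        # M-step: abundances, haplotypes (weighted majority base), error rate.
        pi = np.clip(resp.mean(axis=0), 1e-12, None)
        pi /= pi.sum()
        hap = np.einsum("nk,nlb->klb", resp, onehot).argmax(axis=2)
        eps = float(np.clip((resp * mismatches(x, hap)).sum() / (n * length),
                            1e-6, 0.5))
    loglik = float(row_logsumexp(log_joint(x, hap, pi, eps)).sum())
    # Assumed parameter count: (k - 1) abundances + 1 error rate + k*L bases.
    n_params = (k - 1) + 1 + k * length
    return {
        "k": k,
        "abundance": pi.tolist(),
        "haplotypes": ["".join(BASES[b] for b in row) for row in hap],
        "error_rate": eps,
        "loglik": loglik,
        "bic": -2.0 * loglik + n_params * float(np.log(n)),
    }

def select_k(reads, ks):
    """Fit one mixture per candidate k and return the fit with the lowest BIC."""
    return min((fit_mixture(reads, k) for k in ks), key=lambda f: f["bic"])
```

Calling `select_k(reads, range(1, 5))` on a list of equal-length read strings returns the fit with the smallest BIC among the candidate values of k. Fitting every candidate k from scratch in this way is exactly the kind of cost that grows quickly with dataset size, which is the scalability concern the abstract raises.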