CMStatistics 2022
Title: Inferring taxonomic placement from DNA barcoding aiding in discovery of new taxa
Authors: Alessandro Zito - Duke University (United States) [presenting]
Tommaso Rigon - University of Milano-Bicocca (Italy)
David Dunson - Duke University (United States)
Abstract: Predicting the taxonomic affiliation of DNA sequences collected from biological samples is a fundamental step in biodiversity assessment. This task is performed by leveraging existing databases of reference DNA sequences whose taxa are known. However, environmental sequences can come from organisms that are either unknown to science or for which no reference sequences are available. Thus, the taxonomic novelty of a sequence needs to be accounted for during classification. We propose Bayesian nonparametric taxonomic classifiers, BayesANT, which use species sampling model priors to allow new taxa to be discovered at each taxonomic rank. Using a simple product multinomial likelihood with conjugate Dirichlet priors at the lowest rank, a highly flexible algorithm is developed to provide a probabilistic prediction of the taxonomic placement of each sequence at each rank. As an illustration, we run our algorithm on a carefully annotated library of Finnish arthropods. To assess the ability of BayesANT to recognize novelty and to correctly predict known taxonomic affiliations, we test it on two training-test splitting scenarios, each with a different proportion of taxa unobserved in training. We show that our algorithm attains excellent predictive performance and reliably quantifies classification uncertainty, especially when many sequences in the test set are affiliated with taxa unknown in training.
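To make the idea of a product multinomial likelihood with conjugate Dirichlet priors more concrete, the following is a minimal single-rank sketch and not the BayesANT implementation. It assumes aligned sequences of equal length over the alphabet A, C, G, T, a symmetric Dirichlet prior with a hypothetical concentration `alpha`, and a Pitman-Yor type species sampling prior with hypothetical parameters `theta` and `sigma`; the abstract does not state which species sampling model or hyperparameters the authors use, and the function names are illustrative only.

```python
import numpy as np

NUCLEOTIDES = "ACGT"
NEW_TAXON = "<new>"  # hypothetical label for a taxon unseen in training


def encode(seq):
    """Map an aligned DNA string to an integer array with values in 0..3."""
    return np.array([NUCLEOTIDES.index(ch) for ch in seq])


def classify(train_seqs, train_labels, query, alpha=0.5, theta=1.0, sigma=0.0):
    """Posterior probability of each known taxon, plus a new taxon, for `query`.

    Assumed model (illustrative only):
      - each position is multinomial with a symmetric Dirichlet(alpha) prior,
        so the predictive probability of nucleotide a at position j in taxon k
        is (alpha + count_kja) / (4 * alpha + n_k);
      - prior taxon weights follow a Pitman-Yor urn scheme:
        known taxon k gets (n_k - sigma) / (theta + n),
        a new taxon gets (theta + sigma * K) / (theta + n).
    """
    taxa = sorted(set(train_labels))
    n = len(train_seqs)
    q = encode(query)
    p = len(q)
    positions = np.arange(p)

    log_post = {}
    for taxon in taxa:
        members = np.array(
            [encode(s) for s, t in zip(train_seqs, train_labels) if t == taxon]
        )
        n_k = len(members)
        # counts[j, a] = number of training sequences in this taxon
        # showing nucleotide a at position j
        counts = np.zeros((p, 4))
        for a in range(4):
            counts[:, a] = (members == a).sum(axis=0)
        log_pred = np.sum(
            np.log((alpha + counts[positions, q]) / (4 * alpha + n_k))
        )
        log_prior = np.log((n_k - sigma) / (theta + n))
        log_post[taxon] = log_prior + log_pred

    # A new taxon has no data, so its predictive is the prior predictive:
    # each nucleotide has probability alpha / (4 * alpha) = 1/4.
    log_prior_new = np.log((theta + sigma * len(taxa)) / (theta + n))
    log_post[NEW_TAXON] = log_prior_new + p * np.log(0.25)

    # Normalise on the log scale for numerical stability.
    keys = list(log_post)
    vals = np.array([log_post[k] for k in keys])
    probs = np.exp(vals - vals.max())
    probs /= probs.sum()
    return dict(zip(keys, probs))


if __name__ == "__main__":
    seqs = ["AAAA", "AAAA", "AAAT", "CCCC", "CCCC", "CCCG"]
    labels = ["taxon_A"] * 3 + ["taxon_B"] * 3
    print(classify(seqs, labels, "AAAA"))
    print(classify(seqs, labels, "GTGT"))
```

In this toy setting, a query that matches a training taxon closely receives most of its probability mass on that taxon, while a query unlike any training sequence shifts mass toward the new-taxon label. Extending the sketch to several taxonomic ranks would require one such prior per node of the taxonomy, which is not attempted here.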