CMStatistics 2017
B1156
Title: Training black-box models for de novo reconstruction in metagenomic data
Authors: Sergio Bacallado - Cambridge University (United States) [presenting]
Abstract: Metagenomic experiments sequence a mixture of genomes in a sample, for example, a mixture of bacterial genomes from a sample of stool. If there exist reference genomes for the taxa of interest, it is not difficult to assign short reads from a metagenomic dataset to different taxa. When there are taxa which have not been characterised, it becomes necessary to reconstruct their genomes de novo as their abundance in a range of samples is estimated. Current approaches to this problem use our mechanistic understanding of the sequencing process and prior information about the genomes present in the sample. This can involve steps of partial assembly into contiguous sequences, identification of core genes from reference genomes and variable sites within those genes, and statistical modelling of the species abundances in various samples. This process necessarily ignores most of the data, focusing on signals believed reliable a priori. Furthermore, solving the inverse problem at the final step, when species distributions are estimated, can be difficult if uncertainty quantification is desired. We will discuss a different approach, in which we train deep learning models to assign reads to species. The prior information is provided by the training data, which is simulated from a mechanistic model. We explore the reliability of different models, the representations of the data that are learned, and how they are used to classify the reads.
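The idea of training a read classifier purely on simulated data can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the talk: the two synthetic "species" with different base compositions, the uniform substitution-error simulator standing in for a mechanistic sequencing model, and a naive Bayes classifier on 3-mer counts standing in for the deep learning models the abstract describes.

```python
import math
import random
from collections import Counter
from itertools import product

K = 3  # k-mer length (illustrative choice, not from the abstract)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def random_genome(length, weights, rng):
    """Toy genome whose bases A, C, G, T are drawn with the given weights."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def simulate_read(genome, read_len, error_rate, rng):
    """Stand-in 'mechanistic' simulator: random start position, then each
    base is replaced by a uniformly random base with probability error_rate."""
    start = rng.randrange(len(genome) - read_len + 1)
    read = list(genome[start:start + read_len])
    for i in range(read_len):
        if rng.random() < error_rate:
            read[i] = rng.choice("ACGT")
    return "".join(read)

def kmer_counts(read):
    """Count overlapping k-mers in a read."""
    return Counter(read[i:i + K] for i in range(len(read) - K + 1))

def train(labelled_reads):
    """Multinomial naive Bayes on k-mer counts with add-one smoothing.
    Returns per-species log-probabilities for every k-mer."""
    totals = {}
    for read, label in labelled_reads:
        totals.setdefault(label, Counter()).update(kmer_counts(read))
    model = {}
    for label, counts in totals.items():
        denom = sum(counts.values()) + len(KMERS)
        model[label] = {k: math.log((counts[k] + 1) / denom) for k in KMERS}
    return model

def classify(model, read):
    """Assign the read to the species with the highest log-likelihood
    (a uniform prior over species is assumed)."""
    counts = kmer_counts(read)
    return max(
        model,
        key=lambda label: sum(n * model[label][k] for k, n in counts.items()),
    )

rng = random.Random(0)
# Hypothetical species: one AT-rich genome and one GC-rich genome.
genomes = {
    "species_A": random_genome(5000, [0.4, 0.1, 0.1, 0.4], rng),
    "species_B": random_genome(5000, [0.1, 0.4, 0.4, 0.1], rng),
}

def simulate_dataset(n_per_species):
    """Labelled reads drawn from the simulator for every species."""
    return [
        (simulate_read(genome, 100, 0.01, rng), name)
        for name, genome in genomes.items()
        for _ in range(n_per_species)
    ]

model = train(simulate_dataset(200))   # training data is purely simulated
test_set = simulate_dataset(100)       # held-out simulated reads
accuracy = sum(classify(model, r) == y for r, y in test_set) / len(test_set)
print(f"held-out accuracy on simulated reads: {accuracy:.2f}")
```

In this sketch the simulator plays the role of the prior information: the classifier never sees reference labels from real data, only reads generated from the assumed model. A realistic setting would need a far richer simulator and a model able to learn its own representation of the reads, which is the question the abstract raises.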