Title: High-dimensional classification with positive and unlabeled data
Authors: Garvesh Raskutti - University of Wisconsin-Madison (United States) [presenting]
Abstract: In a number of scientific technologies, a classification problem presents only positive and unlabeled (PU) data. For example, deep mutational scanning in biochemistry is a high-throughput technology that relates biochemical function to sequence structure, and it often provides only positive (indicating functional) and unlabeled responses for different protein sequences. Unlabeled data pose a challenge for classification and typically lead to a non-convex optimization problem, because hidden variables indicate whether each unlabeled response is positive or negative. Furthermore, since the protein sequences are long, the total number of features (covariates) is large. We present an approach that combines the EM algorithm with quadratic majorization to address the computational challenge associated with high-dimensional PU learning. We also provide statistical guarantees that establish convergence to a local minimum in the high-dimensional setting. The performance of the algorithm is demonstrated both on simulated data and on a real-world problem that addresses the question of how protein sequence structure influences biochemical function.
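
The idea of pairing EM with quadratic majorization can be illustrated with a minimal sketch. It is not the presenter's algorithm: it assumes a simplified model in which each true positive is labeled independently with a known probability `c`, a logistic model for the hidden labels, and an L1 penalty `lam` to handle the high-dimensional setting. The function names (`pu_em_mm`, `pu_objective`, `soft_threshold`) and all parameter values are hypothetical. The E-step computes the posterior probability that each unlabeled response is positive; the M-step is replaced by a single minimization of a quadratic upper bound on the weighted logistic loss, which reduces to a soft-thresholded gradient step.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pu_objective(beta, X, z, c, lam):
    # Penalized negative observed-data log-likelihood under the ASSUMED model
    # P(z = 1 | x) = c * sigmoid(x @ beta), where z = 1 means "labeled positive".
    q = np.clip(c * sigmoid(X @ beta), 1e-12, 1.0 - 1e-12)
    nll = -np.mean(z * np.log(q) + (1 - z) * np.log(1.0 - q))
    return nll + lam * np.sum(np.abs(beta))

def pu_em_mm(X, z, c, lam=0.02, n_iter=300):
    n, d = X.shape
    # The Hessian of the averaged logistic loss is bounded by X^T X / (4n),
    # so L is a valid curvature for the quadratic majorizer.
    L = np.linalg.norm(X, 2) ** 2 / (4.0 * n)
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        # E-step: posterior probability that the hidden label is positive.
        # Labeled points are positive for sure; unlabeled ones use Bayes' rule.
        w = np.where(z == 1, 1.0, (1.0 - c) * p / (1.0 - c * p))
        # M-step (one majorization step): minimize the quadratic upper bound
        # of the weighted logistic loss plus the L1 penalty.
        grad = X.T @ (p - w) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

# Simulated example (all values are illustrative assumptions)
rng = np.random.default_rng(0)
n, d, c = 500, 20, 0.5
beta_true = np.zeros(d)
beta_true[:2] = [2.0, -2.0]
X = rng.standard_normal((n, d))
y = rng.random(n) < sigmoid(X @ beta_true)        # hidden true labels
z = (y & (rng.random(n) < c)).astype(float)       # observed: positive or unlabeled
beta_hat = pu_em_mm(X, z, c)
```

Because each update minimizes an upper bound that touches the objective at the current iterate, the penalized objective cannot increase from one iteration to the next. The objective itself remains non-convex, so the sketch is only expected to reach a local solution.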