Title: Optimal and safe estimation for high-dimensional semi-supervised learning
Authors: Yang Ning - Cornell University (United States) [presenting]
Abstract: In many settings, such as electronic health records, the outcome is much more difficult to collect than the covariates. We consider the linear regression problem with such a data structure in the high-dimensional setting. Our goal is to investigate when and how unlabeled data can be exploited to improve estimation and inference for the regression parameters, especially given that the linear model may be misspecified in practice. In particular, we address two questions. (1) Can we use the labeled and unlabeled data together to construct a semi-supervised estimator whose convergence rate is faster than that of supervised estimators? (2) Can we construct confidence intervals or hypothesis tests that are guaranteed to be more efficient or more powerful than their supervised counterparts?