CMStatistics 2021
Title: Surrogate assisted semi-supervised inference for high dimensional risk prediction
Authors: Jue Hou - Harvard T.H. Chan School of Public Health (United States) [presenting]
Zijian Guo - Rutgers University (United States)
Tianxi Cai - Harvard School of Public Health (United States)
Abstract: Risk modeling with electronic health record (EHR) data is challenging because the disease outcome is rarely observed directly and the candidate predictors are high dimensional. We develop a surrogate-assisted semi-supervised learning (SAS) approach to risk modeling with high-dimensional predictors. It leverages a large unlabeled dataset containing candidate predictors and surrogates of the outcome, together with a small labeled dataset with annotated outcomes. The SAS procedure borrows information from the surrogates and the candidate predictors to impute the unobserved outcomes via a sparse working imputation model. Moment conditions make the procedure robust to misspecification of the imputation model, and a one-step bias correction enables interval estimation for the predicted risk. We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high-dimensional working model, even when the underlying risk prediction model is dense and the risk model is misspecified. We present an extensive simulation study to demonstrate the superiority of our semi-supervised approach over existing supervised methods. We apply the method to derive genetic risk prediction of type 2 diabetes mellitus using an EHR biobank cohort.
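The two-stage idea in the abstract (impute outcomes from surrogates and predictors, then fit the risk model on the imputed outcomes) can be illustrated with a simplified sketch. This is not the authors' SAS estimator: it omits the moment conditions and the one-step bias correction, and it produces point estimates only. The simulated data, the penalty levels `lam_impute` and `lam_risk`, and the helper `l1_logistic` are all assumptions introduced for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l1_logistic(X, y, lam, n_iter=500):
    """L1-penalized logistic regression fitted by proximal gradient (ISTA).

    y may be fractional (imputed probabilities). The intercept is not penalized.
    """
    n, p = X.shape
    Z = np.hstack([np.ones((n, 1)), X])
    beta = np.zeros(p + 1)
    # Step size 1/L, where L = ||Z||_2^2 / (4n) bounds the gradient's Lipschitz constant.
    step = 4.0 * n / (np.linalg.norm(Z, 2) ** 2)
    for _ in range(n_iter):
        grad = Z.T @ (sigmoid(Z @ beta) - y) / n
        beta = beta - step * grad
        # Soft-threshold every coefficient except the intercept.
        beta[1:] = np.sign(beta[1:]) * np.maximum(np.abs(beta[1:]) - step * lam, 0.0)
    return beta

# Hypothetical simulated data: p predictors X, binary outcome y, noisy surrogate S.
rng = np.random.default_rng(0)
n_lab, n_unlab, p = 200, 2000, 50
X = rng.standard_normal((n_lab + n_unlab, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -1.0, 0.5]
y = rng.binomial(1, sigmoid(X @ beta_true)).astype(float)
S = y + 0.5 * rng.standard_normal(n_lab + n_unlab)  # surrogate of the outcome

lab = np.arange(n_lab)                      # outcomes observed only here
unlab = np.arange(n_lab, n_lab + n_unlab)   # outcomes treated as unobserved

# Stage 1: sparse working imputation model on labeled data, using X and S.
lam_impute, lam_risk = 0.05, 0.01           # assumed penalty levels, not tuned
W = np.column_stack([X, S])
gamma = l1_logistic(W[lab], y[lab], lam_impute)

# Stage 2: impute outcome probabilities for the unlabeled records.
y_imp = sigmoid(gamma[0] + W[unlab] @ gamma[1:])

# Stage 3: sparse risk model on X alone, fitted to the imputed outcomes.
beta_sas = l1_logistic(X[unlab], y_imp, lam_risk)
print("first three fitted coefficients:", np.round(beta_sas[1:4], 2))
```

The surrogate enters only the imputation stage, so the final risk model depends on the candidate predictors alone. Because the imputed outcomes are probabilities, the fitting routine is written to accept fractional responses rather than hard 0/1 labels.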