CMStatistics 2020
Title: Simultaneous feature selection and outlier detection with optimality guarantees
Authors: Luca Insolia - Scuola Normale Superiore (Italy) [presenting]
Ana Kenney - Pennsylvania State University (United States)
Francesca Chiaromonte - The Pennsylvania State University (United States)
Giovanni Felici - Consiglio Nazionale delle Ricerche (Italy)
Abstract: Sparse estimation in the presence of outliers has received considerable attention in the last decade. We contribute by considering high-dimensional regression models contaminated by multiple mean-shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed-integer programming to simultaneously perform feature selection and outlier detection with provable optimality guarantees. We prove theoretical properties of our approach, namely: a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Notably, our proposal requires weaker assumptions than prior methods in the literature and, unlike such methods, allows the sparsity level and/or the amount of contamination to grow with the number of predictors and/or the sample size. Moreover, we provide computationally efficient procedures to tune integer constraints and warm-start the solution algorithm and, through simulations, show the superior performance of our proposal with respect to existing heuristic methods. Finally, the method is applied to investigate the role of the microbiome in childhood obesity.
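To make the idea of simultaneous selection and detection concrete, here is a minimal sketch on a toy problem. It is not the authors' mixed-integer formulation or solver: it assumes fixed budgets for the number of active features and the number of flagged outliers, and replaces the MIP with exhaustive enumeration, which finds the same kind of exact optimum only because the problem is tiny. The function names, the toy data, and the budgets `k_features` and `k_outliers` are all illustrative assumptions.

```python
from itertools import combinations

import numpy as np


def make_toy_data(seed=0):
    """Hypothetical toy data: 12 cases, 4 features, 2 mean-shift outliers."""
    rng = np.random.default_rng(seed)
    n, p = 12, 4
    X = rng.normal(size=(n, p))
    beta = np.array([2.0, 0.0, -3.0, 0.0])  # only features 0 and 2 are active
    y = X @ beta + 0.01 * rng.normal(size=n)
    y[[1, 7]] += 10.0  # shift the response of cases 1 and 7
    return X, y


def fit_by_enumeration(X, y, k_features, k_outliers):
    """Jointly choose a feature subset and an outlier subset minimizing the RSS.

    Every candidate outlier set is dropped from the fit (equivalent to giving
    those cases a free mean-shift term), and every candidate feature set is
    fitted by least squares on the remaining cases.
    """
    n, p = X.shape
    best = (np.inf, None, None, None)
    for outliers in combinations(range(n), k_outliers):
        keep = [i for i in range(n) if i not in outliers]
        for features in combinations(range(p), k_features):
            cols = list(features)
            X_sub = X[np.ix_(keep, cols)]
            coef, *_ = np.linalg.lstsq(X_sub, y[keep], rcond=None)
            resid = y[keep] - X_sub @ coef
            rss = float(resid @ resid)
            if rss < best[0]:
                beta_hat = np.zeros(p)
                beta_hat[cols] = coef
                best = (rss, features, outliers, beta_hat)
    return best


X, y = make_toy_data()
rss, features, outliers, beta_hat = fit_by_enumeration(X, y, k_features=2, k_outliers=2)
print("selected features:", features)
print("flagged outliers:", outliers)
print("coefficients:", np.round(beta_hat, 2))
```

Enumeration scales combinatorially in both the number of features and the number of cases, which is why an exact approach at realistic sizes needs a mixed-integer solver together with the tuning and warm-start procedures the abstract mentions.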