CMStatistics 2021
B0812
Title: Simultaneous semi-parametric estimation of clustering and regression
Authors: Matthieu Marbac - CREST - ENSAI (France) [presenting]
Mohammed Sedki - Paris-Sud University, Inserm, Pasteur, UVSQ (France)
Christophe Biernacki - Inria (France)
Vincent Vandewalle - Inria (France)
Abstract: Parameter estimation for regression models with fixed group effects is investigated when the group variable is missing but group-related variables are available. This problem involves clustering, to infer the missing group variable from the group-related variables, and regression, to build a model of the target variable given the group and possibly some additional variables. The problem can thus be formulated as modeling the joint distribution of the target and the group-related variables. The usual estimation strategy for this joint model is a two-step approach: first learn the group variable (clustering step), then plug its estimate into the fit of the regression model (regression step). However, this approach is suboptimal (in particular, it yields biased regression estimates) because it does not use the target variable for clustering. We therefore advise estimating clustering and regression simultaneously, in a semi-parametric framework. Numerical experiments illustrate the benefits of our proposal across wide ranges of distributions and regression models. The relevance of the new method is illustrated on real data concerning high blood pressure prevention. The proposed approach is implemented in the R package ClusPred, available on CRAN.
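The bias of the two-step strategy described in the abstract can be seen on a toy simulation. The sketch below is only an illustration under strong assumptions that are not taken from the abstract: two groups, a single Gaussian group-related variable, a Gaussian target whose mean depends on the group alone, and a 1-D 2-means clustering step. It is neither the authors' semi-parametric estimator nor the ClusPred API; the names `simulate` and `two_step_effect` are hypothetical.

```python
import numpy as np


def simulate(n, delta_x, beta, rng):
    """Draw latent groups z, a group-related variable x, and a target y.

    Toy assumption: x and y are independent Gaussians given the group z.
    """
    z = rng.integers(0, 2, size=n)              # latent group in {0, 1}
    x = rng.normal(loc=delta_x * z, scale=1.0)  # group-related variable
    y = rng.normal(loc=beta[z], scale=1.0)      # target: fixed group effect + noise
    return z, x, y


def two_step_effect(x, y, n_iter=50):
    """Cluster on x alone (1-D 2-means), then estimate group effects on y."""
    centers = np.array([x.min(), x.max()])
    for _ in range(n_iter):
        # Clustering step: assign each point to the nearest center.
        labels = (np.abs(x - centers[1]) < np.abs(x - centers[0])).astype(int)
        centers = np.array([x[labels == k].mean() for k in (0, 1)])
    # Regression step: with fixed group effects only, the fit reduces to
    # the per-cluster means of y under the plugged-in labels.
    means = np.array([y[labels == k].mean() for k in (0, 1)])
    return means[1] - means[0]


rng = np.random.default_rng(0)
beta = np.array([0.0, 2.0])  # true group effects (assumed for the toy example)
z, x, y = simulate(20000, delta_x=2.0, beta=beta, rng=rng)

# Oracle difference in group effects, using the true (normally missing) groups.
oracle = y[z == 1].mean() - y[z == 0].mean()
# Two-step (plug-in) estimate, using clusters learned from x only.
plug_in = two_step_effect(x, y)

print(f"oracle effect difference:   {oracle:.3f}")
print(f"two-step effect difference: {plug_in:.3f}")
```

Because the clusters learned from x alone misclassify part of the sample, the plug-in difference is attenuated toward zero relative to the oracle one. A simultaneous approach would also let y inform the group assignment, which is the motivation stated in the abstract.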