Title: ALS algorithm for CDPCA on high-dimensional data sets: An empirical study
Authors: Adelaide Freitas - University of Aveiro (Portugal) [presenting]
Maurizio Vichi - University La Sapienza, Rome (Italy)
Abstract: When applied to high-dimensional data sets, constrained Principal Component Analysis (PCA) techniques that yield sparse solutions are particularly useful because they make the components easier to interpret. Clustering and Disjoint Principal Component Analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix and simultaneously identifies clusters of objects. Using simulated and real gene expression data sets in which the number of variables exceeds the number of objects, we empirically evaluate the performance of the Alternating Least Squares (ALS) algorithm, a heuristic iterative procedure proposed in the specialized literature to perform CDPCA. Our numerical tests show that ALS performs well and produces satisfactory results in terms of solution precision. The complexity of the data structure (i.e., the error level of the CDPCA model from which the data were generated) seems to influence the ability of ALS to recover the true object clusters when the sample size is not large: for smaller sample sizes, ALS performs better when the error level is lower. The proportion of variance explained by the components estimated by ALS is also affected by the data structure complexity (the higher the error level, the lower the explained variance).
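
The relationship between error level and explained variance can be illustrated with a minimal sketch. It is not the ALS algorithm evaluated in the study. It assumes a generating model of the form X = U Ybar A' + error_level * E, which the abstract does not state, with U a binary cluster-membership matrix, Ybar a matrix of cluster centroids in the reduced space, and A a loadings matrix with exactly one nonzero entry per row (disjoint components). The function names, dimensions, and error levels are hypothetical, and the explained variance is computed with the true loadings A rather than with an ALS estimate.

```python
import numpy as np

def simulate_cdpca(n_obj, n_var, n_clust, n_comp, error_level, rng):
    """Draw data from an assumed CDPCA-type model X = U Ybar A' + error_level * E."""
    # Binary membership matrix U: each object belongs to exactly one cluster.
    labels = np.arange(n_obj) % n_clust
    U = np.eye(n_clust)[labels]
    # Disjoint loadings A: each variable loads on exactly one component.
    groups = np.arange(n_var) % n_comp
    A = np.zeros((n_var, n_comp))
    A[np.arange(n_var), groups] = rng.normal(size=n_var)
    A /= np.linalg.norm(A, axis=0)  # unit-length columns, so A'A = I
    # Cluster centroids in the reduced space, plus Gaussian noise.
    Ybar = rng.normal(scale=3.0, size=(n_clust, n_comp))
    E = rng.normal(size=(n_obj, n_var))
    X = U @ Ybar @ A.T + error_level * E
    return X, U, A

def explained_variance_ratio(X, A):
    """Share of the total variance of X captured by the component scores X A."""
    Xc = X - X.mean(axis=0)
    return np.sum((Xc @ A) ** 2) / np.sum(Xc ** 2)

# More variables (100) than objects (20), as in the setting of the study.
# The same seed is reused so that only the error level differs between runs.
X_low, U, A = simulate_cdpca(20, 100, 3, 3, 0.1, np.random.default_rng(0))
X_high, _, _ = simulate_cdpca(20, 100, 3, 3, 2.0, np.random.default_rng(0))
ratio_low = explained_variance_ratio(X_low, A)
ratio_high = explained_variance_ratio(X_high, A)
print(f"low error: {ratio_low:.3f}, high error: {ratio_high:.3f}")
```

Because the columns of A have disjoint supports and unit length, they are orthonormal, so the ratio lies between 0 and 1. Under these assumptions the noise spreads over all variables while the components capture only a few directions, which is why the ratio drops as the error level grows.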