Title: Selecting influential/predictive variables for a large data set
Authors: Shaw-Hwa Lo - Columbia University (United States) [presenting]
Abstract: Prediction for very large data sets is typically carried out in two stages: variable selection and pattern recognition. Ordinarily, variable selection involves assessing how well individual explanatory variables correlate with the dependent variable, using a significance-based criterion. This practice neglects possible interactions among the explanatory variables and can therefore choose less predictive variables, because significance does not imply predictivity and important joint information may be omitted. When a subset of truly influential variables is identified, one may expect a noticeable increase in the correct prediction rate, in both simple and complex data. However, high dimensionality and complicated interactions pose great difficulties for existing selection procedures. We consider an alternative selection approach based on the I-score, which directly measures a variable set's ability to predict (termed ``predictivity'') without relying on cross-validation (CV). We argue that the I-score not only reflects the true amount of interaction among variables, but can also be related to a lower bound on the correct prediction rate and does not overfit. The value of the I-score measures the amount of ``influence'' of the variable set under consideration. We suggest a new criterion for locating highly predictive variables, using the partition retention (PR) method with the I-score. The PR method was effective in reducing prediction error from 30\% to 8\% on a long-studied breast cancer data set.
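
To illustrate how a set-level score can capture joint information that marginal screening misses, the following is a minimal sketch rather than the author's implementation. The abstract does not state the I-score formula; the form used here (the sum over partition cells of n_j^2 (mean_j - mean)^2, normalized by n times the response variance) is an assumption based on common presentations of the I-score, and the normalization in particular may differ from the one used in the talk. The function name `i_score` and the toy XOR data are hypothetical.

```python
from collections import defaultdict


def i_score(X, y, subset):
    """Assumed I-score of the discrete variables in `subset` (column indices).

    Assumed form: sum_j n_j^2 * (mean_j - mean)^2 / (n * var(y)),
    where j runs over the cells of the partition induced by `subset`.
    """
    n = len(y)
    y_bar = sum(y) / n
    # Variance of the response, used here as an assumed normalizer.
    var_y = sum((v - y_bar) ** 2 for v in y) / n
    if var_y == 0:
        return 0.0

    # Group responses by the joint values of the selected variables.
    cells = defaultdict(list)
    for row, v in zip(X, y):
        cells[tuple(row[k] for k in subset)].append(v)

    total = 0.0
    for values in cells.values():
        n_j = len(values)
        mean_j = sum(values) / n_j
        total += n_j ** 2 * (mean_j - y_bar) ** 2
    return total / (n * var_y)


# Toy data (hypothetical): y is the XOR of two binary variables, so
# neither variable is marginally related to y, but the pair determines it.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
y = [a ^ b for a, b in X]

marginal = i_score(X, y, [0])   # one variable alone
joint = i_score(X, y, [0, 1])   # both variables together
print(marginal, joint)
```

On this toy data the single-variable score is zero while the two-variable score is large, which is the kind of pure interaction that a screen based on individual significance would overlook.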