Title: Deriving nearly-optimal subdata
Authors: Min Yang - University of Illinois at Chicago (United States) [presenting]
Abstract: The extraordinary size of big data poses an unprecedented challenge for analysis. One strategy for handling such massive data is data reduction: instead of analyzing the full dataset, one analyzes a selected subset, called subdata. Various subdata selection methods have been proposed. While the trade-off between computational complexity and statistical efficiency has been studied, little is known about how statistically efficient the selected subdata is relative to the best achievable subdata. To answer this question, we need to find an optimal subdata. Deriving an optimal subdata, however, is an NP-hard problem. A novel framework for deriving a nearly-optimal subdata under any given statistical model, regardless of the optimality criterion or parameters of interest, will be introduced. This framework has three benefits: (i) it reveals the structure of a nearly-optimal subdata for any given full data under various set-ups (model, optimality criterion, parameters of interest); (ii) it yields a highly accurate measure of statistical efficiency; and (iii) it provides a tool for deriving a nearly-optimal subset in active learning, where statistical efficiency is the main concern.
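To make the notion of statistical efficiency concrete, the following is a minimal sketch and not the framework described in the abstract. It assumes a linear regression model and the D-optimality criterion, and it computes the relative D-efficiency of one subdata against a reference subdata of the same size. The function name `d_efficiency`, the simulated data, and the "largest-norm rows" reference subset are all hypothetical choices for illustration; in practice the reference would be a nearly-optimal subdata.

```python
import numpy as np

def d_efficiency(X_sub, X_ref):
    """Relative D-efficiency of X_sub against X_ref under a linear model.

    Computed as (det(X_sub' X_sub) / det(X_ref' X_ref)) ** (1 / p),
    where p is the number of columns. Log-determinants are used for
    numerical stability.
    """
    p = X_sub.shape[1]
    _, logdet_sub = np.linalg.slogdet(X_sub.T @ X_sub)
    _, logdet_ref = np.linalg.slogdet(X_ref.T @ X_ref)
    return float(np.exp((logdet_sub - logdet_ref) / p))

# Simulated full data (assumption): intercept plus two Gaussian covariates.
rng = np.random.default_rng(0)
N, n, p = 10_000, 200, 3
X_full = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])

# Subdata 1: a uniform random subsample of size n.
idx_random = rng.choice(N, size=n, replace=False)

# Subdata 2 (hypothetical reference): the n rows with the largest covariate norm.
idx_extreme = np.argsort(-np.linalg.norm(X_full[:, 1:], axis=1))[:n]

eff = d_efficiency(X_full[idx_random], X_full[idx_extreme])
print(f"D-efficiency of random subdata relative to the reference: {eff:.3f}")
```

A value below 1 indicates that the first subdata carries less information about the regression parameters than the reference. How close the reference itself is to the optimum is exactly the question the abstract raises.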