Title: A unified nonparametric procedure on detecting spurious discoveries under sparse signals
Authors: Wen Zhou - Colorado State University (United States) [presenting]
Chao Zheng - University of Melbourne (Australia)
Wenxin Zhou - University of California San Diego (United States)
Lyuou Zhang - Colorado State University (United States)
Abstract: Identifying a subset of response-associated covariates from a large number of candidates has become a fundamental tool for scientific discovery, particularly in biology, with examples including differential analysis in genomics and genome-wide association studies in genetics. However, given the high dimensionality and the sparsity of signals in the data, spurious discoveries can easily arise. Moreover, the ubiquity of mixed-type data with complex dependence greatly limits the applicability of traditional goodness-of-fit-based procedures. We introduce a statistical measure of the goodness of spurious fit based on the maximum rank correlations between predictors and responses. The proposed statistic imposes no assumptions on the data types, the dependence structure, or the underlying models. We derive the asymptotic distribution of this measure under very mild assumptions on the associations between predictors and responses. This asymptotic distribution depends on the sample size, the ambient dimension, the number of predictors under study, and the covariance information. We propose a multiplier bootstrap procedure to estimate this distribution and use it as a benchmark to guard against spurious discoveries. The procedure is also applied to variable selection problems in high-dimensional generalized regression. We apply our method to genetic studies to demonstrate that the proposed measure provides a statistical verification of the detected biomarkers.
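To make the idea of benchmarking a discovery against the maximum spurious rank correlation concrete, the following is a minimal illustrative sketch, not the authors' procedure. It assumes Spearman correlation as the rank correlation, considers a single selected predictor, assumes continuous data without ties, and swaps in a simple permutation null in place of the multiplier bootstrap described in the abstract. All function names (`ranks`, `max_rank_corr`, `spurious_null`) and the simulated data are hypothetical.

```python
import numpy as np

def ranks(a):
    # Column-wise ranks 0..n-1; assumes no ties (continuous data).
    return np.argsort(np.argsort(a, axis=0), axis=0).astype(float)

def max_rank_corr(X, y):
    """Largest absolute Spearman correlation between y and any column of X."""
    rx = ranks(X)
    ry = ranks(y.reshape(-1, 1))
    rx -= rx.mean(axis=0)
    ry -= ry.mean(axis=0)
    num = (rx * ry).sum(axis=0)
    den = np.sqrt((rx ** 2).sum(axis=0) * (ry ** 2).sum())
    return np.abs(num / den).max()

def spurious_null(X, y, n_perm=200, seed=0):
    # Permutation stand-in for the multiplier bootstrap (an assumption):
    # shuffling y breaks any association with X, so each draw is the
    # maximum rank correlation attainable by chance alone.
    rng = np.random.default_rng(seed)
    return np.array([max_rank_corr(X, rng.permutation(y)) for _ in range(n_perm)])

# Simulated example: n samples, p candidate predictors, one true signal.
rng = np.random.default_rng(1)
n, p = 100, 500
X = rng.standard_normal((n, p))
y_signal = 2.0 * X[:, 0] + rng.standard_normal(n)

observed = max_rank_corr(X, y_signal)
null_draws = spurious_null(X, y_signal)
threshold = np.quantile(null_draws, 0.95)

# A discovery is flagged as potentially spurious if it does not exceed
# the upper quantile of the chance-only benchmark.
print(f"observed={observed:.3f}, 95% spurious benchmark={threshold:.3f}")
print("looks spurious" if observed <= threshold else "exceeds spurious benchmark")
```

In this sketch the benchmark grows with the number of candidate predictors, which illustrates why a correlation that looks large in isolation can still be consistent with pure chance when many covariates are screened.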