Journal of Proteomics & Bioinformatics

Open Access

ISSN: 0974-276X


Feature Selection using Bootstrapped ROC Curves

Ping Xu, Xiang Liu, David Hadley, Shuai Huang, Jeffrey Krischer and Craig Beam

Background: When modeling an N by m data matrix, i.e. N samples in an m-dimensional feature space, problems arise when m exceeds N. In medical research the sample size often cannot be increased because the number of diseased subjects is limited. Feature selection is therefore commonly used to choose a subset of the m variables that is relevant to the outcome, usually with fewer than N variables, for use in model construction.

Method: A multi-step bootstrap method is proposed for feature selection. It quantifies the relevance of each candidate predictor to the outcome by its area under the Receiver Operating Characteristic curve (ROCAUC), computed across bootstrap resamples, and then retains only the significant variables, i.e. those that meet pre-specified criteria.
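The general idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the resampling scheme or the selection criteria, so the stratified resampling of cases and controls, the function names, the `min_mean_auc` threshold, and the 2.5th-percentile rule below are all assumptions chosen for illustration. The sketch also assumes that higher predictor values indicate cases.

```python
import numpy as np

def auc(scores, labels):
    """Empirical ROC AUC: P(case score > control score), ties count one half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # all case/control pairs
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def bootstrap_aucs(scores, labels, n_boot=200, rng=None):
    """AUCs from bootstrap resamples (assumption: cases and controls
    are resampled separately so every resample contains both classes)."""
    rng = np.random.default_rng(rng)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    aucs = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.concatenate([
            rng.choice(pos_idx, size=pos_idx.size, replace=True),
            rng.choice(neg_idx, size=neg_idx.size, replace=True),
        ])
        aucs[b] = auc(scores[idx], labels[idx])
    return aucs

def select_features(X, y, min_mean_auc=0.7, n_boot=200, seed=0):
    """Hypothetical criteria: keep column j if its mean bootstrapped AUC is at
    least min_mean_auc and its 2.5th bootstrap percentile is above 0.5."""
    rng = np.random.default_rng(seed)
    selected = {}
    for j in range(X.shape[1]):
        aucs = bootstrap_aucs(X[:, j], y, n_boot=n_boot, rng=rng)
        if aucs.mean() >= min_mean_auc and np.percentile(aucs, 2.5) > 0.5:
            selected[j] = float(aucs.mean())
    return selected

# Toy data: 30 cases, 30 controls, 50 predictors; only column 0 is informative.
rng = np.random.default_rng(42)
y = np.repeat([1, 0], 30)
X = rng.normal(size=(60, 50))
X[y == 1, 0] += 2.0                              # shift cases on predictor 0
selected = select_features(X, y)
print(selected)                                  # maps column index to mean bootstrapped AUC
```

With only 60 subjects and 49 noise predictors, a pure-noise column can occasionally pass such a threshold by chance, which is one reason the choice of criteria matters in practice.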

Results: Extensive simulations were conducted using thousands of predictor variables and five levels of predictive ability between the true predictor and the outcome. The simulation results indicate that the mean of the ROCAUCs from bootstrap samples is close to the true ROCAUC. Even with only 30 cases and 30 controls, all 25 listed predictor variables were assigned the correct level of classification ability when the mean of bootstrapped ROCAUCs was used. The proposed bootstrapped ROCAUC method outperforms the single ROCAUC. The standard error of the mean of bootstrapped ROCAUCs was 20% to 50% smaller than the standard error of the single ROCAUC estimate from the original sample. An illustrative example applies the proposed methodology to identify gene expression variables that could predict clinical survival in breast cancer patients, using the breast cancer data from the Van’t Veer study.
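A simulation of this kind needs predictors whose true ROCAUC is known in advance. The abstract does not say how the simulated predictors were generated, so the sketch below rests on an assumed model: cases and controls are drawn from unit-variance normal distributions whose means differ by delta, in which case the true AUC equals Phi(delta / sqrt(2)), where Phi is the standard normal distribution function. The five AUC levels listed in the code are hypothetical, as are the function names.

```python
import numpy as np
from statistics import NormalDist

def shift_for_auc(target_auc):
    """Mean shift delta such that Phi(delta / sqrt(2)) equals target_auc
    (assumption: equal-variance normal scores in cases and controls)."""
    return float(np.sqrt(2.0) * NormalDist().inv_cdf(target_auc))

def simulate_predictor(target_auc, n_cases, n_controls, rng):
    """Draw case and control values for one predictor with the given true AUC."""
    delta = shift_for_auc(target_auc)
    cases = rng.normal(loc=delta, scale=1.0, size=n_cases)
    controls = rng.normal(loc=0.0, scale=1.0, size=n_controls)
    return cases, controls

def empirical_auc(cases, controls):
    """Fraction of case/control pairs ranked correctly, ties count one half."""
    diff = cases[:, None] - controls[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

rng = np.random.default_rng(0)
levels = [0.5, 0.6, 0.7, 0.8, 0.9]               # hypothetical AUC levels
results = {}
for level in levels:
    # Large samples so the empirical AUC sits close to the target value.
    cases, controls = simulate_predictor(level, 2000, 2000, rng)
    results[level] = empirical_auc(cases, controls)
    print(f"target AUC {level:.1f}, empirical AUC {results[level]:.3f}")
```

Replacing the large sample with 30 cases and 30 controls shows how much a single empirical AUC can vary around its target at the sample size discussed in the Results.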

Conclusion: We conclude that the bootstrapped ROCAUC methodology is intuitive and attractive for feature selection when the goals of a study are to identify important predictors and to provide insight into the discriminative or predictive ability of individual predictor variables. Such goals are common in microarray studies and in the discovery of new biomarkers.