Empirical performance of cross-validation with oracle methods in a genomics context

Josue G. Martinez, Raymond J. Carroll, Samuel Müller, Joshua N. Sampson, Nilanjan Chatterjee

Research output: Contribution to journal (Article)

Abstract

When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to nonoracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
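
The phenomenon described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or data: it swaps in the plain Lasso (fitted by coordinate descent) for SCAD and the Adaptive Lasso, uses a small simulated dataset with sparse, weak signals, and all function names, the penalty grid, and the simulation settings are hypothetical choices made for illustration. It repeats 10-fold cross-validation on one fixed dataset with different random fold splits and records how many variables are selected each time.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, sweeps=30):
    """Lasso by coordinate descent on (1/2n)||y - Xb||^2 + lam*||b||_1, no intercept."""
    n, p = len(X), len(X[0])
    # Precompute X'X/n and X'y/n so each coordinate update costs O(p).
    gram = [[sum(X[i][j] * X[i][k] for i in range(n)) / n for k in range(p)]
            for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(p)]
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes coordinate j.
            rho = xty[j] - sum(gram[j][k] * b[k] for k in range(p) if k != j)
            b[j] = soft_threshold(rho, lam) / gram[j][j]
    return b


def cv_error(X, y, lam, folds):
    """Mean squared prediction error of the Lasso at `lam` over the given folds."""
    total = 0.0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in range(len(X)) if i not in held_out]
        b = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - sum(x * c for x, c in zip(X[i], b))) ** 2
                     for i in test_idx)
    return total / len(X)


def selected_count_one_cv_run(X, y, lambdas, m, rng):
    """One run of m-fold CV: random split, choose lam, refit, count nonzero coefficients."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[k::m] for k in range(m)]
    best_lam = min(lambdas, key=lambda lam: cv_error(X, y, lam, folds))
    b = lasso_cd(X, y, best_lam)
    return sum(1 for c in b if c != 0.0)


def simulate(n=50, p=10, n_signal=3, beta=0.25, seed=1):
    """Hypothetical sparse, weak-signal data: only the first n_signal columns matter."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [sum(beta * row[j] for j in range(n_signal)) + rng.gauss(0, 1) for row in X]
    return X, y


def repeated_cv_counts(n_runs=10, m=10, seed=2):
    """Number of selected variables from each of n_runs independent m-fold CV runs."""
    X, y = simulate()
    lambdas = [0.03 * k for k in range(1, 9)]  # hypothetical penalty grid
    rng = random.Random(seed)
    return [selected_count_one_cv_run(X, y, lambdas, m, rng) for _ in range(n_runs)]


if __name__ == "__main__":
    # The data are fixed; only the fold assignment changes between runs.
    print(repeated_cv_counts())
```

Because the dataset is held fixed, any spread in the printed counts comes only from the random assignment of observations to folds, which is the source of variation the abstract refers to. How large that spread is in this toy setting depends on the assumed signal strength and penalty grid.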

Original language: English (US)
Pages (from-to): 223-228
Number of pages: 6
Journal: American Statistician
Volume: 65
Issue number: 4
DOI: 10.1198/tas.2011.11052
State: Published - Nov 2011
Externally published: Yes

Keywords

  • Adaptive lasso
  • Lasso
  • Model selection
  • Oracle estimation

ASJC Scopus subject areas

  • Mathematics(all)
  • Statistics and Probability
  • Statistics, Probability and Uncertainty

Cite this

Empirical performance of cross-validation with oracle methods in a genomics context. / Martinez, Josue G.; Carroll, Raymond J.; Müller, Samuel; Sampson, Joshua N.; Chatterjee, Nilanjan.

In: American Statistician, Vol. 65, No. 4, 11.2011, p. 223-228.

@article{4b2d3887484c4008abe146cf8d410c78,
title = "Empirical performance of cross-validation with oracle methods in a genomics context",
abstract = "When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to nonoracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.",
keywords = "Adaptive lasso, Lasso, Model selection, Oracle estimation",
author = "Martinez, {Josue G.} and Carroll, {Raymond J.} and M{\"u}ller, Samuel and Sampson, {Joshua N.} and Chatterjee, Nilanjan",
year = "2011",
month = "11",
doi = "10.1198/tas.2011.11052",
language = "English (US)",
volume = "65",
pages = "223--228",
journal = "American Statistician",
issn = "0003-1305",
publisher = "American Statistical Association",
number = "4",

}

TY - JOUR

T1 - Empirical performance of cross-validation with oracle methods in a genomics context

AU - Martinez, Josue G.

AU - Carroll, Raymond J.

AU - Müller, Samuel

AU - Sampson, Joshua N.

AU - Chatterjee, Nilanjan

PY - 2011/11

Y1 - 2011/11

AB - When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to nonoracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.

KW - Adaptive lasso

KW - Lasso

KW - Model selection

KW - Oracle estimation

UR - http://www.scopus.com/inward/record.url?scp=84856050251&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84856050251&partnerID=8YFLogxK

U2 - 10.1198/tas.2011.11052

DO - 10.1198/tas.2011.11052

M3 - Article

C2 - 22347720

AN - SCOPUS:84856050251

VL - 65

SP - 223

EP - 228

JO - American Statistician

JF - American Statistician

SN - 0003-1305

IS - 4

ER -