Variable selection using iterative reformulation of training set models for discrimination of samples: Application to gas chromatography/mass spectrometry of mouse urinary metabolites

Kanet Wongravee, Nina Heinrich, Maria Holmboe, Michele Schaefer, Randall R Reed, Jose Trevejo, Richard G. Brereton

Research output: Contribution to journal › Article

Abstract

The paper discusses variable selection as used in large metabolomic studies, exemplified by mouse urinary gas chromatography of 441 mice in three experiments to detect the influence of age, diet, and stress on their chemosignal. Partial least squares discriminant analysis (PLS-DA) was applied to obtain class models, using a procedure of 20 000 iterations including the bootstrap for model optimization and random splits into test and training sets for validation. Variables are selected using PLS regression coefficients on the training set, with an optimized number of components obtained from the bootstrap. The variables are ranked in order of significance, and the overall optimal variables are selected as those that appear as highly significant over 100 different test and training set splits. Cost/benefit analysis of performing the model on a reduced number of variables is also illustrated. This paper provides a strategy for properly validated methods for determining which variables are most significant for discriminating between two groups in large metabolomic data sets. It avoids the common pitfall of overfitting that arises when variables are selected on a combined training and test set, and it takes into account that different variables may be selected each time the samples are split into training and test sets by iterative procedures.
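
The selection idea in the abstract (rank variables on the training set only, repeat over many random splits, keep those that rank highly most often) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single PLS component instead of a bootstrap-optimized number of components, uses mean-centring only, omits the test-set validation step, and runs on synthetic data. The function names (`pls1_one_component`, `selection_frequency`, `make_toy_data`) and the parameters `top_k` and `train_fraction` are hypothetical, introduced only for this illustration.

```python
import random


def column_means(X):
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]


def pls1_one_component(X, y):
    """Regression coefficients of a one-component PLS1 model (mean-centred data).

    Assumption: one component only; the paper optimizes the number of
    components with the bootstrap, which is not reproduced here.
    """
    mx = column_means(X)
    my = sum(y) / len(y)
    Xc = [[v - m for v, m in zip(row, mx)] for row in X]
    yc = [v - my for v in y]
    p = len(mx)
    # Weight vector is proportional to X^T y, then normalised to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(len(yc))) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:
        return [0.0] * p
    w = [v / norm for v in w]
    # Scores t = Xw and the regression of y on t give the coefficient vector.
    t = [sum(x * wj for x, wj in zip(row, w)) for row in Xc]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return [wj * q for wj in w]


def selection_frequency(X, y, n_splits=100, train_fraction=2 / 3, top_k=2, seed=0):
    """Fraction of random splits in which each variable ranks in the top_k."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    n_train = int(round(train_fraction * n))
    counts = [0] * p
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        train = idx[:n_train]  # test samples are never used for ranking
        b = pls1_one_component([X[i] for i in train], [y[i] for i in train])
        ranked = sorted(range(p), key=lambda j: abs(b[j]), reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    return [c / n_splits for c in counts]


def make_toy_data(n=60, p=8, seed=1):
    """Synthetic two-class data: only variables 0 and 1 carry class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1.0 if i % 2 == 0 else -1.0
        row = [rng.gauss(0, 1) for _ in range(p)]
        row[0] += 2.0 * label
        row[1] -= 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for j, f in enumerate(selection_frequency(X, y)):
        print(f"variable {j}: selected in {f:.0%} of splits")
```

Because the ranking inside each split sees only the training samples, the held-out samples stay available for an unbiased check of any model built on the selected variables, which is the overfitting safeguard the abstract emphasizes.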

Original language: English (US)
Pages (from-to): 5204-5217
Number of pages: 14
Journal: Analytical Chemistry
Volume: 81
Issue number: 13
DOIs: 10.1021/ac900251c
State: Published - Jul 1 2009

Fingerprint

Metabolites
Gas chromatography
Mass spectrometry
Cost benefit analysis
Discriminant analysis
Nutrition
Experiments
Metabolomics

ASJC Scopus subject areas

  • Analytical Chemistry

Variable selection using iterative reformulation of training set models for discrimination of samples: Application to gas chromatography/mass spectrometry of mouse urinary metabolites. / Wongravee, Kanet; Heinrich, Nina; Holmboe, Maria; Schaefer, Michele; Reed, Randall R; Trevejo, Jose; Brereton, Richard G.

In: Analytical Chemistry, Vol. 81, No. 13, 01.07.2009, p. 5204-5217.

@article{5947a99ff80a4ce5a5ec4501c618d117,
title = "Variable selection using iterative reformulation of training set models for discrimination of samples: Application to gas chromatography/mass spectrometry of mouse urinary metabolites",
abstract = "The paper discusses variable selection as used in large metabolomic studies, exemplified by mouse urinary gas chromatography of 441 mice in three experiments to detect the influence of age, diet, and stress on their chemosignal. Partial least squares discriminant analysis (PLS-DA) was applied to obtain class models, using a procedure of 20 000 iterations including the bootstrap for model optimization and random splits into test and training sets for validation. Variables are selected using PLS regression coefficients on the training set using an optimized number of components obtained from the bootstrap. The variables are ranked in order of significance, and the overall optimal variables are selected as those that appear as highly significant over 100 different test and training set splits. Cost/benefit analysis of performing the model on a reduced number of variables is also illustrated. This paper provides a strategy for properly validated methods for determining which variables are most significant for discriminating between two groups in large metabolomic data sets avoiding the common pitfall of overfitting if variables are selected on a combined training and test set and also taking into account that different variables may be selected each time the samples are split into training and test sets using iterative procedures.",
author = "Kanet Wongravee and Nina Heinrich and Maria Holmboe and Michele Schaefer and Reed, {Randall R} and Jose Trevejo and Brereton, {Richard G.}",
year = "2009",
month = "7",
day = "1",
doi = "10.1021/ac900251c",
language = "English (US)",
volume = "81",
pages = "5204--5217",
journal = "Analytical Chemistry",
issn = "0003-2700",
publisher = "American Chemical Society",
number = "13"
}

TY - JOUR
T1 - Variable selection using iterative reformulation of training set models for discrimination of samples
T2 - Application to gas chromatography/mass spectrometry of mouse urinary metabolites
AU - Wongravee, Kanet
AU - Heinrich, Nina
AU - Holmboe, Maria
AU - Schaefer, Michele
AU - Reed, Randall R
AU - Trevejo, Jose
AU - Brereton, Richard G.
PY - 2009/7/1
Y1 - 2009/7/1

AB - The paper discusses variable selection as used in large metabolomic studies, exemplified by mouse urinary gas chromatography of 441 mice in three experiments to detect the influence of age, diet, and stress on their chemosignal. Partial least squares discriminant analysis (PLS-DA) was applied to obtain class models, using a procedure of 20 000 iterations including the bootstrap for model optimization and random splits into test and training sets for validation. Variables are selected using PLS regression coefficients on the training set using an optimized number of components obtained from the bootstrap. The variables are ranked in order of significance, and the overall optimal variables are selected as those that appear as highly significant over 100 different test and training set splits. Cost/benefit analysis of performing the model on a reduced number of variables is also illustrated. This paper provides a strategy for properly validated methods for determining which variables are most significant for discriminating between two groups in large metabolomic data sets avoiding the common pitfall of overfitting if variables are selected on a combined training and test set and also taking into account that different variables may be selected each time the samples are split into training and test sets using iterative procedures.

UR - http://www.scopus.com/inward/record.url?scp=67649948769&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67649948769&partnerID=8YFLogxK
U2 - 10.1021/ac900251c
DO - 10.1021/ac900251c
M3 - Article
C2 - 19507882
AN - SCOPUS:67649948769
VL - 81
SP - 5204
EP - 5217
JO - Analytical Chemistry
JF - Analytical Chemistry
SN - 0003-2700
IS - 13
ER -