Bayesian nonparametric cross-study validation of prediction methods

Lorenzo Trippa; Levi Waldron; Curtis Huttenhower; Giovanni Parmigiani

doi:10.1214/14-AOAS798

Research output: Contribution to journal › Article › peer-review

15 Scopus citations

Abstract

We consider comparisons of statistical learning algorithms using multiple data sets, via leave-one-in cross-study validation: each of the algorithms is trained on one data set; the resulting model is then validated on each remaining data set. This poses two statistical challenges that need to be addressed simultaneously. The first is the assessment of study heterogeneity, with the aim of identifying a subset of studies within which algorithm comparisons can be reliably carried out. The second is the comparison of algorithms using the ensemble of data sets. We address both problems by integrating clustering and model comparison. We formulate a Bayesian model for the array of cross-study validation statistics, which defines clusters of studies with similar properties and provides the basis for meaningful algorithm comparison in the presence of study heterogeneity. We illustrate our approach through simulations involving studies with varying severity of systematic errors, and in the context of medical prognosis for patients diagnosed with cancer, using high-throughput measurements of the transcriptional activity of the tumor’s genes.
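The leave-one-in scheme described in the abstract can be illustrated with a small sketch: each study in turn serves as the only training set, and the fitted model is scored on every other study, giving an array of cross-study validation statistics. This is only an illustration of that array, not the authors' Bayesian clustering model. The learner (`fit_slope`), the score (`mse`), the function `cross_study_matrix`, and the toy studies are all hypothetical names and data introduced here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A study is a pair (x values, y values). Toy representation, assumed for illustration.
Study = Tuple[Sequence[float], Sequence[float]]


def fit_slope(study: Study) -> float:
    """Toy learner (an assumption): least-squares slope through the origin."""
    xs, ys = study
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def mse(slope: float, study: Study) -> float:
    """Toy validation statistic (an assumption): mean squared prediction error."""
    xs, ys = study
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_study_matrix(
    studies: List[Study],
    fit: Callable[[Study], float],
    score: Callable[[float, Study], float],
) -> List[List[Optional[float]]]:
    """Entry [i][j] is the score of the model trained on study i, validated on study j.

    Diagonal entries stay None: a study is never validated on itself.
    """
    n = len(studies)
    matrix: List[List[Optional[float]]] = [[None] * n for _ in range(n)]
    for i, train in enumerate(studies):
        model = fit(train)  # train on exactly one study
        for j, test in enumerate(studies):
            if i != j:
                matrix[i][j] = score(model, test)  # validate on each remaining study
    return matrix


# Two studies that agree (y = 2x) and one that differs systematically (y = 5x).
studies = [
    ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    ([1.0, 2.0], [2.0, 4.0]),
    ([1.0, 2.0], [5.0, 10.0]),
]
Z = cross_study_matrix(studies, fit_slope, mse)
for row in Z:
    print(row)
```

In this toy array the first two studies validate each other with zero error, while every pairing that involves the third study scores poorly. That is the kind of heterogeneity pattern the abstract proposes to capture by clustering studies, which this sketch does not attempt.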

Original language	English (US)
Pages (from-to)	402-428
Number of pages	27
Journal	Annals of Applied Statistics
Volume	9
Issue number	1
DOIs	https://doi.org/10.1214/14-AOAS798
State	Published - 2015
Externally published	Yes

Keywords

Bayesian nonparametrics
Cancer signatures
Meta-analysis
Random partitions
Reproducibility
Validation analysis

ASJC Scopus subject areas

Statistics, Probability and Uncertainty
Modeling and Simulation
Statistics and Probability

Cite this

@article{936a1f5579164e3bb8e01183a76aacd1,
  title = "Bayesian nonparametric cross-study validation of prediction methods",
  abstract = "We consider comparisons of statistical learning algorithms using multiple data sets, via leave-one-in cross-study validation: each of the algorithms is trained on one data set; the resulting model is then validated on each remaining data set. This poses two statistical challenges that need to be addressed simultaneously. The first is the assessment of study heterogeneity, with the aim of identifying a subset of studies within which algorithm comparisons can be reliably carried out. The second is the comparison of algorithms using the ensemble of data sets. We address both problems by integrating clustering and model comparison. We formulate a Bayesian model for the array of cross-study validation statistics, which defines clusters of studies with similar properties and provides the basis for meaningful algorithm comparison in the presence of study heterogeneity. We illustrate our approach through simulations involving studies with varying severity of systematic errors, and in the context of medical prognosis for patients diagnosed with cancer, using high-throughput measurements of the transcriptional activity of the tumor{\textquoteright}s genes.",
  keywords = "Bayesian nonparametrics, Cancer signatures, Meta-analysis, Random partitions, Reproducibility, Validation analysis",
  author = "Lorenzo Trippa and Levi Waldron and Curtis Huttenhower and Giovanni Parmigiani",
  year = "2015",
  doi = "10.1214/14-AOAS798",
  language = "English (US)",
  volume = "9",
  pages = "402--428",
  journal = "Annals of Applied Statistics",
  issn = "1932-6157",
  publisher = "Institute of Mathematical Statistics",
  number = "1",
}

TY - JOUR
T1 - Bayesian nonparametric cross-study validation of prediction methods
AU - Trippa, Lorenzo
AU - Waldron, Levi
AU - Huttenhower, Curtis
AU - Parmigiani, Giovanni
PY - 2015
Y1 - 2015
N2 - We consider comparisons of statistical learning algorithms using multiple data sets, via leave-one-in cross-study validation: each of the algorithms is trained on one data set; the resulting model is then validated on each remaining data set. This poses two statistical challenges that need to be addressed simultaneously. The first is the assessment of study heterogeneity, with the aim of identifying a subset of studies within which algorithm comparisons can be reliably carried out. The second is the comparison of algorithms using the ensemble of data sets. We address both problems by integrating clustering and model comparison. We formulate a Bayesian model for the array of cross-study validation statistics, which defines clusters of studies with similar properties and provides the basis for meaningful algorithm comparison in the presence of study heterogeneity. We illustrate our approach through simulations involving studies with varying severity of systematic errors, and in the context of medical prognosis for patients diagnosed with cancer, using high-throughput measurements of the transcriptional activity of the tumor’s genes.
AB - We consider comparisons of statistical learning algorithms using multiple data sets, via leave-one-in cross-study validation: each of the algorithms is trained on one data set; the resulting model is then validated on each remaining data set. This poses two statistical challenges that need to be addressed simultaneously. The first is the assessment of study heterogeneity, with the aim of identifying a subset of studies within which algorithm comparisons can be reliably carried out. The second is the comparison of algorithms using the ensemble of data sets. We address both problems by integrating clustering and model comparison. We formulate a Bayesian model for the array of cross-study validation statistics, which defines clusters of studies with similar properties and provides the basis for meaningful algorithm comparison in the presence of study heterogeneity. We illustrate our approach through simulations involving studies with varying severity of systematic errors, and in the context of medical prognosis for patients diagnosed with cancer, using high-throughput measurements of the transcriptional activity of the tumor’s genes.
KW - Bayesian nonparametrics
KW - Cancer signatures
KW - Meta-analysis
KW - Random partitions
KW - Reproducibility
KW - Validation analysis
UR - http://www.scopus.com/inward/record.url?scp=84929669517&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84929669517&partnerID=8YFLogxK
U2 - 10.1214/14-AOAS798
DO - 10.1214/14-AOAS798
M3 - Article
AN - SCOPUS:84929669517
SN - 1932-6157
VL - 9
SP - 402
EP - 428
JO - Annals of Applied Statistics
JF - Annals of Applied Statistics
IS - 1
ER -
