Testing for improvement in prediction model performance

Margaret Sullivan Pepe; Kathleen F. Kerr; Gary Longton; Zheyu Wang

doi:10.1002/sim.5727

Research output: Contribution to journal › Article › peer-review

136 Scopus citations

Abstract

Authors have proposed new methodology in recent years for evaluating the improvement in prediction performance gained by adding a new predictor, Y, to a risk model containing a set of baseline predictors, X, for a binary outcome D. We prove theoretically that null hypotheses concerning no improvement in performance are equivalent to the simple null hypothesis that Y is not a risk factor when controlling for X, H₀: P(D=1|X,Y)=P(D=1|X). Therefore, testing for improvement in prediction performance is redundant if Y has already been shown to be a risk factor. We also investigate properties of tests through simulation studies, focusing on the change in the area under the ROC curve (AUC). An unexpected finding is that standard testing procedures that do not adjust for variability in estimated regression coefficients are extremely conservative. This may explain why the AUC is widely considered insensitive to improvements in prediction performance and suggests that the problem of insensitivity has to do with use of invalid procedures for inference rather than with the measure itself. To avoid redundant testing and use of potentially problematic methods for inference, we recommend that hypothesis testing for no improvement be limited to evaluation of Y as a risk factor, for which methods are well developed and widely available. Analyses of measures of prediction performance should focus on estimation rather than on testing for no improvement in performance.
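The abstract's recommendation, to test H₀: P(D=1|X,Y)=P(D=1|X) directly rather than test for a change in AUC, can be illustrated with a likelihood ratio test between nested logistic models. The sketch below is not the authors' code. It assumes a single baseline predictor X, a linear logistic risk model, and simulated data with made-up coefficients; the helper names `solve`, `fit_logistic`, `lr_test_for_y`, and `simulate` are hypothetical and exist only for this illustration.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_logistic(rows, d, n_iter=25):
    """Fit a logistic model by Newton-Raphson; return (coefficients, log-likelihood).

    rows: list of covariate lists (the caller includes the intercept column).
    d: list of 0/1 outcomes.
    """
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, di in zip(rows, d):
            eta = sum(b * xj for b, xj in zip(beta, x))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (di - mu) * x[j]
                for k in range(p):
                    hess[j][k] += w * x[j] * x[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    loglik = 0.0
    for x, di in zip(rows, d):
        eta = sum(b * xj for b, xj in zip(beta, x))
        loglik += di * eta - math.log1p(math.exp(eta))
    return beta, loglik


def lr_test_for_y(x, y, d):
    """Likelihood ratio test of H0: Y is not a risk factor when controlling for X."""
    base = [[1.0, xi] for xi in x]                     # model for P(D=1|X)
    full = [[1.0, xi, yi] for xi, yi in zip(x, y)]     # model for P(D=1|X,Y)
    _, ll_base = fit_logistic(base, d)
    _, ll_full = fit_logistic(full, d)
    stat = max(0.0, 2.0 * (ll_full - ll_base))
    # Upper tail of a chi-square with 1 degree of freedom: P(|Z| > sqrt(stat)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def simulate(n, beta_y, seed=0):
    """Simulate (X, Y, D) from an assumed logistic model; coefficients are illustrative."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    d = []
    for xi, yi in zip(x, y):
        risk = 1.0 / (1.0 + math.exp(-(-1.0 + 0.8 * xi + beta_y * yi)))
        d.append(1 if rng.random() < risk else 0)
    return x, y, d


if __name__ == "__main__":
    x, y, d = simulate(n=500, beta_y=0.7)
    stat, p_value = lr_test_for_y(x, y, d)
    print(f"LR statistic = {stat:.2f}, p-value = {p_value:.3g}")
```

Under these assumptions, rejecting H₀ here already answers the question of whether Y improves prediction performance, so measures such as the change in AUC would then be reported as estimates with confidence intervals rather than tested again.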

Original language	English (US)
Pages (from-to)	1467-1482
Number of pages	16
Journal	Statistics in Medicine
Volume	32
Issue number	9
DOIs	https://doi.org/10.1002/sim.5727
State	Published - Apr 30 2013
Externally published	Yes

Keywords

Biomarker
Logistic regression
Receiver operating characteristic curve
Risk factors
Risk reclassification

ASJC Scopus subject areas

Epidemiology
Statistics and Probability

Cite this

@article{107cbaa1904a4267a74fd62bbb3d3025,
title = "Testing for improvement in prediction model performance",
abstract = "Authors have proposed new methodology in recent years for evaluating the improvement in prediction performance gained by adding a new predictor, Y, to a risk model containing a set of baseline predictors, X, for a binary outcome D. We prove theoretically that null hypotheses concerning no improvement in performance are equivalent to the simple null hypothesis that Y is not a risk factor when controlling for X, H0: P(D=1|X,Y)=P(D=1|X). Therefore, testing for improvement in prediction performance is redundant if Y has already been shown to be a risk factor. We also investigate properties of tests through simulation studies, focusing on the change in the area under the ROC curve (AUC). An unexpected finding is that standard testing procedures that do not adjust for variability in estimated regression coefficients are extremely conservative. This may explain why the AUC is widely considered insensitive to improvements in prediction performance and suggests that the problem of insensitivity has to do with use of invalid procedures for inference rather than with the measure itself. To avoid redundant testing and use of potentially problematic methods for inference, we recommend that hypothesis testing for no improvement be limited to evaluation of Y as a risk factor, for which methods are well developed and widely available. Analyses of measures of prediction performance should focus on estimation rather than on testing for no improvement in performance.",

keywords = "Biomarker, Logistic regression, Receiver operating characteristic curve, Risk factors, Risk reclassification",
author = "Pepe, {Margaret Sullivan} and Kerr, {Kathleen F.} and Gary Longton and Zheyu Wang",
year = "2013",
month = apr,
day = "30",
doi = "10.1002/sim.5727",
language = "English (US)",
volume = "32",
pages = "1467--1482",
journal = "Statistics in Medicine",
issn = "0277-6715",
publisher = "John Wiley and Sons Ltd",
number = "9",
}

TY - JOUR
T1 - Testing for improvement in prediction model performance
AU - Pepe, Margaret Sullivan
AU - Kerr, Kathleen F.
AU - Longton, Gary
AU - Wang, Zheyu
PY - 2013/4/30
Y1 - 2013/4/30

AB - Authors have proposed new methodology in recent years for evaluating the improvement in prediction performance gained by adding a new predictor, Y, to a risk model containing a set of baseline predictors, X, for a binary outcome D. We prove theoretically that null hypotheses concerning no improvement in performance are equivalent to the simple null hypothesis that Y is not a risk factor when controlling for X, H0: P(D=1|X,Y)=P(D=1|X). Therefore, testing for improvement in prediction performance is redundant if Y has already been shown to be a risk factor. We also investigate properties of tests through simulation studies, focusing on the change in the area under the ROC curve (AUC). An unexpected finding is that standard testing procedures that do not adjust for variability in estimated regression coefficients are extremely conservative. This may explain why the AUC is widely considered insensitive to improvements in prediction performance and suggests that the problem of insensitivity has to do with use of invalid procedures for inference rather than with the measure itself. To avoid redundant testing and use of potentially problematic methods for inference, we recommend that hypothesis testing for no improvement be limited to evaluation of Y as a risk factor, for which methods are well developed and widely available. Analyses of measures of prediction performance should focus on estimation rather than on testing for no improvement in performance.

KW - Biomarker
KW - Logistic regression
KW - Receiver operating characteristic curve
KW - Risk factors
KW - Risk reclassification
UR - http://www.scopus.com/inward/record.url?scp=84876321464&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84876321464&partnerID=8YFLogxK
U2 - 10.1002/sim.5727
DO - 10.1002/sim.5727
M3 - Article
C2 - 23296397
AN - SCOPUS:84876321464
SN - 0277-6715
VL - 32
SP - 1467
EP - 1482
JO - Statistics in Medicine
JF - Statistics in Medicine
IS - 9
ER -
