Methods for correcting inference based on outcomes predicted by machine learning

Siruo Wang; Tyler H. McCormick; Jeffrey T. Leek

doi:10.1073/pnas.2001238117

Bloomberg School of Public Health

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.
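
The three-step procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the postpi package's implementation or the authors' exact algorithm: it assumes simulated data, a deliberately shrunken ridge predictor standing in for an arbitrary machine-learning model, and a simple linear relationship model between observed and predicted outcomes. All names and parameters (fit_line, ridge_slope, postprediction_sketch, penalty) are hypothetical, and only the point estimate is corrected; variance estimation is not shown.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares fit of y ~ a + b * x; returns array([a, b])."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def ridge_slope(x, y, penalty):
    """No-intercept ridge slope; a hypothetical stand-in for an ML predictor."""
    return float(x @ y / (x @ x + penalty))


def postprediction_sketch(seed=0, n=3000, true_slope=2.0):
    """Return (naive, corrected) slope estimates on simulated data."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = true_slope * x + rng.normal(size=n)

    # Divide the data into training, testing, and validation thirds.
    train, test, valid = np.split(np.arange(n), 3)

    # Step 1 (training set): fit the prediction model.
    # The heavy penalty is an assumption chosen to make predictions biased.
    w = ridge_slope(x[train], y[train], penalty=float(len(train)))

    # Step 2 (testing set): model observed outcomes as a function of
    # predicted outcomes, y ~ g0 + g1 * y_pred (assumed linear here).
    g0, g1 = fit_line(w * x[test], y[test])

    # Step 3 (validation set): only predicted outcomes are used.
    y_pred = w * x[valid]
    naive = fit_line(x[valid], y_pred)[1]                 # ignores prediction error
    corrected = fit_line(x[valid], g0 + g1 * y_pred)[1]   # applies relationship model
    return float(naive), float(corrected)


if __name__ == "__main__":
    naive, corrected = postprediction_sketch()
    print(f"naive slope: {naive:.2f}, corrected slope: {corrected:.2f}")
```

Under these assumptions, the naive slope inherits the predictor's shrinkage, while the corrected slope should land close to the simulated true slope of 2.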

Original language	English (US)
Pages (from-to)	30266-30275
Number of pages	10
Journal	Proceedings of the National Academy of Sciences of the United States of America
Volume	117
Issue number	48
DOIs	https://doi.org/10.1073/pnas.2001238117
State	Published - Dec 1 2020

Keywords

Statistics | machine learning | postprediction inference | interpretability

ASJC Scopus subject areas

General

Cite this

@article{65602833269f49d19e0fb9db98c9318e,

title = "Methods for correcting inference based on outcomes predicted by machine learning",

abstract = "Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.",

keywords = "Statistics, machine learning, postprediction inference, interpretability",

author = "Wang, Siruo and McCormick, {Tyler H.} and Leek, {Jeffrey T.}",

note = "Funding Information: ACKNOWLEDGMENTS. The research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH) under Award R01GM121459, the National Institute of Mental Health of the NIH under Award DP2MH122405, and the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the NIH under Award R21HD095451. Publisher Copyright: {\textcopyright} 2020 National Academy of Sciences. All rights reserved.",

year = "2020",

month = dec,

day = "1",

doi = "10.1073/pnas.2001238117",

language = "English (US)",

volume = "117",

pages = "30266--30275",

journal = "Proceedings of the National Academy of Sciences of the United States of America",

issn = "0027-8424",

publisher = "National Academy of Sciences",

number = "48",

}

TY - JOUR

T1 - Methods for correcting inference based on outcomes predicted by machine learning

AU - Wang, Siruo

AU - McCormick, Tyler H.

AU - Leek, Jeffrey T.

N1 - Funding Information: ACKNOWLEDGMENTS. The research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH) under Award R01GM121459, the National Institute of Mental Health of the NIH under Award DP2MH122405, and the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the NIH under Award R21HD095451. Publisher Copyright: © 2020 National Academy of Sciences. All rights reserved.

PY - 2020/12/1

Y1 - 2020/12/1

N2 - Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.

KW - Statistics

KW - machine learning

KW - postprediction inference

KW - interpretability

UR - http://www.scopus.com/inward/record.url?scp=85097210923&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85097210923&partnerID=8YFLogxK

U2 - 10.1073/pnas.2001238117

DO - 10.1073/pnas.2001238117

M3 - Article

C2 - 33208538

AN - SCOPUS:85097210923

SN - 0027-8424

VL - 117

SP - 30266

EP - 30275

JO - Proceedings of the National Academy of Sciences of the United States of America

JF - Proceedings of the National Academy of Sciences of the United States of America

IS - 48

ER -
