TY - JOUR
T1 - Collinearity and causal diagrams
AU - Schisterman, Enrique F.
AU - Perkins, Neil J.
AU - Mumford, Sunni L.
AU - Ahrens, Katherine A.
AU - Mitchell, Emily M.
N1 - Publisher Copyright: © 2017 Wolters Kluwer Health, Inc.
PY - 2017/1/1
Y1 - 2017/1/1
N2 - Background: Correlated data are ubiquitous in epidemiologic research, particularly in nutritional and environmental epidemiology where mixtures of factors are often studied. Our objectives are to demonstrate how highly correlated data arise in epidemiologic research and provide guidance, using a directed acyclic graph approach, on how to proceed analytically when faced with highly correlated data. Methods: We identified three fundamental structural scenarios in which high correlation between a given variable and the exposure can arise: intermediates, confounders, and colliders. For each of these scenarios, we evaluated the consequences of increasing correlation between the given variable and the exposure on the bias and variance for the total effect of the exposure on the outcome using unadjusted and adjusted models. We derived closed-form solutions for continuous outcomes using linear regression and empirically present our findings for binary outcomes using logistic regression. Results: For models properly specified, total effect estimates remained unbiased even when there was almost perfect correlation between the exposure and a given intermediate, confounder, or collider. In general, as the correlation increased, the variance of the parameter estimate for the exposure in the adjusted models increased, while in the unadjusted models, the variance increased to a lesser extent or decreased. Conclusion: Our findings highlight the importance of considering the causal framework under study when specifying regression models. Strategies that do not take into consideration the causal structure may lead to biased effect estimation for the original question of interest, even under high correlation.
UR - http://www.scopus.com/inward/record.url?scp=84988695200&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84988695200&partnerID=8YFLogxK
U2 - 10.1097/EDE.0000000000000554
DO - 10.1097/EDE.0000000000000554
M3 - Article
C2 - 27676260
AN - SCOPUS:84988695200
SN - 1044-3983
VL - 28
SP - 47
EP - 53
JO - Epidemiology
JF - Epidemiology
IS - 1
ER -