TY - JOUR
T1 - The utility of multivariate outlier detection techniques for data quality evaluation in large studies
T2 - An application within the ONDRI project
AU - Sunderland, Kelly M.
AU - Beaton, Derek
AU - Fraser, Julia
AU - Kwan, Donna
AU - McLaughlin, Paula M.
AU - Montero-Odasso, Manuel
AU - Peltsch, Alicia J.
AU - Pieruccini-Faria, Frederico
AU - Sahlas, Demetrios J.
AU - Swartz, Richard H.
AU - Bartha, Robert
AU - Black, Sandra E.
AU - Borrie, Michael
AU - Corbett, Dale
AU - Finger, Elizabeth
AU - Freedman, Morris
AU - Greenberg, Barry
AU - Grimes, David A.
AU - Hegele, Robert A.
AU - Hudson, Chris
AU - Lang, Anthony E.
AU - Masellis, Mario
AU - McIlroy, William E.
AU - Munoz, David G.
AU - Munoz, Douglas P.
AU - Orange, J. B.
AU - Strong, Michael J.
AU - Symons, Sean
AU - Tartaglia, Maria Carmela
AU - Troyer, Angela
AU - Zinman, Lorne
AU - Strother, Stephen C.
AU - Binns, Malcolm A.
N1 - Funding Information: This research was conducted with the support of the Ontario Brain Institute, an independent non-profit corporation, funded partially by the Ontario government. The opinions, results, and conclusions are those of the authors and no endorsement by the Ontario Brain Institute is intended or should be inferred. DB is partly supported by a Canadian Institutes of Health Research grant (MOP 201403). MMO is supported by grants from the Canadian Institutes of Health Research (PJT 153100), the Ontario Ministry of Research and Innovation (ER11–08-101), the Canadian Consortium in Neurodegeneration in Aging (CAN 137794), and by the Department of Medicine Program of Experimental Medicine Research Award (768915), University of Western Ontario, and a CIHR Investigator Award. RHS is supported by the Department of Medicine at Sunnybrook Health Sciences Centre and the University of Toronto, and a Heart and Stroke New Investigator Award.
N1 - Publisher Copyright: © 2019 The Author(s).
PY - 2019/5/15
Y1 - 2019/5/15
N2 - Background: Large and complex studies are now routine, and quality assurance and quality control (QC) procedures ensure reliable results and conclusions. Standard procedures may comprise manual verification and double entry, but these labour-intensive methods often leave errors undetected. Outlier detection uses a data-driven approach to identify patterns exhibited by the majority of the data and highlights data points that deviate from these patterns. Univariate methods consider each variable independently, so observations that appear odd only when two or more variables are considered simultaneously remain undetected. We propose a data quality evaluation process that emphasizes the use of multivariate outlier detection for identifying errors, and show that univariate approaches alone are insufficient. Further, we establish an iterative process that uses multiple multivariate approaches, communication between teams, and visualization for other large-scale projects to follow. Methods: We illustrate this process with preliminary neuropsychology and gait data for the vascular cognitive impairment cohort from the Ontario Neurodegenerative Disease Research Initiative, a multi-cohort observational study that aims to characterize biomarkers within and between five neurodegenerative diseases. Each dataset was evaluated four times: with and without covariate adjustment using two validated multivariate methods - Minimum Covariance Determinant (MCD) and Candès' Robust Principal Component Analysis (RPCA) - and results were assessed in relation to two univariate methods. Outlying participants identified by multiple multivariate analyses were compiled and communicated to the data teams for verification. Results: Of 161 and 148 participants in the neuropsychology and gait data, 44 and 43 were flagged by one or both multivariate methods and errors were identified for 8 and 5 participants, respectively. MCD identified all participants with errors, while RPCA identified 6/8 and 3/5 for the neuropsychology and gait data, respectively. Both outperformed univariate approaches. Adjusting for covariates had a minor effect on the participants identified as outliers, though did affect error detection. Conclusions: Manual QC procedures are insufficient for large studies as many errors remain undetected. In these data, the MCD outperforms the RPCA for identifying errors, and both are more successful than univariate approaches. Therefore, data-driven multivariate outlier techniques are essential tools for QC as data become more complex.
KW - Minimum covariance determinant
KW - Multivariate outliers
KW - Principal component analysis
KW - Quality control
KW - Visualization
UR - http://www.scopus.com/inward/record.url?scp=85066041162&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066041162&partnerID=8YFLogxK
U2 - 10.1186/s12874-019-0737-5
DO - 10.1186/s12874-019-0737-5
M3 - Article
C2 - 31092212
AN - SCOPUS:85066041162
VL - 19
JO - BMC Medical Research Methodology
JF - BMC Medical Research Methodology
SN - 1471-2288
IS - 1
M1 - 102
ER -