The utility of multivariate outlier detection techniques for data quality evaluation in large studies

An application within the ONDRI project

Kelly M. Sunderland, Derek Beaton, Julia Fraser, Donna Kwan, Paula M. McLaughlin, Manuel Montero-Odasso, Alicia J. Peltsch, Frederico Pieruccini-Faria, Demetrios J. Sahlas, Richard H. Swartz, Robert Bartha, Sandra E. Black, Michael Borrie, Dale Corbett, Elizabeth Finger, Morris Freedman, Barry Greenberg, David A. Grimes, Robert A. Hegele, Chris Hudson, Anthony E. Lang, Mario Masellis, William E. McIlroy, David G. Munoz, Douglas P. Munoz, J. B. Orange, Michael J. Strong, Sean Symons, Maria Carmela Tartaglia, Angela Troyer, Lorne Zinman, Stephen C. Strother & Malcolm A. Binns

Research output: Contribution to journal › Article

Abstract

Background: Large and complex studies are now routine, and quality assurance and quality control (QC) procedures are essential to ensure reliable results and conclusions. Standard procedures may comprise manual verification and double entry, but these labour-intensive methods often leave errors undetected. Outlier detection uses a data-driven approach to identify patterns exhibited by the majority of the data and highlights data points that deviate from these patterns. Univariate methods consider each variable independently, so observations that appear odd only when two or more variables are considered simultaneously remain undetected. We propose a data quality evaluation process that emphasizes the use of multivariate outlier detection for identifying errors, and show that univariate approaches alone are insufficient. Further, we establish an iterative process that uses multiple multivariate approaches, communication between teams, and visualization, for other large-scale projects to follow.

Methods: We illustrate this process with preliminary neuropsychology and gait data for the vascular cognitive impairment cohort from the Ontario Neurodegenerative Disease Research Initiative (ONDRI), a multi-cohort observational study that aims to characterize biomarkers within and between five neurodegenerative diseases. Each dataset was evaluated four times: with and without covariate adjustment, using two validated multivariate methods, the Minimum Covariance Determinant (MCD) and Candès' Robust Principal Component Analysis (RPCA), and results were assessed in relation to two univariate methods. Outlying participants identified by multiple multivariate analyses were compiled and communicated to the data teams for verification.

Results: Of the 161 and 148 participants in the neuropsychology and gait data, 44 and 43 were flagged by one or both multivariate methods, and errors were identified for 8 and 5 participants, respectively. The MCD identified all participants with errors, while the RPCA identified 6/8 and 3/5 for the neuropsychology and gait data, respectively. Both outperformed the univariate approaches. Adjusting for covariates had a minor effect on which participants were identified as outliers, though it did affect error detection.

Conclusions: Manual QC procedures are insufficient for large studies, as many errors remain undetected. In these data, the MCD outperforms the RPCA for identifying errors, and both are more successful than univariate approaches. Data-driven multivariate outlier techniques are therefore essential QC tools as data become more complex.
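To make concrete the kind of multivariate screening the abstract describes, the sketch below flags observations by their robust Mahalanobis distance computed from a Minimum Covariance Determinant fit and contrasts this with a simple per-variable robust z-score screen. It is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes Python with NumPy, SciPy, and scikit-learn (whose MinCovDet estimator implements the MCD), and all function and parameter names (flag_multivariate_outliers, flag_univariate_outliers, alpha, z_cut) are illustrative.

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def flag_multivariate_outliers(X, alpha=0.025, random_state=0):
    # Fit the MCD to obtain robust location and scatter estimates, then flag
    # rows whose squared robust Mahalanobis distance exceeds the chi-squared
    # cutoff at level alpha.
    X = np.asarray(X, dtype=float)
    mcd = MinCovDet(random_state=random_state).fit(X)
    d2 = mcd.mahalanobis(X)                       # squared robust distances
    cutoff = chi2.ppf(1.0 - alpha, df=X.shape[1])
    return d2 > cutoff, d2

def flag_univariate_outliers(X, z_cut=3.5):
    # Per-variable screen: |robust z| > z_cut on any variable, using the
    # median and MAD (scaled by 1.4826 for consistency with the SD).
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) * 1.4826
    mad[mad == 0] = 1.0                           # guard against constant columns
    return (np.abs((X - med) / mad) > z_cut).any(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two strongly correlated toy variables; the first row is unremarkable on
    # each variable separately but inconsistent with their joint pattern.
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=200)
    X[0] = [2.0, -2.0]
    mv_flags, _ = flag_multivariate_outliers(X)
    uv_flags = flag_univariate_outliers(X)
    print("row 0 flagged multivariately:", bool(mv_flags[0]),
          "| univariately:", bool(uv_flags[0]))

Run on these toy data, the first row is caught by the multivariate screen but passes the univariate one, which mirrors the motivation the abstract gives for multivariate outlier detection. The second sketch below is a bare-bones Robust Principal Component Analysis in the sense of Candès et al. (principal component pursuit via alternating singular-value and soft thresholding); again, this is an assumed, simplified formulation rather than the implementation used in the study. Rows with large entries in the sparse component S are candidate outliers.

def rpca_pcp(M, max_iter=500, tol=1e-7):
    # Decompose M into a low-rank part L and a sparse part S by alternating
    # singular-value thresholding (for L) and entrywise soft-thresholding
    # (for S), with a running multiplier matrix Y (augmented Lagrangian scheme).
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))                # standard PCP weight
    mu = m * n / (4.0 * np.abs(M).sum())          # common step-size choice
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Y = Y + mu * (M - L - S)
        if np.linalg.norm(M - L - S) <= tol * norm_M:
            break
    return L, S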

Original language: English (US)
Article number: 102
Journal: BMC Medical Research Methodology
Volume: 19
Issue number: 1
DOI: 10.1186/s12874-019-0737-5
State: Published - May 15, 2019

Keywords

  • Minimum covariance determinant
  • Multivariate outliers
  • Principal component analysis
  • Quality control
  • Visualization

ASJC Scopus subject areas

  • Epidemiology
  • Health Informatics

Cite this

Sunderland, K.M., Beaton, D., Fraser, J., Kwan, D., McLaughlin, P.M., Montero-Odasso, M., Peltsch, A.J., Pieruccini-Faria, F., Sahlas, D.J., Swartz, R.H., Bartha, R., Black, S.E., Borrie, M., Corbett, D., Finger, E., Freedman, M., Greenberg, B., Grimes, D.A., Hegele, R.A., Hudson, C., Lang, A.E., Masellis, M., McIlroy, W.E., Munoz, D.G., Munoz, D.P., Orange, J.B., Strong, M.J., Symons, S., Tartaglia, M.C., Troyer, A., Zinman, L., Strother, S.C. & Binns, M.A. 2019, 'The utility of multivariate outlier detection techniques for data quality evaluation in large studies: An application within the ONDRI project', BMC Medical Research Methodology, vol. 19, no. 1, 102. https://doi.org/10.1186/s12874-019-0737-5