Pitfalls of merging GWAS data: Lessons learned in the eMERGE network and quality control procedures to maintain high data quality

Rebecca L. Zuvich; Loren L. Armstrong; Suzette J. Bielinski; Yuki Bradford; Christopher S. Carlson; Dana C. Crawford; Andrew T. Crenshaw; Mariza de Andrade; Kimberly F. Doheny; Jonathan L. Haines; M. Geoffrey Hayes; Gail P. Jarvik; Lan Jiang; Iftikhar J. Kullo; Rongling Li; Hua Ling; Teri A. Manolio; Martha E. Matsumoto; Catherine A. Mccarty; Andrew N. Mcdavid; Daniel B. Mirel; Lana M. Olson; Justin E. Paschall; Elizabeth W. Pugh; Luke V. Rasmussen; Laura J. Rasmussen-Torvik; Stephen D. Turner; Russell A. Wilke; Marylyn D. Ritchie

doi:10.1002/gepi.20639

School of Medicine

Research output: Contribution to journal › Article › peer-review

46 Scopus citations

Abstract

Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient reuse of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute-supported consortium of five sites that performs genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms that together comprise a core set of 14 phenotypes, used to extract study samples from each site's DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or the Center for Inherited Disease Research using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample and marker quality and various batch effects. Upon completion of the genotyping and QC analyses for each site's primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset reentered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here, we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion of the eMERGE QC pipeline for merged datasets. These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II, and will also serve as a starting point for investigators merging multiple genotype datasets accessible through the National Center for Biotechnology Information in the database of Genotypes and Phenotypes (dbGaP). Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
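The abstract names marker quality and batch effects as QC targets but does not give the checks themselves. As a rough illustration only, and not the eMERGE pipeline, the sketch below flags markers that either have a low call rate or show a large allele-frequency difference between two genotyping centers. The function names, the 0/1/2 genotype coding, and both thresholds (`min_call_rate`, `max_freq_diff`) are assumptions made for this example; a real pipeline would more likely use a formal statistical test and dedicated tooling.

```python
from typing import Dict, List, Optional

# Hypothetical coding: count of one allele per sample (0, 1, 2); None = no call.
Genotypes = Dict[str, List[Optional[int]]]  # marker ID -> per-sample genotypes


def call_rate(calls: List[Optional[int]]) -> float:
    """Fraction of samples with a non-missing genotype call."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def allele_freq(calls: List[Optional[int]]) -> Optional[float]:
    """Frequency of the counted allele among non-missing calls."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return None
    return sum(observed) / (2 * len(observed))


def flag_markers(
    center_a: Genotypes,
    center_b: Genotypes,
    min_call_rate: float = 0.98,   # illustrative threshold, not from the paper
    max_freq_diff: float = 0.10,   # illustrative threshold, not from the paper
) -> Dict[str, str]:
    """Return {marker: reason} for markers shared by both centers that fail a check."""
    flagged: Dict[str, str] = {}
    for marker in sorted(set(center_a) & set(center_b)):
        a, b = center_a[marker], center_b[marker]
        if min(call_rate(a), call_rate(b)) < min_call_rate:
            flagged[marker] = "low call rate"
            continue
        fa, fb = allele_freq(a), allele_freq(b)
        if fa is not None and fb is not None and abs(fa - fb) > max_freq_diff:
            flagged[marker] = "allele frequency differs between centers"
    return flagged


if __name__ == "__main__":
    center_a = {"rs1": [0, 1, 2, 1], "rs2": [0, 0, 1, None], "rs3": [2, 2, 2, 2]}
    center_b = {"rs1": [1, 1, 0, 2], "rs2": [0, 1, 0, 0], "rs3": [0, 0, 0, 1]}
    for marker, reason in flag_markers(center_a, center_b).items():
        print(marker, reason)
```

A large frequency gap between centers for the same marker is one symptom that can accompany a batch effect or an allele-coding mismatch, which is why a check of this kind is a plausible checkpoint after a merge.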

Original language	English (US)
Pages (from-to)	887-898
Number of pages	12
Journal	Genetic epidemiology
Volume	35
Issue number	8
DOIs	https://doi.org/10.1002/gepi.20639
State	Published - Dec 2011

Keywords

DbGaP
EMERGE
Genome-wide association (GWAS)
Merging datasets
Quality control

ASJC Scopus subject areas

Epidemiology
Genetics (clinical)

Cite this

Zuvich, R. L., Armstrong, L. L., Bielinski, S. J., Bradford, Y., Carlson, C. S., Crawford, D. C., Crenshaw, A. T., de Andrade, M., Doheny, K. F., Haines, J. L., Hayes, M. G., Jarvik, G. P., Jiang, L., Kullo, I. J., Li, R., Ling, H., Manolio, T. A., Matsumoto, M. E., Mccarty, C. A., ... Ritchie, M. D. (2011). Pitfalls of merging GWAS data: Lessons learned in the eMERGE network and quality control procedures to maintain high data quality. Genetic epidemiology, 35(8), 887-898. https://doi.org/10.1002/gepi.20639

Zuvich, RL, Armstrong, LL, Bielinski, SJ, Bradford, Y, Carlson, CS, Crawford, DC, Crenshaw, AT, de Andrade, M, Doheny, KF, Haines, JL, Hayes, MG, Jarvik, GP, Jiang, L, Kullo, IJ, Li, R, Ling, H, Manolio, TA, Matsumoto, ME, Mccarty, CA, Mcdavid, AN, Mirel, DB, Olson, LM, Paschall, JE, Pugh, EW, Rasmussen, LV, Rasmussen-Torvik, LJ, Turner, SD, Wilke, RA & Ritchie, MD 2011, 'Pitfalls of merging GWAS data: Lessons learned in the eMERGE network and quality control procedures to maintain high data quality', Genetic epidemiology, vol. 35, no. 8, pp. 887-898. https://doi.org/10.1002/gepi.20639

@article{7ea21c01cdd04b09949b7d918482c4ca,

title = "Pitfalls of merging GWAS data: Lessons learned in the eMERGE network and quality control procedures to maintain high data quality",

keywords = "DbGaP, EMERGE, Genome-wide association (GWAS), Merging datasets, Quality control",

author = "Zuvich, {Rebecca L.} and Armstrong, {Loren L.} and Bielinski, {Suzette J.} and Yuki Bradford and Carlson, {Christopher S.} and Crawford, {Dana C.} and Crenshaw, {Andrew T.} and {de Andrade}, Mariza and Doheny, {Kimberly F.} and Haines, {Jonathan L.} and Hayes, {M. Geoffrey} and Jarvik, {Gail P.} and Lan Jiang and Kullo, {Iftikhar J.} and Rongling Li and Hua Ling and Manolio, {Teri A.} and Matsumoto, {Martha E.} and Mccarty, {Catherine A.} and Mcdavid, {Andrew N.} and Mirel, {Daniel B.} and Olson, {Lana M.} and Paschall, {Justin E.} and Pugh, {Elizabeth W.} and Rasmussen, {Luke V.} and Rasmussen-Torvik, {Laura J.} and Turner, {Stephen D.} and Wilke, {Russell A.} and Ritchie, {Marylyn D.}",

year = "2011",

month = dec,

doi = "10.1002/gepi.20639",

language = "English (US)",

volume = "35",

pages = "887--898",

journal = "Genetic epidemiology",

issn = "0741-0395",

publisher = "Wiley-Liss Inc.",

number = "8",

}

TY - JOUR

T1 - Pitfalls of merging GWAS data

T2 - Lessons learned in the eMERGE network and quality control procedures to maintain high data quality

AU - Zuvich, Rebecca L.

AU - Armstrong, Loren L.

AU - Bielinski, Suzette J.

AU - Bradford, Yuki

AU - Carlson, Christopher S.

AU - Crawford, Dana C.

AU - Crenshaw, Andrew T.

AU - de Andrade, Mariza

AU - Doheny, Kimberly F.

AU - Haines, Jonathan L.

AU - Hayes, M. Geoffrey

AU - Jarvik, Gail P.

AU - Jiang, Lan

AU - Kullo, Iftikhar J.

AU - Li, Rongling

AU - Ling, Hua

AU - Manolio, Teri A.

AU - Matsumoto, Martha E.

AU - Mccarty, Catherine A.

AU - Mcdavid, Andrew N.

AU - Mirel, Daniel B.

AU - Olson, Lana M.

AU - Paschall, Justin E.

AU - Pugh, Elizabeth W.

AU - Rasmussen, Luke V.

AU - Rasmussen-Torvik, Laura J.

AU - Turner, Stephen D.

AU - Wilke, Russell A.

AU - Ritchie, Marylyn D.

PY - 2011/12

Y1 - 2011/12

KW - DbGaP

KW - EMERGE

KW - Genome-wide association (GWAS)

KW - Merging datasets

KW - Quality control

UR - http://www.scopus.com/inward/record.url?scp=82355161508&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=82355161508&partnerID=8YFLogxK

U2 - 10.1002/gepi.20639

DO - 10.1002/gepi.20639

M3 - Article

C2 - 22125226

AN - SCOPUS:82355161508

SN - 0741-0395

VL - 35

SP - 887

EP - 898

JO - Genetic epidemiology

JF - Genetic epidemiology

IS - 8

ER -
