Quantifying uncertainty in genotype calls

Benilton S. Carvalho, Thomas Louis, Rafael A. Irizarry

Research output: Contribution to journal (Article)

Abstract

Motivation: Genome-wide association studies (GWAS) are used to discover genes underlying complex, heritable disorders for which less powerful study designs have failed in the past. The number of GWAS has skyrocketed recently, with findings reported in top journals and the mainstream media. Microarrays are the genotype-calling technology of choice in GWAS, as they permit exploration of more than a million single nucleotide polymorphisms (SNPs) simultaneously. The starting point for the statistical analyses used by GWAS to determine association between loci and disease is making genotype calls (AA, AB or BB). However, the raw data, microarray probe intensities, are heavily processed before arriving at these calls. Various sophisticated statistical procedures have been proposed for transforming raw data into genotype calls. We find that variability in microarray output quality across different SNPs, different arrays and different sample batches has a substantial influence on the accuracy of genotype calls made by existing algorithms. Failure to account for these sources of variability can adversely affect the quality of findings reported by GWAS.

Results: We developed a method based on an enhanced version of the multi-level model used by CRLMM version 1. Two key differences are that we now account for variability across batches and improve the assessment of each individual call. The new model permits the development of quality metrics for SNPs, samples and batches of samples. Using three independent datasets, we demonstrate that CRLMM version 2 outperforms CRLMM version 1 and Birdseed, the algorithm provided by Affymetrix. The main advantage of the new approach is that it enables the identification of low-quality SNPs, samples and batches.

Original language: English (US)
Article number: btp624
Pages (from-to): 242-249
Number of pages: 8
Journal: Bioinformatics
Volume: 26
Issue number: 2
DOIs: https://doi.org/10.1093/bioinformatics/btp624
State: Published - 2010

Fingerprint

Genome-Wide Association Study
Genotype
Uncertainty
Single nucleotide Polymorphism
Genes
Association reactions
Single Nucleotide Polymorphism
Nucleotides
Genome
Polymorphism
Batch
Microarrays
Microarray
Multilevel Models
Microarray Data
Locus
Disorder
Technology
Probe
Gene

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability

Cite this

Quantifying uncertainty in genotype calls. / Carvalho, Benilton S.; Louis, Thomas; Irizarry, Rafael A.

In: Bioinformatics, Vol. 26, No. 2, btp624, 2010, p. 242-249.

Carvalho, BS, Louis, T & Irizarry, RA 2010, 'Quantifying uncertainty in genotype calls', Bioinformatics, vol. 26, no. 2, btp624, pp. 242-249. https://doi.org/10.1093/bioinformatics/btp624
Carvalho, Benilton S. ; Louis, Thomas ; Irizarry, Rafael A. / Quantifying uncertainty in genotype calls. In: Bioinformatics. 2010 ; Vol. 26, No. 2. pp. 242-249.
@article{5bab4d03b3b344318bdedc153528c792,
title = "Quantifying uncertainty in genotype calls",
author = "Carvalho, {Benilton S.} and Thomas Louis and Irizarry, {Rafael A.}",
year = "2010",
doi = "10.1093/bioinformatics/btp624",
language = "English (US)",
volume = "26",
pages = "242--249",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "2",
}

TY - JOUR

T1 - Quantifying uncertainty in genotype calls

AU - Carvalho, Benilton S.

AU - Louis, Thomas

AU - Irizarry, Rafael A.

PY - 2010

Y1 - 2010

UR - http://www.scopus.com/inward/record.url?scp=75249089803&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=75249089803&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btp624

DO - 10.1093/bioinformatics/btp624

M3 - Article

C2 - 19906825

AN - SCOPUS:75249089803

VL - 26

SP - 242

EP - 249

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 2

M1 - btp624

ER -