Abstract
Missing single nucleotide polymorphisms (SNPs) are quite common in genetic association studies. Subjects with missing SNPs are often discarded in analyses, which may seriously undermine the inference of SNP-disease association. In this article, we develop two haplotype-based imputation approaches and one tree-based imputation approach for association studies. The emphasis is on evaluating the impact of imputation on parameter estimation, compared to the standard practice of ignoring missing data. The haplotype-based approaches build on haplotype reconstruction by the expectation-maximization (EM) algorithm or a weighted EM (WEM) algorithm, depending on whether case-control status is taken into account. The tree-based approach uses a Gibbs sampler to iteratively sample from a full conditional distribution, which is obtained from the classification and regression tree (CART) algorithm. We employ a standard multiple imputation procedure to account for the uncertainty of imputation. We apply the methods to simulated data as well as to a case-control study on developmental dyslexia. Our results suggest that imputation generally improves efficiency over the standard practice of ignoring missing data. The tree-based approach performs comparably to the haplotype-based approaches and has a computational advantage over them. The WEM approach yields the smallest bias at the price of increased variance.
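The "standard multiple imputation procedure" mentioned above is not spelled out in the abstract. A minimal sketch of how such a procedure usually pools results, assuming it follows Rubin's combining rules, is given below. The function name `pool_rubin` and the numbers in the usage note are hypothetical illustrations, not values from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation results with Rubin's rules (assumed, not confirmed by the abstract).

    estimates: point estimates (e.g. log odds ratios), one per imputed data set
    variances: squared standard errors of those estimates, in the same order
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)    # pooled point estimate
    u_bar = mean(variances)    # average within-imputation variance
    b = variance(estimates)    # between-imputation variance (divides by m - 1)

    # Total variance: within part plus inflated between part,
    # which carries the extra uncertainty due to imputation.
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total
```

For example, `pool_rubin([0.5, 0.7, 0.6], [0.04, 0.05, 0.03])` pools three hypothetical log odds ratio estimates. The between-imputation term is what makes the pooled standard error larger than that of any single imputed data set treated as if it were complete.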
Original language | English (US) |
---|---|
Pages (from-to) | 690-702 |
Number of pages | 13 |
Journal | Genetic Epidemiology |
Volume | 30 |
Issue number | 8 |
State | Published - Dec 2006 |
Keywords
- CART
- EM algorithm
- Gibbs sampler
- Linkage disequilibrium
- Multiple imputation
ASJC Scopus subject areas
- Epidemiology
- Genetics (clinical)