EnsembleCNV: An ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data

Zhongyang Zhang, Haoxiang Cheng, Xiumei Hong, Antonio F. DI Narzo, Oscar Franzen, Shouneng Peng, Arno Ruusalepp, Jason C. Kovacic, Johan L.M. Bjorkegren, Xiaobin Wang, Ke Hao

Research output: Contribution to journalArticlepeer-review

Abstract

The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at raw data level; (b) assembles individual CNV calls into CNV regions (CNVRs) from multiple existing callers with complementary strengths by a heuristic algorithm; (c) re-genotypes each CNVR with local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries by local correlation structure in copy number intensities; (e) provides direct CNV genotyping accompanied with confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.

Original languageEnglish (US)
Article numbere39
JournalNucleic acids research
Volume47
Issue number7
DOIs
StatePublished - Apr 23 2019

ASJC Scopus subject areas

  • Genetics

Fingerprint

Dive into the research topics of 'EnsembleCNV: An ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data'. Together they form a unique fingerprint.

Cite this