Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data

Suyash S. Shringarpure; Rasika A. Mathias; Ryan D. Hernandez; Timothy D. O'Connor; Zachary A. Szpiech; Raul Torres; Francisco M. De La Vega; Carlos D. Bustamante; Kathleen C. Barnes; Margaret A. Taub

doi:10.1093/bioinformatics/btw786

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Motivation: Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X).

Results: We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, Real Time Genomics' multi-sample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS data may be accompanied by genotype data for the same samples, either collected concurrently with sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies.
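The abstract reports true positive rates at a fixed False Positive Rate of 5%. As a rough illustration of how such a figure can be read off classifier scores, the sketch below picks the most permissive score threshold whose false positive rate stays within a target and reports the true positive rate there. It is not the authors' implementation: the function name `tpr_at_fpr`, the toy scores, and the idea that each call is labelled true or false by concordance with the genotype array are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def tpr_at_fpr(
    scores: Sequence[float],
    is_true: Sequence[bool],
    target_fpr: float = 0.05,
) -> Tuple[float, float]:
    """Return (threshold, TPR) for the lowest score threshold whose FPR
    does not exceed target_fpr. Calls with score >= threshold are kept.

    scores  : hypothetical classifier scores, higher = more likely true
    is_true : hypothetical labels, e.g. True when the sequencing call
              agrees with the genotype array (an assumption here)
    """
    n_pos = sum(1 for t in is_true if t)
    n_neg = len(is_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false call")

    # Walk down the calls from highest to lowest score.
    ranked = sorted(zip(scores, is_true), key=lambda p: p[0], reverse=True)
    best_threshold, best_tpr = float("inf"), 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Calls with identical scores must be accepted or rejected together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > target_fpr:
            break  # lowering the threshold further would exceed the target
        best_threshold, best_tpr = score, tp / n_pos
    return best_threshold, best_tpr


# Toy data: ten calls, five labelled true and five labelled false.
toy_scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
toy_labels = [True, True, True, False, True, True, False, False, False, False]
threshold, tpr = tpr_at_fpr(toy_scores, toy_labels, target_fpr=0.2)
print(f"threshold={threshold}, TPR={tpr}")
```

With only five false calls in the toy data, a 5% target would admit no false positives at all, so the usage line uses a 20% target purely to keep the example small.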

Original language	English (US)
Pages (from-to)	1147-1153
Number of pages	7
Journal	Bioinformatics
Volume	33
Issue number	8
DOIs	https://doi.org/10.1093/bioinformatics/btw786
State	Published - Apr 15 2017

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Access to Document

10.1093/bioinformatics/btw786

Cite this

Shringarpure, S. S., Mathias, R. A., Hernandez, R. D., O'Connor, T. D., Szpiech, Z. A., Torres, R., De La Vega, F. M., Bustamante, C. D., Barnes, K. C., & Taub, M. A. (2017). Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data. Bioinformatics, 33(8), 1147-1153. https://doi.org/10.1093/bioinformatics/btw786

Shringarpure, SS, Mathias, RA, Hernandez, RD, O'Connor, TD, Szpiech, ZA, Torres, R, De La Vega, FM, Bustamante, CD, Barnes, KC & Taub, MA 2017, 'Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data', Bioinformatics, vol. 33, no. 8, pp. 1147-1153. https://doi.org/10.1093/bioinformatics/btw786

@article{079a6f8fdd0b430b8d6124077052a857,

title = "Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data",

abstract = "Motivation: Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X). Results: We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, Real Time Genomics' multi-sample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS data may be accompanied by genotype data for the same samples, either collected concurrently with sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies.",

author = "Shringarpure, {Suyash S.} and Mathias, {Rasika A.} and Hernandez, {Ryan D.} and O'Connor, {Timothy D.} and Szpiech, {Zachary A.} and Torres, Raul and {De La Vega}, {Francisco M.} and Bustamante, {Carlos D.} and Barnes, {Kathleen C.} and Taub, {Margaret A.}",

note = "Funding Information: Funding for this study was provided by the National Institutes of Health (NIH) R01HL104608/HL/NHLBI. Publisher Copyright: {\textcopyright} 2016 The Author. Published by Oxford University Press.",

year = "2017",

month = apr,

day = "15",

doi = "10.1093/bioinformatics/btw786",

language = "English (US)",

volume = "33",

pages = "1147--1153",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "8",

}

TY - JOUR

T1 - Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data

AU - Shringarpure, Suyash S.

AU - Mathias, Rasika A.

AU - Hernandez, Ryan D.

AU - O'Connor, Timothy D.

AU - Szpiech, Zachary A.

AU - Torres, Raul

AU - De La Vega, Francisco M.

AU - Bustamante, Carlos D.

AU - Barnes, Kathleen C.

AU - Taub, Margaret A.

PY - 2017/4/15

Y1 - 2017/4/15

AB - Motivation: Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X). Results: We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, Real Time Genomics' multi-sample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS data may be accompanied by genotype data for the same samples, either collected concurrently with sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies.

UR - http://www.scopus.com/inward/record.url?scp=85019054886&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85019054886&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btw786

DO - 10.1093/bioinformatics/btw786

M3 - Article

C2 - 28035032

AN - SCOPUS:85019054886

SN - 1367-4803

VL - 33

SP - 1147

EP - 1153

JO - Bioinformatics

JF - Bioinformatics

IS - 8

ER -
