Phred-Phrap package to analyses tools: A pipeline to facilitate population genetics re-sequencing studies

Moara Machado; Wagner C.S. Magalhães; Allan Sene; Bruno Araújo; Alessandra C. Faria-Campos; Stephen J. Chanock; Leandro Scott; Guilherme Oliveira; Eduardo Tarazona-Santos; Maira R. Rodrigues

doi:10.1186/2041-2223-2-3

Research output: Contribution to journal › Article › peer-review

30 Scopus citations

Abstract

Background: Targeted re-sequencing is one of the most powerful and widely used strategies for population genetics studies because it allows an unbiased screening for variation that is suitable for a wide variety of organisms. Examples of studies that require re-sequencing data are evolutionary inferences, epidemiological studies designed to capture rare polymorphisms responsible for complex traits and screenings for mutations in families and small populations with high incidences of specific genetic diseases. Despite the advent of next-generation sequencing technologies, Sanger sequencing is still the most popular approach in population genetics studies because of the widespread availability of automatic sequencers based on capillary electrophoresis and because it is still less prone to sequencing errors, which is critical in population genetics studies. Two popular software applications for re-sequencing studies are Phred-Phrap-Consed-Polyphred, which performs base calling, alignment, graphical edition and genotype calling and DNAsp, which performs a set of population genetics analyses. These independent tools are the start and end points of basic analyses. In between the use of these tools, there is a set of basic but error-prone tasks to be performed with re-sequencing data.

Results: In order to assist with these intermediate tasks, we developed a pipeline that facilitates data handling typical of re-sequencing studies. Our pipeline: (1) consolidates different outputs produced by distinct Phred-Phrap-Consed contigs sharing a reference sequence; (2) checks for genotyping inconsistencies; (3) reformats genotyping data produced by Polyphred into a matrix of genotypes with individuals as rows and segregating sites as columns; (4) prepares input files for haplotype inferences using the popular software PHASE; and (5) handles PHASE output files that contain only polymorphic sites to reconstruct the inferred haplotypes including polymorphic and monomorphic sites as required by population genetics software for re-sequencing data such as DNAsp.

Conclusion: We tested the pipeline in re-sequencing studies of haploid and diploid data in humans, plants, animals and microorganisms and observed that it allowed a substantial decrease in the time required for sequencing analyses, as well as being a more controlled process that eliminates several classes of error that may occur when handling datasets. The pipeline is also useful for investigators using other tools for sequencing and population genetics analyses.
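
Steps (3) and (5) of the pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation and it does not parse real Polyphred or PHASE files: the function names, the `(individual, position, genotype)` tuple layout, the `"?"` placeholder for missing calls, and the toy sequences are all assumptions made for illustration. It only shows the two data transformations: pivoting genotype calls into a matrix with individuals as rows and segregating sites as columns, and rebuilding a full-length haplotype by placing phased alleles back into a reference sequence so that monomorphic sites are included.

```python
from typing import Dict, List, Tuple


def genotype_matrix(
    calls: List[Tuple[str, int, str]]
) -> Tuple[List[int], Dict[str, List[str]]]:
    """Pivot (individual, position, genotype) calls into one row per individual.

    Columns follow the sorted list of segregating-site positions.
    Missing calls are filled with "?" (a hypothetical placeholder).
    """
    sites = sorted({pos for _, pos, _ in calls})
    column = {pos: i for i, pos in enumerate(sites)}
    matrix: Dict[str, List[str]] = {}
    for individual, pos, genotype in calls:
        row = matrix.setdefault(individual, ["?"] * len(sites))
        row[column[pos]] = genotype
    return sites, matrix


def rebuild_haplotype(reference: str, sites: List[int], alleles: str) -> str:
    """Place phased alleles into a copy of the reference sequence.

    `sites` holds 1-based positions of the polymorphic sites and
    `alleles` holds one phased allele per site, in the same order.
    Every other position keeps the reference (monomorphic) base.
    """
    if len(sites) != len(alleles):
        raise ValueError("expected one allele per polymorphic site")
    sequence = list(reference)
    for pos, allele in zip(sites, alleles):
        sequence[pos - 1] = allele
    return "".join(sequence)


# Toy data, invented for this example.
calls = [
    ("ind1", 3, "A/G"),
    ("ind1", 7, "C/C"),
    ("ind2", 3, "G/G"),
    ("ind2", 7, "C/T"),
]
sites, matrix = genotype_matrix(calls)
full_haplotype = rebuild_haplotype("ACATGACGTA", sites, "GT")
```

Here `sites` lists the segregating positions, `matrix` maps each individual to its row of genotypes, and `full_haplotype` is the reference with the two phased alleles substituted at positions 3 and 7. Handling of indels, multiple contigs and genotype-consistency checks, which the abstract also mentions, is deliberately left out of the sketch.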

Original language	English (US)
Article number	3
Journal	Investigative Genetics
Volume	2
Issue number	1
DOIs	https://doi.org/10.1186/2041-2223-2-3
State	Published - Feb 1 2011
Externally published	Yes

ASJC Scopus subject areas

Pathology and Forensic Medicine
Molecular Biology
Genetics

Access to Document

10.1186/2041-2223-2-3

Cite this

@article{e418435e6f8549fdb10f40745248d539,

title = "Phred-Phrap package to analyses tools: A pipeline to facilitate population genetics re-sequencing studies",

abstract = "Background: Targeted re-sequencing is one of the most powerful and widely used strategies for population genetics studies because it allows an unbiased screening for variation that is suitable for a wide variety of organisms. Examples of studies that require re-sequencing data are evolutionary inferences, epidemiological studies designed to capture rare polymorphisms responsible for complex traits and screenings for mutations in families and small populations with high incidences of specific genetic diseases. Despite the advent of next-generation sequencing technologies, Sanger sequencing is still the most popular approach in population genetics studies because of the widespread availability of automatic sequencers based on capillary electrophoresis and because it is still less prone to sequencing errors, which is critical in population genetics studies. Two popular software applications for re-sequencing studies are Phred-Phrap-Consed-Polyphred, which performs base calling, alignment, graphical edition and genotype calling and DNAsp, which performs a set of population genetics analyses. These independent tools are the start and end points of basic analyses. In between the use of these tools, there is a set of basic but error-prone tasks to be performed with re-sequencing data. Results: In order to assist with these intermediate tasks, we developed a pipeline that facilitates data handling typical of re-sequencing studies. Our pipeline: (1) consolidates different outputs produced by distinct Phred-Phrap-Consed contigs sharing a reference sequence; (2) checks for genotyping inconsistencies; (3) reformats genotyping data produced by Polyphred into a matrix of genotypes with individuals as rows and segregating sites as columns; (4) prepares input files for haplotype inferences using the popular software PHASE; and (5) handles PHASE output files that contain only polymorphic sites to reconstruct the inferred haplotypes including polymorphic and monomorphic sites as required by population genetics software for re-sequencing data such as DNAsp. Conclusion: We tested the pipeline in re-sequencing studies of haploid and diploid data in humans, plants, animals and microorganisms and observed that it allowed a substantial decrease in the time required for sequencing analyses, as well as being a more controlled process that eliminates several classes of error that may occur when handling datasets. The pipeline is also useful for investigators using other tools for sequencing and population genetics analyses.",

author = "Moara Machado and Magalh{\~a}es, {Wagner C.S.} and Allan Sene and Bruno Ara{\'u}jo and Faria-Campos, {Alessandra C.} and Chanock, {Stephen J.} and Leandro Scott and Guilherme Oliveira and Eduardo Tarazona-Santos and Rodrigues, {Maira R.}",

note = "Funding Information: We are grateful to Flavia Siqueira, Rodrigo Redondo, Renata Acacio, Sharon Savage and Charles Chung for helping us test the pipeline and to the Bioinformatics group of the Core Genotyping Facilities of NCI for their participation in discussions about the pipeline. This work is supported by the National Institutes of Health - Fogarty International Center (1R01TW007894-01 to ETS), Brazilian National Research Council (CNPq), Brazilian Ministry of Education (CAPES Agency) and Minas Gerais State Foundation in Aid of Research (FAPEMIG - CBB-1181/08, PPM-00439-10), CNPq (306879/2009-3) and NIH-Fogarty (TW007012).",

year = "2011",

month = feb,

day = "1",

doi = "10.1186/2041-2223-2-3",

language = "English (US)",

volume = "2",

journal = "Investigative Genetics",

issn = "2041-2223",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Phred-Phrap package to analyses tools

T2 - A pipeline to facilitate population genetics re-sequencing studies

AU - Machado, Moara

AU - Magalhães, Wagner C.S.

AU - Sene, Allan

AU - Araújo, Bruno

AU - Faria-Campos, Alessandra C.

AU - Chanock, Stephen J.

AU - Scott, Leandro

AU - Oliveira, Guilherme

AU - Tarazona-Santos, Eduardo

AU - Rodrigues, Maira R.

N1 - Funding Information: We are grateful to Flavia Siqueira, Rodrigo Redondo, Renata Acacio, Sharon Savage and Charles Chung for helping us test the pipeline and to the Bioinformatics group of the Core Genotyping Facilities of NCI for their participation in discussions about the pipeline. This work is supported by the National Institutes of Health - Fogarty International Center (1R01TW007894-01 to ETS), Brazilian National Research Council (CNPq), Brazilian Ministry of Education (CAPES Agency) and Minas Gerais State Foundation in Aid of Research (FAPEMIG - CBB-1181/08, PPM-00439-10), CNPq (306879/2009-3) and NIH-Fogarty (TW007012).

PY - 2011/2/1

Y1 - 2011/2/1

AB - Background: Targeted re-sequencing is one of the most powerful and widely used strategies for population genetics studies because it allows an unbiased screening for variation that is suitable for a wide variety of organisms. Examples of studies that require re-sequencing data are evolutionary inferences, epidemiological studies designed to capture rare polymorphisms responsible for complex traits and screenings for mutations in families and small populations with high incidences of specific genetic diseases. Despite the advent of next-generation sequencing technologies, Sanger sequencing is still the most popular approach in population genetics studies because of the widespread availability of automatic sequencers based on capillary electrophoresis and because it is still less prone to sequencing errors, which is critical in population genetics studies. Two popular software applications for re-sequencing studies are Phred-Phrap-Consed-Polyphred, which performs base calling, alignment, graphical edition and genotype calling and DNAsp, which performs a set of population genetics analyses. These independent tools are the start and end points of basic analyses. In between the use of these tools, there is a set of basic but error-prone tasks to be performed with re-sequencing data. Results: In order to assist with these intermediate tasks, we developed a pipeline that facilitates data handling typical of re-sequencing studies. Our pipeline: (1) consolidates different outputs produced by distinct Phred-Phrap-Consed contigs sharing a reference sequence; (2) checks for genotyping inconsistencies; (3) reformats genotyping data produced by Polyphred into a matrix of genotypes with individuals as rows and segregating sites as columns; (4) prepares input files for haplotype inferences using the popular software PHASE; and (5) handles PHASE output files that contain only polymorphic sites to reconstruct the inferred haplotypes including polymorphic and monomorphic sites as required by population genetics software for re-sequencing data such as DNAsp. Conclusion: We tested the pipeline in re-sequencing studies of haploid and diploid data in humans, plants, animals and microorganisms and observed that it allowed a substantial decrease in the time required for sequencing analyses, as well as being a more controlled process that eliminates several classes of error that may occur when handling datasets. The pipeline is also useful for investigators using other tools for sequencing and population genetics analyses.

UR - http://www.scopus.com/inward/record.url?scp=81055127876&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=81055127876&partnerID=8YFLogxK

U2 - 10.1186/2041-2223-2-3

DO - 10.1186/2041-2223-2-3

M3 - Article

C2 - 21284835

AN - SCOPUS:81055127876

SN - 2041-2223

VL - 2

JO - Investigative Genetics

JF - Investigative Genetics

IS - 1

M1 - 3

ER -
