Finding anchors for genomic sequence comparison

Ross A. Lippert; Xiaoyue Zhao; Liliana Florea; Clark Mobarry; Sorin Istrail

doi:10.1089/cmb.2005.12.762

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Recent sequencing of the human and other mammalian genomes has brought about the necessity to align them, to identify and characterize their commonalities and differences. Programs that align whole genomes generally use a seed-and-extend technique: they start from exact or near-exact matches, select a reliable subset of these, called anchors, and then fill in the remaining portions between the anchors using a combination of local and global alignment algorithms. Their parameter choices, however, have so far been primarily heuristic. We present a statistical framework and practical methods for selecting a set of matches that is both sensitive and specific and can constitute a reliable set of anchors for a one-to-one mapping of two genomes from which a whole-genome alignment can be built. Starting from exact matches, we introduce a novel per-base repeat annotation, the Z-score, from which noise and repeat filtering conditions are explored. Dynamic programming-based chaining algorithms are also evaluated as context-based filters. We apply the methods described here to the comparison of two progressive assemblies of the human genome, NCBI build 28 and build 34 (www.genome.ucsc.edu), and show that a significant portion of the two genomes can be found in selected exact matches, with a very limited amount of sequence duplication.
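The anchor-selection idea in the abstract (exact matches, a repeat filter, then chaining as a context filter) can be illustrated with a minimal Python sketch. This is not the authors' method: the uniqueness filter is a crude stand-in for the paper's Z-score annotation, the chaining is a simple quadratic dynamic program, and the function names `exact_matches`, `unique_matches`, and `chain` and the parameter `k` are all hypothetical.

```python
from collections import Counter, defaultdict


def exact_matches(a, b, k):
    """Return all (i, j) with a[i:i+k] == b[j:j+k], using a k-mer index of a."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    return [(i, j)
            for j in range(len(b) - k + 1)
            for i in index.get(b[j:j + k], [])]


def unique_matches(matches):
    """Crude repeat filter (an assumption, not the Z-score): keep a match
    only if its start position occurs once in each sequence."""
    count_a = Counter(i for i, _ in matches)
    count_b = Counter(j for _, j in matches)
    return [(i, j) for i, j in matches
            if count_a[i] == 1 and count_b[j] == 1]


def chain(matches, k):
    """Longest chain of collinear, non-overlapping matches (O(n^2) DP)."""
    ms = sorted(matches)
    if not ms:
        return []
    best = [1] * len(ms)   # best[t]: length of the longest chain ending at ms[t]
    prev = [-1] * len(ms)  # back-pointers for recovering the chain
    for t, (i, j) in enumerate(ms):
        for s in range(t):
            pi, pj = ms[s]
            # ms[s] may precede ms[t] only if it ends before ms[t] starts
            # in both sequences.
            if pi + k <= i and pj + k <= j and best[s] + 1 > best[t]:
                best[t] = best[s] + 1
                prev[t] = s
    t = max(range(len(ms)), key=best.__getitem__)
    out = []
    while t != -1:
        out.append(ms[t])
        t = prev[t]
    return out[::-1]


if __name__ == "__main__":
    a = "AAAACCCCGGGG"
    b = "AAAATTTTGGGG"
    anchors = chain(unique_matches(exact_matches(a, b, 4)), 4)
    print(anchors)
```

The chaining step discards matches that are inconsistent with their neighbors' order, which is one simple way a context-based filter can act; a genome-scale implementation would need a suffix-tree or similar index and a sub-quadratic chaining algorithm.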

Original language	English (US)
Pages (from-to)	762-776
Number of pages	15
Journal	Journal of Computational Biology
Volume	12
Issue number	6
DOIs	https://doi.org/10.1089/cmb.2005.12.762
State	Published - Jul 2005
Externally published	Yes

Keywords

Suffix trees
Whole-genome alignments

ASJC Scopus subject areas

Modeling and Simulation
Molecular Biology
Genetics
Computational Mathematics
Computational Theory and Mathematics

Cite this

@article{f86a9171ec684a7d9dbd177ac7580898,
  title = "Finding anchors for genomic sequence comparison",
  abstract = "Recent sequencing of the human and other mammalian genomes has brought about the necessity to align them, to identify and characterize their commonalities and differences. Programs that align whole genomes generally use a seed-and-extend technique, starting from exact or near-exact matches and selecting a reliable subset of these, called anchors, and then filling in the remaining portions between the anchors using a combination of local and global alignment algorithms, but their choices for the parameters so far have been primarily heuristic. We present a statistical framework and practical methods for selecting a set of matches that is both sensitive and specific and can constitute a reliable set of anchors for a one-to-one mapping of two genomes from which a whole-genome alignment can be built. Starting from exact matches, we introduce a novel per-base repeat annotation, the Z-score, from which noise and repeat filtering conditions are explored. Dynamic programming-based chaining algorithms are also evaluated as context-based filters. We apply the methods described here to the comparison of two progressive assemblies of the human genome, NCBI build 28 and build 34 (www.genome.ucsc.edu), and show that a significant portion of the two genomes can be found in selected exact matches, with very limited amount of sequence duplication.",
  keywords = "Suffix trees, Whole-genome alignments",
  author = "Lippert, {Ross A.} and Xiaoyue Zhao and Liliana Florea and Clark Mobarry and Sorin Istrail",
  year = "2005",
  month = jul,
  doi = "10.1089/cmb.2005.12.762",
  language = "English (US)",
  volume = "12",
  pages = "762--776",
  journal = "Journal of Computational Biology",
  issn = "1066-5277",
  publisher = "Mary Ann Liebert Inc.",
  number = "6",
}

TY - JOUR
T1 - Finding anchors for genomic sequence comparison
AU - Lippert, Ross A.
AU - Zhao, Xiaoyue
AU - Florea, Liliana
AU - Mobarry, Clark
AU - Istrail, Sorin
PY - 2005/7
Y1 - 2005/7
AB - Recent sequencing of the human and other mammalian genomes has brought about the necessity to align them, to identify and characterize their commonalities and differences. Programs that align whole genomes generally use a seed-and-extend technique, starting from exact or near-exact matches and selecting a reliable subset of these, called anchors, and then filling in the remaining portions between the anchors using a combination of local and global alignment algorithms, but their choices for the parameters so far have been primarily heuristic. We present a statistical framework and practical methods for selecting a set of matches that is both sensitive and specific and can constitute a reliable set of anchors for a one-to-one mapping of two genomes from which a whole-genome alignment can be built. Starting from exact matches, we introduce a novel per-base repeat annotation, the Z-score, from which noise and repeat filtering conditions are explored. Dynamic programming-based chaining algorithms are also evaluated as context-based filters. We apply the methods described here to the comparison of two progressive assemblies of the human genome, NCBI build 28 and build 34 (www.genome.ucsc.edu), and show that a significant portion of the two genomes can be found in selected exact matches, with very limited amount of sequence duplication.
KW - Suffix trees
KW - Whole-genome alignments
UR - http://www.scopus.com/inward/record.url?scp=23844538624&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=23844538624&partnerID=8YFLogxK
U2 - 10.1089/cmb.2005.12.762
DO - 10.1089/cmb.2005.12.762
M3 - Article
C2 - 16108715
AN - SCOPUS:23844538624
SN - 1066-5277
VL - 12
SP - 762
EP - 776
JO - Journal of Computational Biology
JF - Journal of Computational Biology
IS - 6
ER -
