Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment

Leming Zhou, Liliana D Florea

Research output: Contribution to journal (Article)

Abstract

As the demand for accurately aligning gene sequences to the genome of a related species grows with the sequencing of new genomes, spaced seeds emerge as a promising vehicle for increasing alignment sensitivity. We extend the existing {0, 1} match-mismatch models for sensitivity evaluation to take into account the compositional structure of coding sequences and ultimately produce seeds better suited to this particular application. Designing seeds for alignment programs, however, needs to balance sensitivity and specificity. We assess the effects of seed variations on both sensitivity and specificity in an extended model that incorporates transitions and differentiates among the three codon positions, and show that spaced seeds with transitions offer a better sensitivity-specificity tradeoff. Furthermore, we propose a theoretical formulation for rigorously assessing seed specificity, starting from Bernoulli and Markov models of the mRNA and genomic sequences. Within this framework, we perform the first comprehensive analysis of seeds to serve as a blueprint for selecting sensitive and specific seeds for practical applications. Our analyses show that specificity is relatively constant for seeds of a given weight, while sensitivity varies widely, with the highest values attained by seeds allowing a small (2-6) number of transitions. A strategy for designing seeds, therefore, is to first select the weight of the seed by identifying the desired sensitivity-specificity tradeoff, then choose the most sensitive seed(s) within that weight group. We illustrate our methods with the alignment of chicken coding sequences against the human genome assembly version HG17.
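The abstract's ideas of seed weight, transition-tolerant positions, and specificity can be illustrated with a small sketch. This is not the authors' model: it assumes a hypothetical three-symbol seed notation ('1' = match required, 'T' = match or transition allowed, '0' = don't care) and a uniform, independent (Bernoulli) base composition, and it ignores the codon-position and Markov refinements described in the paper. The function names are illustrative.

```python
from fractions import Fraction

# Transition partners: purine <-> purine (A/G), pyrimidine <-> pyrimidine (C/T).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def position_ok(symbol, x, y):
    """Check one aligned base pair against one seed symbol.

    Assumed notation (not taken from the paper):
    '1' = exact match, 'T' = match or transition, '0' = don't care.
    """
    if symbol == "1":
        return x == y
    if symbol == "T":
        return x == y or TRANSITION.get(x) == y
    if symbol == "0":
        return True
    raise ValueError(f"unknown seed symbol: {symbol!r}")


def seed_hits(seed, a, b):
    """Return the offsets at which the seed hits two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    span = len(seed)
    return [
        i
        for i in range(len(a) - span + 1)
        if all(position_ok(s, a[i + j], b[i + j]) for j, s in enumerate(seed))
    ]


def random_hit_probability(seed):
    """Chance of a hit at a fixed offset between two unrelated sequences,
    assuming uniform i.i.d. bases: 1/4 per '1' position, 1/2 per 'T' position."""
    p = Fraction(1)
    for s in seed:
        if s == "1":
            p *= Fraction(1, 4)
        elif s == "T":
            p *= Fraction(1, 2)
        elif s != "0":
            raise ValueError(f"unknown seed symbol: {s!r}")
    return p


if __name__ == "__main__":
    print(seed_hits("1T01", "ACGTACGT", "ACATACGT"))
    print(random_hit_probability("1110"), random_hit_probability("1T1T"))
```

Under these simplifying assumptions, if a 'T' position is counted as half a unit of weight, the chance of a random hit depends only on the total weight and not on how the positions are arranged. This is in line with the abstract's observation that specificity is relatively constant for seeds of a given weight, while sensitivity depends on the seed's shape.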

Original language: English (US)
Pages (from-to): 113-130
Number of pages: 18
Journal: Journal of Computational Biology
Volume: 14
Issue number: 2
DOI: 10.1089/cmb.2006.0130
State: Published - Mar 2007
Externally published: Yes

Keywords

  • Markov model
  • mRNA-to-genome alignments
  • Sensitivity
  • Spaced seeds
  • Specificity

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics

Cite this

Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment. / Zhou, Leming; Florea, Liliana D.

In: Journal of Computational Biology, Vol. 14, No. 2, 03.2007, p. 113-130.

@article{7ef290b3e74f44ef891dfb7a4cc0658a,
title = "Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment",
abstract = "As the demand for accurately aligning gene sequences to the genome of a related species grows with the sequencing of new genomes, spaced seeds emerge as a promising vehicle for increasing alignment sensitivity. We extend the existing {0, 1} match-mismatch models for sensitivity evaluation to take into account the compositional structure of coding sequences and ultimately produce seeds better suited to this particular application. Designing seeds for alignment programs, however, needs to balance sensitivity and specificity. We assess the effects of seed variations on both sensitivity and specificity in an extended model that incorporates transitions and differentiates among the three codon positions, and show that spaced seeds with transitions offer a better sensitivity-specificity tradeoff. Furthermore, we propose a theoretical formulation for rigorously assessing seed specificity, starting from Bernoulli and Markov models of the mRNA and genomic sequences. Within this framework, we perform the first comprehensive analysis of seeds to serve as a blueprint for selecting sensitive and specific seeds for practical applications. Our analyses show that specificity is relatively constant for seeds of a given weight, while sensitivity varies widely, with the highest values attained by seeds allowing a small (2-6) number of transitions. A strategy for designing seeds, therefore, is to first select the weight of the seed by identifying the desired sensitivity-specificity tradeoff, then choose the most sensitive seed(s) within that weight group. We illustrate our methods with the alignment of chicken coding sequences against the human genome assembly version HG17.",
keywords = "Markov model, mRNA-to-genome alignments, Sensitivity, Spaced seeds, Specificity",
author = "Zhou, Leming and Florea, {Liliana D.}",
year = "2007",
month = "3",
doi = "10.1089/cmb.2006.0130",
language = "English (US)",
volume = "14",
pages = "113--130",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "2",
}

TY - JOUR
T1 - Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment
AU - Zhou, Leming
AU - Florea, Liliana D
PY - 2007/3
Y1 - 2007/3
AB - As the demand for accurately aligning gene sequences to the genome of a related species grows with the sequencing of new genomes, spaced seeds emerge as a promising vehicle for increasing alignment sensitivity. We extend the existing {0, 1} match-mismatch models for sensitivity evaluation to take into account the compositional structure of coding sequences and ultimately produce seeds better suited to this particular application. Designing seeds for alignment programs, however, needs to balance sensitivity and specificity. We assess the effects of seed variations on both sensitivity and specificity in an extended model that incorporates transitions and differentiates among the three codon positions, and show that spaced seeds with transitions offer a better sensitivity-specificity tradeoff. Furthermore, we propose a theoretical formulation for rigorously assessing seed specificity, starting from Bernoulli and Markov models of the mRNA and genomic sequences. Within this framework, we perform the first comprehensive analysis of seeds to serve as a blueprint for selecting sensitive and specific seeds for practical applications. Our analyses show that specificity is relatively constant for seeds of a given weight, while sensitivity varies widely, with the highest values attained by seeds allowing a small (2-6) number of transitions. A strategy for designing seeds, therefore, is to first select the weight of the seed by identifying the desired sensitivity-specificity tradeoff, then choose the most sensitive seed(s) within that weight group. We illustrate our methods with the alignment of chicken coding sequences against the human genome assembly version HG17.

KW - Markov model
KW - mRNA-to-genome alignments
KW - Sensitivity
KW - Spaced seeds
KW - Specificity
UR - http://www.scopus.com/inward/record.url?scp=34248158342&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34248158342&partnerID=8YFLogxK
U2 - 10.1089/cmb.2006.0130
DO - 10.1089/cmb.2006.0130
M3 - Article
C2 - 17456011
AN - SCOPUS:34248158342
VL - 14
SP - 113
EP - 130
JO - Journal of Computational Biology
JF - Journal of Computational Biology
SN - 1066-5277
IS - 2
ER -