Robust k-mer frequency estimation using gapped k-mers

Mahmoud Ghandi, Morteza Mohammad-Noori, Michael Beer

Research output: Contribution to journal (Article)

Abstract

Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and the count data rapidly approach a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
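
The idea of recovering full-length k-mer frequencies from gapped k-mer counts can be illustrated numerically on a toy problem. The sketch below is not the closed-form equation derived in the paper; it is a brute-force stand-in that builds the linear map from full-length oligomer counts to gapped k-mer counts and applies the Moore-Penrose pseudoinverse, which yields the minimum norm solution of that linear system. The names `design_matrix`, `min_norm_estimate`, `L_LEN`, and `K`, the gap symbol "-", and the example sequence are all assumptions made for illustration.

```python
from itertools import combinations, product

import numpy as np

ALPHABET = "ACGT"


def lmer_index(l):
    """Map every full-length oligomer of length l to a column index."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=l))}


def gapped_kmers(l, k):
    """All length-l patterns with k fixed bases and l - k gaps ('-')."""
    patterns = []
    for positions in combinations(range(l), k):
        for bases in product(ALPHABET, repeat=k):
            pattern = ["-"] * l
            for pos, base in zip(positions, bases):
                pattern[pos] = base
            patterns.append("".join(pattern))
    return patterns


def matches(pattern, lmer):
    """True if lmer agrees with pattern at every non-gap position."""
    return all(p == "-" or p == c for p, c in zip(pattern, lmer))


def design_matrix(l, k):
    """Binary matrix A with A[g, u] = 1 when oligomer u matches gapped k-mer g."""
    idx = lmer_index(l)
    patterns = gapped_kmers(l, k)
    A = np.zeros((len(patterns), len(idx)))
    for row, pattern in enumerate(patterns):
        for lmer, col in idx.items():
            if matches(pattern, lmer):
                A[row, col] = 1.0
    return A, patterns, idx


def count_lmers(seq, l, idx):
    """Count overlapping full-length oligomers in seq."""
    x = np.zeros(len(idx))
    for i in range(len(seq) - l + 1):
        x[idx[seq[i:i + l]]] += 1
    return x


def min_norm_estimate(A, y):
    """Minimum norm solution of A x = y via the pseudoinverse."""
    return np.linalg.pinv(A) @ y


# Toy sizes: full length 3 with 2 informative positions (hypothetical choice).
L_LEN, K = 3, 2
A, patterns, idx = design_matrix(L_LEN, K)

seq = "ACGTACGGTCA"                  # made-up sequence, not real data
x = count_lmers(seq, L_LEN, idx)     # sparse full-length counts
y = A @ x                            # gapped k-mer counts implied by x
x_hat = min_norm_estimate(A, y)      # smoothed estimate of the full-length counts
```

In this toy setting the estimate `x_hat` reproduces the gapped k-mer counts exactly while spreading mass over oligomers that share gapped patterns with the observed ones, which is the smoothing effect the abstract describes. The brute-force matrix grows as 4^l columns, so this approach is only practical for very small lengths.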

Original language: English (US)
Pages (from-to): 469-500
Number of pages: 32
Journal: Journal of Mathematical Biology
Volume: 69
Issue number: 2
DOI: 10.1007/s00285-013-0705-3
State: Published - 2014

Keywords

  • DNA sequence
  • Frequency estimation
  • k-mer
  • Oligomer
  • Statistical learning

ASJC Scopus subject areas

  • Agricultural and Biological Sciences (miscellaneous)
  • Applied Mathematics
  • Modeling and Simulation
  • Medicine (all)

Cite this

Robust k-mer frequency estimation using gapped k-mers. / Ghandi, Mahmoud; Mohammad-Noori, Morteza; Beer, Michael.

In: Journal of Mathematical Biology, Vol. 69, No. 2, 2014, p. 469-500.

@article{cbc6b70fc1b140d0a5a4823126034b3e,
title = "Robust k-mer frequency estimation using gapped k-mers",
keywords = "DNA sequence, Frequency estimation, k-mer, Oligomer, Statistical learning",
author = "Mahmoud Ghandi and Morteza Mohammad-Noori and Michael Beer",
year = "2014",
doi = "10.1007/s00285-013-0705-3",
language = "English (US)",
volume = "69",
pages = "469--500",
journal = "Journal of Mathematical Biology",
issn = "0303-6812",
publisher = "Springer Verlag",
number = "2",

}

TY - JOUR

T1 - Robust k-mer frequency estimation using gapped k-mers

AU - Ghandi, Mahmoud

AU - Mohammad-Noori, Morteza

AU - Beer, Michael

PY - 2014

Y1 - 2014

KW - DNA sequence

KW - Frequency estimation

KW - k-mer

KW - Oligomer

KW - Statistical learning

UR - http://www.scopus.com/inward/record.url?scp=84904385033&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84904385033&partnerID=8YFLogxK

U2 - 10.1007/s00285-013-0705-3

DO - 10.1007/s00285-013-0705-3

M3 - Article

VL - 69

SP - 469

EP - 500

JO - Journal of Mathematical Biology

JF - Journal of Mathematical Biology

SN - 0303-6812

IS - 2

ER -