Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features

Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, Michael Beer

Research output: Contribution to journal (Article)

Abstract

Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
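
The sparsity problem and the gapped k-mer idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' gkm-SVM implementation and does not use the tree data structure mentioned above; it enumerates features naively and is only practical for short sequences. The function names, the parameter names `l` (window length) and `k` (number of unmasked positions), and the use of `-` to mark a gap position are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers (illustrative definition, an assumption here):
    slide a window of length l over seq, keep k positions, and mask the
    remaining l - k positions with '-'."""
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        for keep in combinations(range(l), k):
            kept = set(keep)
            feature = "".join(c if j in kept else "-" for j, c in enumerate(window))
            counts[feature] += 1
    return counts


def dot(a, b):
    """Inner product of two sparse count vectors (a simple similarity score)."""
    return sum(v * b.get(key, 0) for key, v in a.items())


# Two 8-nt sequences that differ at a single base (index 3).
seq_a = "ACGTACGT"
seq_b = "ACGAACGT"

# Full 8-mers: each sequence yields one feature, and the two do not match.
shared_full = dot(kmer_counts(seq_a, 8), kmer_counts(seq_b, 8))

# Gapped features with l = 8, k = 6: every pattern that masks the
# mismatched position is shared between the two sequences.
shared_gapped = dot(gapped_kmer_counts(seq_a, 8, 6), gapped_kmer_counts(seq_b, 8, 6))

print(shared_full)    # 0
print(shared_gapped)  # 7
```

With full 8-mers, a single mismatch leaves the two sequences with nothing in common, which is the binary, noisy regime the abstract describes for large k. With gapped features, the 7 patterns (out of 28 per sequence) that mask the mismatched position still match, so the similarity degrades gradually instead of dropping to zero.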

Original language: English (US)
Article number: e1003711
Journal: PLoS Computational Biology
Volume: 10
Issue number: 7
DOI: 10.1371/journal.pcbi.1003711
State: Published - 2014

Fingerprint

regulatory sequences
Classifiers
prediction
Oligomers
Data structures
DNA
Genes
Bayes Classifier
Tissue
Proteins
Functional Genomics
Statistical Learning
Binary Variables
Alternatives
Robust Estimation
Sequence Analysis
Protein Sequence
Tree Structure
methodology

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Modeling and Simulation
  • Ecology, Evolution, Behavior and Systematics
  • Genetics
  • Molecular Biology
  • Ecology
  • Cellular and Molecular Neuroscience
  • Medicine(all)

Cite this

Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. / Ghandi, Mahmoud; Lee, Dongwon; Mohammad-Noori, Morteza; Beer, Michael.

In: PLoS Computational Biology, Vol. 10, No. 7, e1003711, 2014.

Ghandi, Mahmoud; Lee, Dongwon; Mohammad-Noori, Morteza; Beer, Michael. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. In: PLoS Computational Biology. 2014; Vol. 10, No. 7.

@article{08332934581d40c1a11d359deda059b6,
title = "Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features",
abstract = "Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Na{\"i}ve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.",
author = "Mahmoud Ghandi and Dongwon Lee and Morteza Mohammad-Noori and Michael Beer",
year = "2014",
doi = "10.1371/journal.pcbi.1003711",
language = "English (US)",
volume = "10",
journal = "PLoS Computational Biology",
issn = "1553-734X",
publisher = "Public Library of Science",
number = "7",
}

TY - JOUR

T1 - Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features

AU - Ghandi, Mahmoud

AU - Lee, Dongwon

AU - Mohammad-Noori, Morteza

AU - Beer, Michael

PY - 2014

Y1 - 2014

AB - Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.

UR - http://www.scopus.com/inward/record.url?scp=84905484602&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84905484602&partnerID=8YFLogxK

U2 - 10.1371/journal.pcbi.1003711

DO - 10.1371/journal.pcbi.1003711

M3 - Article

C2 - 25033408

AN - SCOPUS:84905484602

VL - 10

JO - PLoS Computational Biology

JF - PLoS Computational Biology

SN - 1553-734X

IS - 7

M1 - e1003711

ER -