A species-generalized probabilistic model-based definition of CpG islands

Rafael A. Irizarry, Hao Wu, Andrew P. Feinberg

Research output: Contribution to journal (Article)

Abstract

The DNA of most vertebrates is depleted in CpG dinucleotides, the target for DNA methylation. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). CGI have been useful for marking functionally relevant epigenetic loci for genome studies. For example, CGI are enriched in the promoters of vertebrate genes and are thought to play an important role in regulation. Currently, CGI are defined algorithmically as an observed-to-expected ratio (O/E) of CpG greater than 0.6, G+C content greater than 0.5, and usually but not necessarily greater than a certain length. Here we find that the current definition leaves out important CpG clusters associated with epigenetic marks, relevant to development and disease, and does not apply at all to nonvertebrate genomes. We propose an alternative Hidden Markov model-based approach that solves these problems. We fit our model to genomes from 30 species, and the results support a new epigenomic view toward the development of DNA methylation in species diversity and evolution. The O/E of CpG in islands and nonislands segregated closely phylogenetically and showed substantial loss in both groups in animals of greater complexity, while maintaining a nearly constant difference in CpG O/E between islands and nonisland compartments. Lists of CGI for some species are available at http://www.rafalab.org.
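
The conventional threshold-based definition mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' Hidden Markov model; it only shows the O/E and G+C criteria they describe as the current definition. The function names are hypothetical, the expected CpG count is assumed to be (#C × #G) / length, and the 200 bp minimum length is an assumed default, since the abstract only says "a certain length".

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (G+C content, CpG observed-to-expected ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so this counts every CpG
    gc_content = (c + g) / n if n else 0.0
    # Assumed expectation under independence of C and G: (#C * #G) / length
    expected = c * g / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return gc_content, obs_exp


def meets_conventional_definition(
    seq: str,
    min_oe: float = 0.6,   # O/E threshold quoted in the abstract
    min_gc: float = 0.5,   # G+C threshold quoted in the abstract
    min_len: int = 200,    # assumption: the abstract does not give a length
) -> bool:
    """Check a candidate region against the threshold-based CGI criteria."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_oe
```

For example, `meets_conventional_definition("CG" * 100)` passes all three criteria, while a 200 bp stretch of alternating A and T fails on both G+C content and O/E. A fixed-threshold check like this depends on cutoffs tuned to one kind of genome, which is the limitation the abstract raises.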

Original language: English (US)
Pages (from-to): 674-680
Number of pages: 7
Journal: Mammalian Genome
Volume: 20
Issue number: 9-10
DOIs: 10.1007/s00335-009-9222-5
State: Published - Oct 2009

Fingerprint

CpG Islands
Statistical Models
Epigenomics
Genome
DNA Methylation
Vertebrates
Base Composition
Islands
DNA
Genes

ASJC Scopus subject areas

  • Genetics

Cite this

A species-generalized probabilistic model-based definition of CpG islands. / Irizarry, Rafael A.; Wu, Hao; Feinberg, Andrew P.

In: Mammalian Genome, Vol. 20, No. 9-10, 10.2009, p. 674-680.

Irizarry, Rafael A. ; Wu, Hao ; Feinberg, Andrew P. / A species-generalized probabilistic model-based definition of CpG islands. In: Mammalian Genome. 2009 ; Vol. 20, No. 9-10. pp. 674-680.
@article{a08d3d87ecc94160ad06aa451835529c,
title = "A species-generalized probabilistic model-based definition of CpG islands",
abstract = "The DNA of most vertebrates is depleted in CpG dinucleotides, the target for DNA methylation. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). CGI have been useful for marking functionally relevant epigenetic loci for genome studies. For example, CGI are enriched in the promoters of vertebrate genes and are thought to play an important role in regulation. Currently, CGI are defined algorithmically as an observed-to-expected ratio (O/E) of CpG greater than 0.6, G+C content greater than 0.5, and usually but not necessarily greater than a certain length. Here we find that the current definition leaves out important CpG clusters associated with epigenetic marks, relevant to development and disease, and does not apply at all to nonvertebrate genomes. We propose an alternative Hidden Markov model-based approach that solves these problems. We fit our model to genomes from 30 species, and the results support a new epigenomic view toward the development of DNA methylation in species diversity and evolution. The O/E of CpG in islands and nonislands segregated closely phylogenetically and showed substantial loss in both groups in animals of greater complexity, while maintaining a nearly constant difference in CpG O/E between islands and nonisland compartments. Lists of CGI for some species are available at http://www.rafalab.org.",
author = "Irizarry, {Rafael A.} and Wu, Hao and Feinberg, {Andrew P.}",
year = "2009",
month = "10",
doi = "10.1007/s00335-009-9222-5",
language = "English (US)",
volume = "20",
pages = "674--680",
journal = "Mammalian Genome",
issn = "0938-8990",
publisher = "Springer New York",
number = "9-10",

}

TY - JOUR

T1 - A species-generalized probabilistic model-based definition of CpG islands

AU - Irizarry, Rafael A.

AU - Wu, Hao

AU - Feinberg, Andrew P

PY - 2009/10

Y1 - 2009/10

N2 - The DNA of most vertebrates is depleted in CpG dinucleotides, the target for DNA methylation. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). CGI have been useful for marking functionally relevant epigenetic loci for genome studies. For example, CGI are enriched in the promoters of vertebrate genes and are thought to play an important role in regulation. Currently, CGI are defined algorithmically as an observed-to-expected ratio (O/E) of CpG greater than 0.6, G+C content greater than 0.5, and usually but not necessarily greater than a certain length. Here we find that the current definition leaves out important CpG clusters associated with epigenetic marks, relevant to development and disease, and does not apply at all to nonvertebrate genomes. We propose an alternative Hidden Markov model-based approach that solves these problems. We fit our model to genomes from 30 species, and the results support a new epigenomic view toward the development of DNA methylation in species diversity and evolution. The O/E of CpG in islands and nonislands segregated closely phylogenetically and showed substantial loss in both groups in animals of greater complexity, while maintaining a nearly constant difference in CpG O/E between islands and nonisland compartments. Lists of CGI for some species are available at http://www.rafalab.org.

UR - http://www.scopus.com/inward/record.url?scp=72849117741&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=72849117741&partnerID=8YFLogxK

U2 - 10.1007/s00335-009-9222-5

DO - 10.1007/s00335-009-9222-5

M3 - Article

VL - 20

SP - 674

EP - 680

JO - Mammalian Genome

JF - Mammalian Genome

SN - 0938-8990

IS - 9-10

ER -