TY - JOUR
T1 - Ranking genomic features using an information-theoretic measure of epigenetic discordance
AU - Jenkinson, Garrett
AU - Abante, Jordi
AU - Koldobskiy, Michael A.
AU - Feinberg, Andrew P.
AU - Goutsias, John
N1 - Funding Information:
This work was supported by NIH Grants CA054358, HG008529 and NSF Grant CCF-1656201. MAK is a St. Baldrick’s Foundation Fellow and a Damon Runyon-Sohn Pediatric Cancer Fellow supported by the Damon Runyon Cancer Research Foundation (DRSG-15P-16). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Publisher Copyright:
© 2019 The Author(s).
PY - 2019/4/8
Y1 - 2019/4/8
N2 - Background: Establishment and maintenance of DNA methylation throughout the genome is an important epigenetic mechanism that regulates gene expression whose disruption has been implicated in human diseases like cancer. It is therefore crucial to know which genes, or other genomic features of interest, exhibit significant discordance in DNA methylation between two phenotypes. We have previously proposed an approach for ranking genes based on methylation discordance within their promoter regions, determined by centering a window of fixed size at their transcription start sites. However, we cannot use this method to identify statistically significant genomic features and handle features of variable length and with missing data. Results: We present a new approach for computing the statistical significance of methylation discordance within genomic features of interest in single and multiple test/reference studies. We base the proposed method on a well-articulated hypothesis testing problem that produces p- and q-values for each genomic feature, which we then use to identify and rank features based on the statistical significance of their epigenetic dysregulation. We employ the information-theoretic concept of mutual information to derive a novel test statistic, which we can evaluate by computing Jensen-Shannon distances between the probability distributions of methylation in a test and a reference sample. We design the proposed methodology to simultaneously handle biological, statistical, and technical variability in the data, as well as variable feature lengths and missing data, thus enabling its wide-spread use on any list of genomic features. This is accomplished by estimating, from reference data, the null distribution of the test statistic as a function of feature length using generalized additive regression models. Differential assessment, using normal/cancer data from healthy fetal tissue and pediatric high-grade glioma patients, illustrates the potential of our approach to greatly facilitate the exploratory phases of clinically and biologically relevant methylation studies. Conclusions: The proposed approach provides the first computational tool for statistically testing and ranking genomic features of interest based on observed DNA methylation discordance in comparative studies that accounts, in a rigorous manner, for biological, statistical, and technical variability in methylation data, as well as for variability in feature length and for missing data.
AB - Background: Establishment and maintenance of DNA methylation throughout the genome is an important epigenetic mechanism that regulates gene expression whose disruption has been implicated in human diseases like cancer. It is therefore crucial to know which genes, or other genomic features of interest, exhibit significant discordance in DNA methylation between two phenotypes. We have previously proposed an approach for ranking genes based on methylation discordance within their promoter regions, determined by centering a window of fixed size at their transcription start sites. However, we cannot use this method to identify statistically significant genomic features and handle features of variable length and with missing data. Results: We present a new approach for computing the statistical significance of methylation discordance within genomic features of interest in single and multiple test/reference studies. We base the proposed method on a well-articulated hypothesis testing problem that produces p- and q-values for each genomic feature, which we then use to identify and rank features based on the statistical significance of their epigenetic dysregulation. We employ the information-theoretic concept of mutual information to derive a novel test statistic, which we can evaluate by computing Jensen-Shannon distances between the probability distributions of methylation in a test and a reference sample. We design the proposed methodology to simultaneously handle biological, statistical, and technical variability in the data, as well as variable feature lengths and missing data, thus enabling its wide-spread use on any list of genomic features. This is accomplished by estimating, from reference data, the null distribution of the test statistic as a function of feature length using generalized additive regression models. Differential assessment, using normal/cancer data from healthy fetal tissue and pediatric high-grade glioma patients, illustrates the potential of our approach to greatly facilitate the exploratory phases of clinically and biologically relevant methylation studies. Conclusions: The proposed approach provides the first computational tool for statistically testing and ranking genomic features of interest based on observed DNA methylation discordance in comparative studies that accounts, in a rigorous manner, for biological, statistical, and technical variability in methylation data, as well as for variability in feature length and for missing data.
KW - DNA methylation
KW - Gene ranking
KW - Genomic feature analysis
KW - Information theory
KW - Methylation analysis
KW - Mutual Information
KW - WGBS data analysis
UR - http://www.scopus.com/inward/record.url?scp=85064115968&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064115968&partnerID=8YFLogxK
U2 - 10.1186/s12859-019-2777-6
DO - 10.1186/s12859-019-2777-6
M3 - Article
C2 - 30961526
AN - SCOPUS:85064115968
SN - 1471-2105
VL - 20
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
M1 - 175
ER -