Integration Analysis of Three Omics Data Using Penalized Regression Methods: An Application to Bladder Cancer

Silvia Pineda, Francisco X. Real, Manolis Kogevinas, Alfredo Carrato, Stephen J. Chanock, Núria Malats, Kristel Van Steen

Research output: Contribution to journal (Article)

Abstract

Omics data integration is becoming necessary to investigate the genomic mechanisms involved in complex diseases. The integration process raises many challenges, such as data heterogeneity, a number of individuals that is small relative to the number of parameters, multicollinearity, and difficulty in interpreting and validating results owing to their complexity and to limited knowledge of the underlying biological processes. To overcome some of these issues, innovative statistical approaches are being developed. In this work, we propose a permutation-based method that simultaneously assesses significance and corrects for multiple testing with the MaxT algorithm. The method was applied with penalized regression methods (LASSO and elastic net, ENET) to explore relationships between common genetic variants, DNA methylation, and gene expression measured in bladder tumor samples. The overall analysis flow consisted of three steps: (1) for each gene expression probe, SNPs and CpGs were selected within a 1 Mb window upstream and downstream of the gene; (2) LASSO and ENET were applied to assess the association between each expression probe and the selected SNPs/CpGs in three multivariable models (SNP, CPG, and Global models, the latter integrating SNPs and CpGs); and (3) the significance of each model was assessed using the permutation-based MaxT method. We identified 48 genes whose expression levels were significantly associated with both SNPs and CpGs. Importantly, 36 (75%) of them were replicated in an independent data set (TCGA), and the performance of the proposed method was checked with a simulation study. We further support our results with a biological interpretation based on an enrichment analysis. The proposed approach reduces computational time and is flexible and easy to implement when analyzing several types of omics data. Our results highlight the importance of integrating omics data by applying appropriate statistical strategies to discover new insights into the complex genetic mechanisms involved in disease conditions.
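Steps (2) and (3) of the analysis flow can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not state the model statistic, the penalty tuning, or the number of permutations, so the sketch assumes a fixed LASSO penalty (`lam=0.1`), uses the share of explained variance as a stand-in statistic, and runs 99 permutations on simulated toy data. The function names (`lasso_residuals`, `explained_variance`, `maxt_adjusted_pvalues`) are hypothetical. The single-step maxT idea is that each gene's observed statistic is compared with the permutation distribution of the maximum statistic over all genes, which gives a p-value that is already adjusted for multiple testing.

```python
import random


def center(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def lasso_residuals(columns, y, lam, n_iter=50, tol=1e-8):
    """Fit a LASSO by coordinate descent and return the residuals.

    Minimises (1 / 2n) * ||y - X b||^2 + lam * ||b||_1, where `columns`
    holds the centred predictor columns of X and `y` is centred.
    """
    n = len(y)
    beta = [0.0] * len(columns)
    resid = list(y)  # equals y - X b because b starts at zero
    sums_sq = [sum(v * v for v in col) for col in columns]
    for _ in range(n_iter):
        largest_change = 0.0
        for j, col in enumerate(columns):
            if sums_sq[j] == 0.0:
                continue  # a constant predictor carries no information
            rho = sum(c * r for c, r in zip(col, resid)) + sums_sq[j] * beta[j]
            shrunk = max(abs(rho) - n * lam, 0.0) / sums_sq[j]  # soft threshold
            new_beta = shrunk if rho > 0 else -shrunk
            delta = new_beta - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new_beta
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return resid


def explained_variance(columns, y, lam):
    """Stand-in model statistic (an assumption): variance share explained."""
    yc = center(y)
    total = sum(v * v for v in yc)
    if total == 0.0:
        return 0.0
    resid = lasso_residuals([center(col) for col in columns], yc, lam)
    return 1.0 - sum(r * r for r in resid) / total


def maxt_adjusted_pvalues(models, lam=0.1, n_perm=99, seed=0):
    """Single-step maxT adjusted p-values.

    `models` maps a gene name to (predictor columns, expression values);
    all genes are assumed to share the same sample order.
    """
    rng = random.Random(seed)
    observed = {g: explained_variance(cols, y, lam) for g, (cols, y) in models.items()}
    n = len(next(iter(models.values()))[1])
    exceed = dict.fromkeys(models, 0)
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)  # the same sample shuffle is applied to every gene
        max_stat = max(
            explained_variance(cols, [y[i] for i in order], lam)
            for cols, y in models.values()
        )
        for g in models:
            if max_stat >= observed[g]:
                exceed[g] += 1
    # The add-one correction keeps every p-value strictly positive.
    return {g: (exceed[g] + 1) / (n_perm + 1) for g in models}


# Toy data (simulated, not from the study): geneA depends on its first
# predictor, geneB is pure noise.
rng = random.Random(1)
n_samples, n_predictors = 30, 4


def random_columns():
    return [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_predictors)]


cols_a, cols_b = random_columns(), random_columns()
models = {
    "geneA": (cols_a, [3.0 * x + rng.gauss(0, 1) for x in cols_a[0]]),
    "geneB": (cols_b, [rng.gauss(0, 1) for _ in range(n_samples)]),
}
adjusted = maxt_adjusted_pvalues(models)
print(adjusted)
```

Permuting the expression values with one shared shuffle per permutation breaks the link to the predictors while keeping the correlation among genes, which is what lets the maximum statistic control the family-wise error. In a real analysis the penalty would normally be tuned (for example by cross-validation) and the predictors for each gene would come from the 1 Mb window selection in step (1).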

Original language: English (US)
Article number: e1005689
Journal: PLoS Genetics
Volume: 11
Issue number: 12
DOI: https://doi.org/10.1371/journal.pgen.1005689
State: Published - 2015
Externally published: Yes

Fingerprint

application methods
Urinary Bladder Neoplasms
Single Nucleotide Polymorphism
cancer
probes (equipment)
gene expression
probe
gene
methylation
DNA methylation
Biological Phenomena
Gene Expression
methodology
tumor
biological processes
DNA Methylation
genomics
genes
Genes
DNA

ASJC Scopus subject areas

  • Genetics
  • Molecular Biology
  • Ecology, Evolution, Behavior and Systematics
  • Cancer Research
  • Genetics (clinical)

Cite this

Pineda, S., Real, F. X., Kogevinas, M., Carrato, A., Chanock, S. J., Malats, N., & Van Steen, K. (2015). Integration Analysis of Three Omics Data Using Penalized Regression Methods: An Application to Bladder Cancer. PLoS Genetics, 11(12), Article e1005689. https://doi.org/10.1371/journal.pgen.1005689

@article{b53f08acb5fa4cb4a536b118b6a4e384,
title = "Integration Analysis of Three Omics Data Using Penalized Regression Methods: An Application to Bladder Cancer",
author = "Silvia Pineda and Real, {Francisco X.} and Manolis Kogevinas and Alfredo Carrato and Chanock, {Stephen J.} and N{\'u}ria Malats and {Van Steen}, Kristel",
year = "2015",
doi = "10.1371/journal.pgen.1005689",
language = "English (US)",
volume = "11",
journal = "PLoS Genetics",
issn = "1553-7390",
publisher = "Public Library of Science",
number = "12",

}

TY - JOUR
T1 - Integration Analysis of Three Omics Data Using Penalized Regression Methods
T2 - An Application to Bladder Cancer
AU - Pineda, Silvia
AU - Real, Francisco X.
AU - Kogevinas, Manolis
AU - Carrato, Alfredo
AU - Chanock, Stephen J.
AU - Malats, Núria
AU - Van Steen, Kristel
PY - 2015
Y1 - 2015
UR - http://www.scopus.com/inward/record.url?scp=84953313516&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84953313516&partnerID=8YFLogxK
U2 - 10.1371/journal.pgen.1005689
DO - 10.1371/journal.pgen.1005689
M3 - Article
C2 - 26646822
AN - SCOPUS:84953313516
VL - 11
JO - PLoS Genetics
JF - PLoS Genetics
SN - 1553-7390
IS - 12
M1 - e1005689
ER -