Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis

Andrew E. Jaffe; Thomas Hyde; Joel Kleinman; Daniel R. Weinberger; Joshua G. Chenoweth; Ronald D. McKay; Jeffrey T. Leek; Carlo Colantuoni

doi:10.1186/s12859-015-0808-5

Research output: Contribution to journal › Article › peer-review

26 Scopus citations

Abstract

Background: Genomic data production is at its highest level and continues to increase, making available novel primary data and existing public data to researchers for exploration. Here we explore the consequences of "batch" correction for biological discovery in two publicly available expression datasets. We consider this to include the estimation of and adjustment for widespread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether it be technical or biological in nature.

Methods: We present three illustrative data analyses using surrogate variable analysis (SVA) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects in a dataset profiling differentiating pluripotent cells (GSE32923) and another from human brain tissue (GSE30272).

Results: Careful specification of biological effects of interest is very important to factor-based approaches like SVA. We demonstrate greatly sharpened global and gene-specific differential expression across treatment groups in stem cell systems. Similarly, we demonstrate how to preserve major non-linear effects of age across the lifespan in the brain dataset. However, the gains in precisely defining known effects of interest come at the cost of much other information in the "cleaned" data, including sex, common copy number effects and sample or cell line-specific molecular behavior.

Conclusions: Our analyses indicate that data "cleaning" can be an important component of high-throughput genomic data analysis when interrogating explicitly defined effects in the context of data affected by robust technical artifacts. However, caution should be exercised to avoid removing biological signal of interest. It is also important to note that open data exploration is not possible after such supervised "cleaning", because effects beyond those stipulated by the researcher may have been removed. With the goal of making these statistical algorithms more powerful and transparent to researchers in the biological sciences, we provide exploratory plots and accompanying R code for identifying and guiding the "cleaning" process (https://github.com/andrewejaffe/StemCellSVA). The impact of these methods is significant enough that we have made newly processed data available for the brain data set at http://braincloud.jhmi.edu/plots/ and GSE30272.
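
The trade-off described in the abstract, protecting an explicitly specified effect while removing other systematic heterogeneity, can be illustrated with a small simulation. The sketch below is not the SVA algorithm and not the authors' R code. It is a simplified stand-in that removes the protected group effect, takes the leading factor of the residuals, and regresses that factor out. The data are simulated, and every name in it (within_group_residuals, top_factor, clean, the effect sizes) is hypothetical.

```python
import random


def within_group_residuals(row, groups):
    """Subtract each group's mean, removing the protected group effect."""
    means = {}
    for g in set(groups):
        vals = [x for x, gi in zip(row, groups) if gi == g]
        means[g] = sum(vals) / len(vals)
    return [x - means[gi] for x, gi in zip(row, groups)]


def top_factor(resid, n_iter=100):
    """Leading right singular vector of the residual matrix (power iteration)."""
    # Start from the residual row (gene) with the largest sum of squares.
    v = list(max(resid, key=lambda r: sum(x * x for x in r)))
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(r, v)) for r in resid]  # R v
        v = [sum(s * r[j] for s, r in zip(scores, resid))           # R^T (R v)
             for j in range(len(v))]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v


def clean(expr, groups):
    """Estimate one latent factor from residuals and remove it from each gene."""
    resid = [within_group_residuals(row, groups) for row in expr]
    f = top_factor(resid)
    cleaned = []
    for row in expr:
        # f has unit norm and is centered within groups, so this dot product
        # is the factor's regression coefficient in a model that also
        # contains the group term.
        coef = sum(y * fj for y, fj in zip(row, f))
        cleaned.append([y - coef * fj for y, fj in zip(row, f)])
    return cleaned, f


def group_diff(row, groups):
    """Difference in mean expression between group 1 and group 0."""
    a = [x for x, g in zip(row, groups) if g == 1]
    b = [x for x, g in zip(row, groups) if g == 0]
    return sum(a) / len(a) - sum(b) / len(b)


def within_group_ss(row, groups):
    """Sum of squared deviations from the group means."""
    return sum(x * x for x in within_group_residuals(row, groups))


# Simulated data (assumption): 6 genes, 8 samples, one protected group
# effect and one hidden nuisance pattern that the analyst did not specify.
rng = random.Random(0)
groups = [0, 0, 0, 0, 1, 1, 1, 1]
hidden = [0, 1, 0, 1, 0, 1, 0, 1]
group_effect = [1.0, 0.0, -1.0, 0.5, 0.0, 2.0]
hidden_effect = [2.0, -1.5, 1.0, 2.5, -2.0, 1.5]
expr = [[5.0 + ge * g + he * h + rng.gauss(0, 0.1)
         for g, h in zip(groups, hidden)]
        for ge, he in zip(group_effect, hidden_effect)]

cleaned, factor = clean(expr, groups)

for i, (raw, cl) in enumerate(zip(expr, cleaned)):
    print(i,
          round(group_diff(raw, groups), 3), round(group_diff(cl, groups), 3),
          round(within_group_ss(raw, groups), 3),
          round(within_group_ss(cl, groups), 3))
```

Each printed row shows, for one gene, the group difference before and after cleaning and the within-group sum of squares before and after. The group difference is preserved because the factor is estimated only from what remains after the group effect is removed. The procedure has no way to tell whether the hidden pattern is a technical batch or an unprotected biological variable such as sex, so either would be removed in the same way, which is the caution the abstract raises about open exploration of "cleaned" data.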

Original language	English (US)
Article number	372
Journal	BMC Bioinformatics
Volume	16
Issue number	1
DOIs	https://doi.org/10.1186/s12859-015-0808-5
State	Published - Nov 6 2015

Keywords

Batch correction
Gene expression
Surrogate variable analysis

ASJC Scopus subject areas

Structural Biology
Biochemistry
Molecular Biology
Computer Science Applications
Applied Mathematics

Access to Document

10.1186/s12859-015-0808-5

Cite this

@article{afacfaad9cf24561be9cde8b50167588,

title = "Practical impacts of genomic data {"}cleaning{"} on biological discovery using surrogate variable analysis",

keywords = "Batch correction, Gene expression, Surrogate variable analysis",

author = "Jaffe, {Andrew E.} and Thomas Hyde and Joel Kleinman and Weinberger, {Daniel R.} and Chenoweth, {Joshua G.} and McKay, {Ronald D.} and Leek, {Jeffrey T.} and Carlo Colantuoni",

note = "Publisher Copyright: {\textcopyright} 2015 Jaffe et al.",

year = "2015",

month = nov,

day = "6",

doi = "10.1186/s12859-015-0808-5",

language = "English (US)",

volume = "16",

journal = "BMC Bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis

AU - Jaffe, Andrew E.

AU - Hyde, Thomas

AU - Kleinman, Joel

AU - Weinberger, Daniel R.

AU - Chenoweth, Joshua G.

AU - McKay, Ronald D.

AU - Leek, Jeffrey T.

AU - Colantuoni, Carlo

PY - 2015/11/6

Y1 - 2015/11/6

KW - Batch correction

KW - Gene expression

KW - Surrogate variable analysis

UR - http://www.scopus.com/inward/record.url?scp=84946411824&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84946411824&partnerID=8YFLogxK

U2 - 10.1186/s12859-015-0808-5

DO - 10.1186/s12859-015-0808-5

M3 - Article

C2 - 26545828

AN - SCOPUS:84946411824

SN - 1471-2105

VL - 16

JO - BMC Bioinformatics

JF - BMC Bioinformatics

IS - 1

M1 - 372

ER -
