Svaseq: Removing batch effects and other unwanted noise from sequencing data

Jeffrey T. Leek

doi:10.1093/nar/gku864

Bloomberg School of Public Health

Research output: Contribution to journal › Article › peer-review

193 Scopus citations

Abstract

It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference when performing high-throughput genomic analysis. We introduced surrogate variable analysis (sva) for estimating these artifacts by (i) identifying the part of the genomic data only affected by artifacts and (ii) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors to correct analyses. Here I describe a version of the sva approach specifically created for count data or FPKMs from sequencing experiments based on appropriate data transformation. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. I present a comparison between these versions of sva and other methods for batch effect estimation on simulated data, real count-based data and FPKM-based data. These updates are available through the sva Bioconductor package and I have made fully reproducible analysis using these methods available from: https://github.com/jtleek/svaseq.
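The two-step idea in the abstract (isolate the part of the data driven by artifacts, then summarize it with singular vectors) can be illustrated with a small, self-contained sketch. This is not the algorithm implemented in the sva package: it assumes a log(count + 1) transform, skips the step that identifies which genes are affected only by artifacts, and instead removes the known group effect from every gene before taking the leading singular vector by power iteration. The simulated data, the function names, and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residualize(row, group):
    """Subtract each group's mean, removing the known group effect."""
    means = {}
    for g in set(group):
        vals = [v for v, gi in zip(row, group) if gi == g]
        means[g] = sum(vals) / len(vals)
    return [v - means[gi] for v, gi in zip(row, group)]


def top_right_singular_vector(matrix, iterations=100):
    """Power iteration on M^T M; returns a unit vector over samples."""
    n = len(matrix[0])
    v = [1.0 / (j + 1) for j in range(n)]  # fixed, non-degenerate start
    for _ in range(iterations):
        # u = M v (one value per gene)
        u = [sum(r[j] * v[j] for j in range(n)) for r in matrix]
        # w = M^T u (one value per sample)
        w = [sum(r[j] * ui for r, ui in zip(matrix, u)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def estimate_surrogate_variable(seed=0, n_genes=300):
    """Simulate counts with a hidden batch and estimate one surrogate variable."""
    rng = random.Random(seed)
    group = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]  # known biological variable
    batch = [0] * 6 + [1] * 6                      # hidden artifact (assumed)
    residuals = []
    for g in range(n_genes):
        base = rng.uniform(3.0, 6.0)  # baseline log mean for this gene
        group_beta = rng.gauss(0, 1) if g < 50 else 0.0
        batch_beta = rng.gauss(0, 1) if 100 <= g < 200 else 0.0
        row = []
        for s in range(len(group)):
            mean = math.exp(base + group_beta * group[s] + batch_beta * batch[s])
            # Gaussian approximation to count noise, truncated at zero
            count = max(0, round(rng.gauss(mean, math.sqrt(mean))))
            row.append(math.log(count + 1))  # assumed transform
        residuals.append(residualize(row, group))
    return top_right_singular_vector(residuals), batch


if __name__ == "__main__":
    sv, batch = estimate_surrogate_variable()
    print("abs correlation with hidden batch:", abs(pearson(sv, batch)))
```

In this toy setting the estimated vector should track the hidden batch indicator closely, which is what makes it usable as an adjustment factor in a downstream model. For real analyses, the sva Bioconductor package (R) provides the `svaseq` function described in the article.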

Original language	English (US)
Pages (from-to)	e161
Journal	Nucleic Acids Research
Volume	42
Issue number	21
DOIs	https://doi.org/10.1093/nar/gku864
State	Published - Dec 1 2014

ASJC Scopus subject areas

Genetics

Cite this

@article{6507ff3c66af46b2abd63c8d6dcb8f94,

title = "Svaseq: Removing batch effects and other unwanted noise from sequencing data",

abstract = "It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference when performing high-throughput genomic analysis. We introduced surrogate variable analysis (sva) for estimating these artifacts by (i) identifying the part of the genomic data only affected by artifacts and (ii) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors to correct analyses. Here I describe a version of the sva approach specifically created for count data or FPKMs from sequencing experiments based on appropriate data transformation. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. I present a comparison between these versions of sva and other methods for batch effect estimation on simulated data, real count-based data and FPKM-based data. These updates are available through the sva Bioconductor package and I have made fully reproducible analysis using these methods available from: https://github.com/jtleek/svaseq.",

author = "Leek, {Jeffrey T.}",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2014.",

year = "2014",

month = dec,

day = "1",

doi = "10.1093/nar/gku864",

language = "English (US)",

volume = "42",

pages = "e161",

journal = "Nucleic Acids Research",

issn = "0305-1048",

publisher = "Oxford University Press",

number = "21",

}

TY - JOUR

T1 - Svaseq

T2 - Removing batch effects and other unwanted noise from sequencing data

AU - Leek, Jeffrey T.

N1 - Publisher Copyright: © The Author(s) 2014.

PY - 2014/12/1

Y1 - 2014/12/1

AB - It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference when performing high-throughput genomic analysis. We introduced surrogate variable analysis (sva) for estimating these artifacts by (i) identifying the part of the genomic data only affected by artifacts and (ii) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors to correct analyses. Here I describe a version of the sva approach specifically created for count data or FPKMs from sequencing experiments based on appropriate data transformation. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. I present a comparison between these versions of sva and other methods for batch effect estimation on simulated data, real count-based data and FPKM-based data. These updates are available through the sva Bioconductor package and I have made fully reproducible analysis using these methods available from: https://github.com/jtleek/svaseq.

UR - http://www.scopus.com/inward/record.url?scp=84925226706&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84925226706&partnerID=8YFLogxK

U2 - 10.1093/nar/gku864

DO - 10.1093/nar/gku864

M3 - Article

C2 - 25294822

AN - SCOPUS:84925226706

SN - 0305-1048

VL - 42

SP - e161

JO - Nucleic Acids Research

JF - Nucleic Acids Research

IS - 21

ER -
