Improving the value of public RNA-seq expression data by phenotype prediction

Shannon E. Ellis; Leonardo Collado-Torres; Andrew Jaffe; Jeffrey T. Leek

doi:10.1093/nar/gky102

Bloomberg School of Public Health

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70 000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data is available for use on a scale that was not previously feasible.

Original language	English (US)
Article number	e54
Journal	Nucleic Acids Research
Volume	46
Issue number	9
DOIs	https://doi.org/10.1093/nar/gky102
State	Published - May 18 2018

ASJC Scopus subject areas

Genetics

Access to Document

10.1093/nar/gky102

Cite this

@article{c727248ed5844f90a747d07bde7f1a91,

title = "Improving the value of public RNA-seq expression data by phenotype prediction",

abstract = "Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70 000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data is available for use on a scale that was not previously feasible.",

author = "Ellis, {Shannon E.} and Collado-Torres, Leonardo and Jaffe, Andrew and Leek, {Jeffrey T.}",

note = "Funding Information: We would like to thank Kasper Hansen (Johns Hopkins University) and Abhinav Nellore (Oregon Health and Science University) for helpful discussions and SciServer for hosting the recount2 files. SciServer is being developed at, and administered by, the Institute for Data Intensive Engineering and Science at Johns Hopkins University and is funded by the National Science Foundation Award ACI-1261715. For more information about SciServer, visit http://www.sciserver.org/. National Institutes of Health (NIH) [R01 GM105705]. Funding for open access charge: NIH [R01 GM105705]. Publisher Copyright: {\textcopyright} The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.",

year = "2018",

month = may,

day = "18",

doi = "10.1093/nar/gky102",

language = "English (US)",

volume = "46",

journal = "Nucleic Acids Research",

issn = "0305-1048",

publisher = "Oxford University Press",

number = "9",

}

TY - JOUR

T1 - Improving the value of public RNA-seq expression data by phenotype prediction

AU - Ellis, Shannon E.

AU - Collado-Torres, Leonardo

AU - Jaffe, Andrew

AU - Leek, Jeffrey T.

N1 - Funding Information: We would like to thank Kasper Hansen (Johns Hopkins University) and Abhinav Nellore (Oregon Health and Science University) for helpful discussions and SciServer for hosting the recount2 files. SciServer is being developed at, and administered by, the Institute for Data Intensive Engineering and Science at Johns Hopkins University and is funded by the National Science Foundation Award ACI-1261715. For more information about SciServer, visit http://www.sciserver.org/. National Institutes of Health (NIH) [R01 GM105705]. Funding for open access charge: NIH [R01 GM105705]. Publisher Copyright: © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.

PY - 2018/5/18

Y1 - 2018/5/18

N2 - Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70 000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data is available for use on a scale that was not previously feasible.

AB - Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70 000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data is available for use on a scale that was not previously feasible.

UR - http://www.scopus.com/inward/record.url?scp=85061812937&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85061812937&partnerID=8YFLogxK

U2 - 10.1093/nar/gky102

DO - 10.1093/nar/gky102

M3 - Article

C2 - 29514223

AN - SCOPUS:85061812937

SN - 0305-1048

VL - 46

JO - Nucleic Acids Research

JF - Nucleic Acids Research

IS - 9

M1 - e54

ER -
