Identifying overrepresented concepts in gene lists from literature

A statistical approach based on Poisson mixture model

Xin He, Moushumi S. Sarma, Xu Ling, Brant Chee, Chengxiang Zhai, Bruce Schatz

Research output: Contribution to journal › Article

Abstract

Background: Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene lists. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular, Gene Ontology (GO). However, the annotation of genes is a labor-intensive process; and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered.

Results: We propose a statistical method that uses the primary literature, i.e. free-text, as the source to perform overrepresentation analysis. The method is based on a statistical framework of mixture model and addresses the methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment and added features that facilitate the interactive analysis of gene sets. Through experimentation with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results.

Conclusions: We conclude that the current work will provide biologists with a tool that effectively complements the existing ones for overrepresentation analysis from genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp.
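The abstract names a Poisson mixture model but does not spell out its form, so the following is only a generic sketch of the idea, not the paper's actual formulation. It assumes that the count of one term in each document linked to a gene list is drawn either from a background Poisson (rate fixed from the whole collection) or from a list-specific Poisson with a higher rate, and fits the mixing weight by EM. The function names, parameters, and toy counts are all hypothetical and chosen purely for illustration.

```python
import math


def poisson_logpmf(k, mu):
    """Log of the Poisson probability of observing count k with mean mu > 0."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def _component_logs(x, n, pi, fg_rate, bg_rate):
    """Weighted log-probabilities of the foreground and background components."""
    log_fg = math.log(pi) + poisson_logpmf(x, n * fg_rate)
    log_bg = math.log(1.0 - pi) + poisson_logpmf(x, n * bg_rate)
    return log_fg, log_bg


def fit_mixture(counts, lengths, bg_rate, n_iter=100):
    """EM for a two-component Poisson mixture with a fixed background rate.

    counts[i]  : occurrences of the term in document i (assumed input)
    lengths[i] : token length of document i (assumed input)
    bg_rate    : per-token rate of the term in the background collection
    Returns (pi, fg_rate): mixing weight and rate of the list-specific component.
    """
    eps = 1e-9
    pi = 0.5
    fg_rate = max(2.0 * bg_rate, sum(counts) / sum(lengths))
    for _ in range(n_iter):
        # E-step: responsibility of the list-specific component per document.
        resp = []
        for x, n in zip(counts, lengths):
            log_fg, log_bg = _component_logs(x, n, pi, fg_rate, bg_rate)
            m = max(log_fg, log_bg)
            w_fg = math.exp(log_fg - m)
            w_bg = math.exp(log_bg - m)
            resp.append(w_fg / (w_fg + w_bg))
        # M-step: update the mixing weight and the list-specific rate.
        pi = min(max(sum(resp) / len(resp), eps), 1.0 - eps)
        weighted_len = sum(r * n for r, n in zip(resp, lengths))
        if weighted_len > 0:
            weighted_count = sum(r * x for r, x in zip(resp, counts))
            # Constrain the list-specific rate to be at least the background rate.
            fg_rate = max(weighted_count / weighted_len, bg_rate)
    return pi, fg_rate


def log_likelihood_ratio(counts, lengths, bg_rate, pi, fg_rate):
    """Log-likelihood of the fitted mixture minus that of background only."""
    llr = 0.0
    for x, n in zip(counts, lengths):
        log_fg, log_bg = _component_logs(x, n, pi, fg_rate, bg_rate)
        m = max(log_fg, log_bg)
        log_mix = m + math.log(math.exp(log_fg - m) + math.exp(log_bg - m))
        llr += log_mix - poisson_logpmf(x, n * bg_rate)
    return llr


# Toy data (made up): eight documents of 1000 tokens, three rich in the term.
counts = [0, 1, 0, 12, 15, 0, 11, 1]
lengths = [1000] * 8
bg_rate = 0.001  # assumed background rate: one occurrence per 1000 tokens

pi, fg_rate = fit_mixture(counts, lengths, bg_rate)
llr = log_likelihood_ratio(counts, lengths, bg_rate, pi, fg_rate)
print(pi, fg_rate, llr)
```

In a sketch like this, a large log-likelihood ratio would flag the term as a candidate overrepresented concept. Unlike a test on presence or absence alone, the fitted mixing weight also gives a rough estimate of the fraction of documents that carry the elevated rate.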

Original language: English (US)
Article number: 272
Journal: BMC Bioinformatics
Volume: 11
DOI: 10.1186/1471-2105-11-272
State: Published - May 20 2010
Externally published: Yes

Fingerprint

Poisson Mixture
Poisson Model
Mixture Model
Genes
Gene
Molecular Sequence Annotation
Gene Ontology
Controlled Vocabulary
Genomics
Annotation
Vocabulary
Ontology
Concepts
Thesauri
Experimentation
Statistical method
Mining
Sharing
Complement
Statistical methods

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Identifying overrepresented concepts in gene lists from literature: A statistical approach based on Poisson mixture model. / He, Xin; Sarma, Moushumi S.; Ling, Xu; Chee, Brant; Zhai, Chengxiang; Schatz, Bruce. In: BMC Bioinformatics, Vol. 11, 272, 20.05.2010.

@article{9f774a25970949aeaf26a45a8be0e435,
title = "Identifying overrepresented concepts in gene lists from literature: A statistical approach based on Poisson mixture model",
author = "Xin He and Sarma, {Moushumi S.} and Xu Ling and Brant Chee and Chengxiang Zhai and Bruce Schatz",
year = "2010",
month = "5",
day = "20",
doi = "10.1186/1471-2105-11-272",
language = "English (US)",
volume = "11",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Identifying overrepresented concepts in gene lists from literature

T2 - A statistical approach based on Poisson mixture model

AU - He, Xin

AU - Sarma, Moushumi S.

AU - Ling, Xu

AU - Chee, Brant

AU - Zhai, Chengxiang

AU - Schatz, Bruce

PY - 2010/5/20

Y1 - 2010/5/20

UR - http://www.scopus.com/inward/record.url?scp=77953993713&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77953993713&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-11-272

DO - 10.1186/1471-2105-11-272

M3 - Article

VL - 11

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 272

ER -