Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation

Albi Celaj; Janet Markle; Jayne Danska; John Parkinson

doi:10.1186/2049-2618-2-39

Research output: Contribution to journal › Article › peer-review

30 Scopus citations

Abstract

Background: Microbiome-wide gene expression profiling through high-throughput RNA sequencing ('metatranscriptomics') offers a powerful means to functionally interrogate complex microbial communities. Key to successful exploitation of these datasets is the ability to confidently match relatively short sequence reads to known bacterial transcripts. In the absence of reference genomes, such annotation efforts may be enhanced by assembling reads into longer contiguous sequences ('contigs') prior to database searches. Since reads from homologous transcripts may derive from several species, represented at different abundance levels, it is not clear how well current assembly pipelines perform on metatranscriptomic datasets. Here we evaluate the performance of four currently employed assemblers: the de novo transcriptome assemblers Trinity and Oases; the metagenomic assembler Metavelvet; and the recently developed metatranscriptomic assembler IDBA-MT.

Results: We evaluated the performance of the assemblers on a previously published dataset of single-end RNA sequence reads derived from the large intestine of an inbred non-obese diabetic mouse model of type 1 diabetes. We found that Trinity performed best as judged by contigs assembled, reads assigned to contigs, and number of reads that could be annotated to a known bacterial transcript. Without assembly, only 15.5% of RNA sequence reads could be annotated to a known transcript, in contrast to 50.3% with Trinity assembly. Paired-end reads generated from the same mouse samples resulted in modest performance gains. A database search estimated that the assemblies are unlikely to erroneously merge multiple unrelated genes sharing a region of similarity (<2% of contigs). A simulated dataset based on ten species confirmed these findings. A more complex simulated dataset based on 72 species showed that more assembly errors were introduced than would be expected from sequencing quality. Through the detailed evaluation of assembly performance, the insights provided by this study will help drive the design of future metatranscriptomic analyses.

Conclusion: Assembly of metatranscriptome datasets greatly improved read annotation. Of the four assemblers evaluated, Trinity provided the best performance. For more complex datasets, reads generated from transcripts sharing considerable sequence similarity can be a source of significant assembly error, suggesting a need to collate reads on the basis of common taxonomic origin prior to assembly.

Original language	English (US)
Article number	39
Journal	Microbiome
Volume	2
Issue number	1
DOIs	https://doi.org/10.1186/2049-2618-2-39
State	Published - Oct 28 2014
Externally published	Yes

Keywords

Bioinformatics
Metatranscriptomics
Microbiome
RNA sequencing
Sequence assembly

ASJC Scopus subject areas

Microbiology
Microbiology (medical)

Cite this

@article{b3e0639921a7400a81c44cf1b50fd8d6,

title = "Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation",

keywords = "Bioinformatics, Metatranscriptomics, Microbiome, RNA sequencing, Sequence assembly",

author = "Albi Celaj and Janet Markle and Jayne Danska and John Parkinson",

note = "Publisher Copyright: {\textcopyright} 2014 Celaj et al.",

year = "2014",

month = oct,

day = "28",

doi = "10.1186/2049-2618-2-39",

language = "English (US)",

volume = "2",

journal = "Microbiome",

issn = "2049-2618",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation

AU - Celaj, Albi

AU - Markle, Janet

AU - Danska, Jayne

AU - Parkinson, John

PY - 2014/10/28

Y1 - 2014/10/28

AB - Background: Microbiome-wide gene expression profiling through high-throughput RNA sequencing ('metatranscriptomics') offers a powerful means to functionally interrogate complex microbial communities. Key to successful exploitation of these datasets is the ability to confidently match relatively short sequence reads to known bacterial transcripts. In the absence of reference genomes, such annotation efforts may be enhanced by assembling reads into longer contiguous sequences ('contigs'), prior to database search strategies. Since reads from homologous transcripts may derive from several species, represented at different abundance levels, it is not clear how well current assembly pipelines perform for metatranscriptomic datasets. Here we evaluate the performance of four currently employed assemblers including de novo transcriptome assemblers - Trinity and Oases; the metagenomic assembler - Metavelvet; and the recently developed metatranscriptomic assembler IDBA-MT. Results: We evaluated the performance of the assemblers on a previously published dataset of single-end RNA sequence reads derived from the large intestine of an inbred non-obese diabetic mouse model of type 1 diabetes. We found that Trinity performed best as judged by contigs assembled, reads assigned to contigs, and number of reads that could be annotated to a known bacterial transcript. Only 15.5% of RNA sequence reads could be annotated to a known transcript in contrast to 50.3% with Trinity assembly. Paired-end reads generated from the same mouse samples resulted in modest performance gains. A database search estimated that the assemblies are unlikely to erroneously merge multiple unrelated genes sharing a region of similarity (<2% of contigs). A simulated dataset based on ten species confirmed these findings. A more complex simulated dataset based on 72 species found that greater assembly errors were introduced than is expected by sequencing quality. Through the detailed evaluation of assembly performance, the insights provided by this study will help drive the design of future metatranscriptomic analyses. Conclusion: Assembly of metatranscriptome datasets greatly improved read annotation. Of the four assemblers evaluated, Trinity provided the best performance. For more complex datasets, reads generated from transcripts sharing considerable sequence similarity can be a source of significant assembly error, suggesting a need to collate reads on the basis of common taxonomic origin prior to assembly.

KW - Bioinformatics

KW - Metatranscriptomics

KW - Microbiome

KW - RNA sequencing

KW - Sequence assembly

UR - http://www.scopus.com/inward/record.url?scp=84938485320&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84938485320&partnerID=8YFLogxK

U2 - 10.1186/2049-2618-2-39

DO - 10.1186/2049-2618-2-39

M3 - Article

AN - SCOPUS:84938485320

SN - 2049-2618

VL - 2

JO - Microbiome

JF - Microbiome

IS - 1

M1 - 39

ER -
