A clustering method for repeat analysis in DNA sequences

Natalia Volfovsky; Brian J. Haas; Steven L. Salzberg

doi:10.1186/gb-2001-2-8-research0027

Research output: Contribution to journal › Article › peer-review

101 Scopus citations

Abstract

Background: A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. Results: The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi-FASTA) that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. Conclusions: We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.

Original language	English (US)
Journal	Genome Biology
Volume	2
Issue number	8
DOIs	https://doi.org/10.1186/gb-2001-2-8-research0027
State	Published - Aug 2001
Externally published	Yes

Keywords

Bacterial Artificial Chromosome
Merging Procedure
Partition Point
Repeat Class
Suffix Tree

ASJC Scopus subject areas

Genetics
Ecology, Evolution, Behavior and Systematics
Cell Biology

Access to Document

10.1186/gb-2001-2-8-research0027

Cite this

@article{0c216663d8a143d6a9308a00400af3f9,
  title = "A clustering method for repeat analysis in DNA sequences",
  abstract = "Background: A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. Results: The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi-FASTA) that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. Conclusions: We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.",
  keywords = "Bacterial Artificial Chromosome, Merging Procedure, Partition Point, Repeat Class, Suffix Tree",
  author = "Volfovsky, Natalia and Haas, {Brian J.} and Salzberg, {Steven L.}",
  note = "Publisher Copyright: {\textcopyright} 2001, Volfovsky et al., licensee BioMed Central Ltd.",
  year = "2001",
  month = aug,
  doi = "10.1186/gb-2001-2-8-research0027",
  language = "English (US)",
  volume = "2",
  journal = "Genome Biology",
  issn = "1474-7596",
  publisher = "BioMed Central",
  number = "8",
}

TY  - JOUR
T1  - A clustering method for repeat analysis in DNA sequences
AU  - Volfovsky, Natalia
AU  - Haas, Brian J.
AU  - Salzberg, Steven L.
PY  - 2001/8
Y1  - 2001/8
N2  - Background: A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. Results: The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi-FASTA) that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. Conclusions: We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
AB  - Background: A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. Results: The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi-FASTA) that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. Conclusions: We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
KW  - Bacterial Artificial Chromosome
KW  - Merging Procedure
KW  - Partition Point
KW  - Repeat Class
KW  - Suffix Tree
UR  - http://www.scopus.com/inward/record.url?scp=0035230042&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0035230042&partnerID=8YFLogxK
U2  - 10.1186/gb-2001-2-8-research0027
DO  - 10.1186/gb-2001-2-8-research0027
M3  - Article
C2  - 11532211
AN  - SCOPUS:0035230042
SN  - 1474-7596
VL  - 2
JO  - Genome Biology
JF  - Genome Biology
IS  - 8
ER  - 
