TY - JOUR
T1 - Universal seeds for cDNA-to-genome comparison
AU - Zhou, Leming
AU - Stanton, Jonathan
AU - Florea, Liliana
N1 - Funding Information:
Sensitivity calculations were performed on the "Herd" Scientific Computing Cluster at The George Washington University (NSF grant CLS20163A to JS). This work was supported in part by a Sloan Research Fellowship to LF.
PY - 2008/1/23
Y1 - 2008/1/23
N2 - Background: To meet the needs of gene annotation for newly sequenced organisms, optimized spaced seeds can be implemented into cross-species sequence alignment programs to accurately align gene sequences to the genome of a related species. So far, seed performance has been tested for comparisons between closely related species, such as human and mouse, or on simulated data. As the number and variety of genomes increase, it becomes desirable to identify a small set of universal seeds that perform optimally or near-optimally on a large range of comparisons. Results: Using statistical regression methods, we investigate the sensitivity of seeds, in particular good seeds, between four cDNA-to-genome comparisons at different evolutionary distances (human-dog, human-mouse, human-chicken and human-zebrafish), and identify classes of comparisons that show similar seed behavior and therefore can employ the same seed. In addition, we find that, with high confidence, good seeds for more distant comparisons perform well on closer comparisons, within 98-99% of the optimal seeds, and thus represent universal good seeds. Conclusion: We show for the first time that optimal and near-optimal seeds for distant species-to-species comparisons are more generally applicable to a wide range of comparisons. This finding will be instrumental in developing practical and user-friendly cDNA-to-genome alignment applications, to aid in the annotation of new model organisms.
AB - Background: To meet the needs of gene annotation for newly sequenced organisms, optimized spaced seeds can be implemented into cross-species sequence alignment programs to accurately align gene sequences to the genome of a related species. So far, seed performance has been tested for comparisons between closely related species, such as human and mouse, or on simulated data. As the number and variety of genomes increase, it becomes desirable to identify a small set of universal seeds that perform optimally or near-optimally on a large range of comparisons. Results: Using statistical regression methods, we investigate the sensitivity of seeds, in particular good seeds, between four cDNA-to-genome comparisons at different evolutionary distances (human-dog, human-mouse, human-chicken and human-zebrafish), and identify classes of comparisons that show similar seed behavior and therefore can employ the same seed. In addition, we find that, with high confidence, good seeds for more distant comparisons perform well on closer comparisons, within 98-99% of the optimal seeds, and thus represent universal good seeds. Conclusion: We show for the first time that optimal and near-optimal seeds for distant species-to-species comparisons are more generally applicable to a wide range of comparisons. This finding will be instrumental in developing practical and user-friendly cDNA-to-genome alignment applications, to aid in the annotation of new model organisms.
UR - http://www.scopus.com/inward/record.url?scp=41049093450&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=41049093450&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-9-36
DO - 10.1186/1471-2105-9-36
M3 - Article
C2 - 18215286
AN - SCOPUS:41049093450
SN - 1471-2105
VL - 9
JO - BMC Bioinformatics
JF - BMC Bioinformatics
M1 - 36
ER -