Development of dbOGAP

A bioinformatics resource of O-GlcNAcylated proteins and site prediction

Zhang Zhi Hu, Manabu Torii, Jinlian Wang, Hongfang Liu, Gerald Warren Hart

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Protein glycosylation is one of the most common posttranslational modifications (PTMs) and occurs in several types. O-GlcNAcylation is an O-linked glycosylation in which β-N-acetylglucosamine (GlcNAc) is attached to Ser/Thr residues by O-GlcNAc transferase (OGT) and removed by O-GlcNAcase. Unlike mucin-type O-glycosylation, O-GlcNAcylation occurs primarily in nucleocytoplasmic proteins and the monosaccharide is not further extended. Moreover, O-GlcNAcylation is dynamic and often reciprocal to phosphorylation at the same or adjacent Ser/Thr residues. Growing evidence suggests that O-GlcNAcylation is very common and has broad roles in physiology and disease, especially through its interplay with phosphorylation, e.g., in the regulation of insulin signaling and transcription, and in diabetes and neurodegenerative diseases. In contrast to the enormous body of research on the cellular roles of phosphorylation, the amount of research on O-GlcNAcylation has been disproportionately small, and annotation of O-GlcNAcylated sites in protein databases is currently scarce. An O-GlcNAcylation site prediction program was developed in 2002, but it was based on a small data set of the 40 O-GlcNAcylation sites known at that time (http://www.cbs.dtu.dk/services/YinOYang/). Here we seek to develop a database of O-GlcNAcylated proteins and sites, named dbOGAP, together with an O-GlcNAcylation site prediction system based on the known sites in dbOGAP, to facilitate annotation and proteomic identification of O-GlcNAcylation sites. We developed dbOGAP based primarily on O-GlcNAcylated proteins and sites published in peer-reviewed articles dating back to 1984, when the modification was first described. Most of these proteins were mapped to UniProtKB protein IDs, except for some that could not be unambiguously mapped. The database currently contains 540 protein entries with experimental O-GlcNAcylation information, and 338 O-GlcNAc sites for 164 proteins.
About 59% of these proteins are human; other organisms include rat, mouse, fly and African frog. Among the 164 proteins with known O-GlcNAcylation sites, 122 have both phosphorylation sites (1634 in total) and O-GlcNAc sites (263 in total). Gene Ontology (GO) profiling showed that the known O-GlcNAcylated proteins have a broad range of functions, including developmental processes, transcriptional regulation, cell signaling, metabolic regulation, and cellular transport and trafficking. The GO profile also showed that O-GlcNAcylated proteins are primarily nuclear and cytoplasmic, including membrane-associated intracellular proteins. The database is also populated with additional protein sequences orthologous to known O-GlcNAcylated proteins. Additional functional data, including other PTM features, biological pathways and disease information, have been integrated into the database. We developed an O-GlcNAcylation site prediction program using a Support Vector Machine (SVM). As positive instances, sequence fragments surrounding 322 O-GlcNAcylated Ser/Thr sites were extracted from 157 proteins in dbOGAP, and over 28 thousand sequence fragments surrounding the remaining Ser/Thr sites in those proteins were treated as negative instances. Two thirds of this data set were randomly selected as development data and used for tuning the parameters of the SVM classifiers, while the rest was set apart as a held-out test data set. To reduce the impact of imbalanced data on the performance of the trained classifiers, we explored different ratios of positive to negative instances in the training data, controlled by under-sampling the negative instances. The optimal parameters of the prediction system were sought in five-fold cross-validation tests conducted on the development data set, and the final classifier, trained on the entire development data set, was evaluated on the held-out test data set.
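The evaluation protocol described above (a random two-thirds development split, under-sampling of negatives, and five-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration only, not the authors' code: the function names, the `neg_per_pos` parameter, the fixed seed, and the choice to split positives and negatives separately are all assumptions, since the abstract does not give these details.

```python
import random


def split_and_undersample(positives, negatives, neg_per_pos=1.0,
                          dev_fraction=2 / 3, seed=0):
    """Split instances into development and held-out test sets, then
    under-sample the development negatives.

    Assumption: positives and negatives are split separately so that both
    sets keep the original class ratio; the abstract only says the split
    was random.
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_dev_pos = round(len(pos) * dev_fraction)
    n_dev_neg = round(len(neg) * dev_fraction)
    dev_pos, test_pos = pos[:n_dev_pos], pos[n_dev_pos:]
    dev_neg, test_neg = neg[:n_dev_neg], neg[n_dev_neg:]

    # Under-sampling: keep only neg_per_pos negatives per positive for training.
    n_keep = min(len(dev_neg), round(len(dev_pos) * neg_per_pos))
    train_neg = rng.sample(dev_neg, n_keep)
    return dev_pos, train_neg, test_pos, test_neg


def k_fold(items, k=5):
    """Yield (train, held_out) pairs for k-fold cross-validation."""
    for i in range(k):
        held_out = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out
```

The held-out test negatives are deliberately left at their natural ratio, so that test performance reflects the imbalance a deployed predictor would face; only the training negatives are thinned.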
We used four encoding methods for feature vector extraction: binary encoding, composition of k-spaced amino acid pairs (CKSAAP), monomer spectrum (MS), and composition of monomer spectrum (CMS). These encoding methods yielded different prediction performance for O-GlcNAcylation sites. The results showed that the method obtained an AUC (area under the curve) of ∼80% on the test sequence set. The dbOGAP database and the O-GlcNAcylation site prediction tool are being made web accessible, and the web resource will be an important bioinformatics tool to facilitate exploration of the broad roles of O-GlcNAcylation in physiology and diseases.
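Two of the encodings named above, binary encoding and CKSAAP, can be illustrated with a short sketch. The abstract does not state the window size, the range of k, or how terminal positions were handled, so the flank of 10 residues, `k_max=2`, and the `X` padding character below are assumptions chosen only for illustration; the MS and CMS encodings are not sketched because the abstract does not define them.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def extract_window(sequence, site, flank=10, pad="X"):
    """Return the fragment centred on a 0-based Ser/Thr position,
    padded with `pad` where the window runs past either terminus."""
    left = sequence[max(0, site - flank):site]
    right = sequence[site + 1:site + 1 + flank]
    return (pad * (flank - len(left)) + left + sequence[site]
            + right + pad * (flank - len(right)))


def binary_encode(window):
    """One-hot encode each residue as 20 bits; padding becomes all zeros."""
    vec = []
    for res in window:
        vec.extend(1 if res == aa else 0 for aa in AMINO_ACIDS)
    return vec


def cksaap_encode(window, k_max=2):
    """Composition of k-spaced amino acid pairs for k = 0..k_max.

    For each k, count ordered residue pairs separated by k positions and
    normalise by the number of pairs counted (pairs involving padding are
    skipped, which is an assumption of this sketch).
    """
    vec = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for i in range(len(window) - k - 1):
            pair = window[i] + window[i + k + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        vec.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return vec
```

With these assumed settings, a 21-residue window gives a 420-dimensional binary vector and a 1200-dimensional CKSAAP vector; either could then be passed to an SVM implementation of choice.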

Original language: English (US)
Title of host publication: Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009
Pages: 346
Number of pages: 1
DOI: https://doi.org/10.1109/BIBMW.2009.5332094
State: Published - 2009
Event: 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009 - Washington, DC, United States
Duration: Nov 1 2009 to Nov 4 2009

Fingerprint

Bioinformatics
Computational Biology
Proteins
Databases
Phosphorylation
Glycosylation
Gene Ontology
Post Translational Protein Processing
Classifiers
Physiology
Datasets
Support vector machines
Ontology
Protein Databases
Acetylglucosamine
Monosaccharides
Genes
Monomers
Mucins
Nuclear Proteins

Keywords

  • Database
  • O-GlcNAcylation
  • Protein glycosylation
  • Site prediction
  • Support vector machine

ASJC Scopus subject areas

  • Biomedical Engineering
  • Health Informatics
  • Health Information Management

Cite this

Hu, Z. Z., Torii, M., Wang, J., Liu, H., & Hart, G. W. (2009). Development of dbOGAP: A bioinformatics resource of O-GlcNAcylated proteins and site prediction. In Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009 (pp. 346). [5332094] https://doi.org/10.1109/BIBMW.2009.5332094

@inproceedings{c38dd61062d04e249e39d7fcabf7c4ff,
title = "Development of dbOGAP: A bioinformatics resource of O-GlcNAcylated proteins and site prediction",
keywords = "Database, O-GlcNAcylation, Protein glycosylation, Site prediction, Support vector machine",
author = "Hu, {Zhang Zhi} and Manabu Torii and Jinlian Wang and Hongfang Liu and Hart, {Gerald Warren}",
year = "2009",
doi = "10.1109/BIBMW.2009.5332094",
language = "English (US)",
isbn = "9781424451210",
pages = "346",
booktitle = "Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009",

}

TY - GEN

T1 - Development of dbOGAP

T2 - A bioinformatics resource of O-GlcNAcylated proteins and site prediction

AU - Hu, Zhang Zhi

AU - Torii, Manabu

AU - Wang, Jinlian

AU - Liu, Hongfang

AU - Hart, Gerald Warren

PY - 2009

Y1 - 2009

KW - Database

KW - O-GlcNAcylation

KW - Protein glycosylation

KW - Site prediction

KW - Support vector machine

UR - http://www.scopus.com/inward/record.url?scp=72849130477&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=72849130477&partnerID=8YFLogxK

U2 - 10.1109/BIBMW.2009.5332094

DO - 10.1109/BIBMW.2009.5332094

M3 - Conference contribution

SN - 9781424451210

SP - 346

BT - Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009

ER -