Protein family and fold occurrence in genomes

Power-law behaviour and evolutionary model

Jiang Qian, Nicholas M. Luscombe, Mark Gerstein

Research output: Contribution to journal > Article

Abstract

Global surveys of genomes measure the usage of essential molecular parts, defined here as protein families, superfamilies or folds, in different organisms. Based on surveys of the first 20 completely sequenced genomes, we observe that the occurrence of these parts follows a power-law distribution. That is, the number of distinct parts (F) with a given genomic occurrence (V) decays as F = aV^{-b}, with a few parts occurring many times and most occurring infrequently. For a given organism, the distributions of families, superfamilies and folds are nearly identical, and this is reflected in the size of the decay exponent b. Moreover, the exponent varies between different organisms, with those of smaller genomes displaying a steeper decay (i.e. larger b). Clearly, the power law indicates a preference to duplicate genes that encode for molecular parts which are already common. Here, we present a minimal, but biologically meaningful model that accurately describes the observed power law. Although the model performs equally well for all three protein classes, we focus on the occurrence of folds in preference to families and superfamilies. This is because folds are comparatively insensitive to the effects of point mutations that can cause a family member to diverge beyond detectable similarity. In the model, genomes evolve through two basic operations: (i) duplication of existing genes; (ii) net flow of new genes. The flow term is closely related to the exponent b and can accommodate considerable gene loss; however, we demonstrate that the observed data is reproduced best with a net inflow, i.e. with more gene gain than loss. Moreover, we show that prokaryotes have much higher rates of gene acquisition than eukaryotes, probably reflecting lateral transfer. A further natural outcome from our model is an estimation of the fold composition of the initial genome, which potentially relates to the common ancestor for modern organisms.
Supplementary material pertaining to this work is available from www.partslist.org/powerlaw.
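
The two operations named in the abstract, duplication of existing genes and inflow of new genes, can be illustrated with a small toy simulation. The following is a minimal sketch and not the authors' implementation: the function names (`simulate_genome`, `occurrence_histogram`, `fit_power_law`), the parameters and their default values are all assumptions made for illustration, gene loss is ignored, and the exponent is estimated with a crude least-squares line through the log-log histogram, which is only one of several possible fitting choices.

```python
import math
import random
from collections import Counter


def simulate_genome(n_initial=20, steps=2000, dups_per_step=5,
                    new_per_step=1, seed=0):
    """Grow a toy genome by gene duplication plus inflow of new folds.

    Each list entry is one gene, labelled by its fold. Choosing a gene
    uniformly at random selects a fold in proportion to its current
    occurrence, so common folds are duplicated more often.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    genes = list(range(n_initial))  # initial genome: one gene per fold
    next_fold = n_initial
    for _ in range(steps):
        for _ in range(dups_per_step):
            genes.append(rng.choice(genes))  # duplicate an existing gene
        for _ in range(new_per_step):
            genes.append(next_fold)          # inflow: a gene with a new fold
            next_fold += 1
    return genes


def occurrence_histogram(genes):
    """Return {V: F}: the number of folds F that occur exactly V times."""
    per_fold = Counter(genes)
    return Counter(per_fold.values())


def fit_power_law(hist):
    """Fit F = a * V**(-b) by least squares on (log V, log F).

    Returns (a, b). This simple fit weights the noisy tail as heavily
    as the head, so treat the result as a rough estimate.
    """
    points = [(math.log(v), math.log(f)) for v, f in hist.items()]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    hist = occurrence_histogram(simulate_genome())
    a, b = fit_power_law(hist)
    print(f"fitted a = {a:.1f}, b = {b:.2f}")
```

Changing `new_per_step` relative to `dups_per_step` is a simple way to explore how the inflow of new folds affects the fitted exponent in this toy setting.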

Original language: English (US)
Pages (from-to): 673-681
Number of pages: 9
Journal: Journal of Molecular Biology
Volume: 313
Issue number: 4
DOIs: 10.1006/jmbi.2001.5079
State: Published - Nov 2 2001
Externally published: Yes

Fingerprint

Genome
Proteins
Duplicate Genes
Genes
Gene Duplication
Gene Flow
Eukaryota
Point Mutation
Power (Psychology)
Surveys and Questionnaires

Keywords

  • Bioinformatics
  • Evolution
  • Genomics
  • Power law
  • Protein families
  • Protein folds
  • Protein superfamilies
  • Proteomics

ASJC Scopus subject areas

  • Virology

Cite this

Protein family and fold occurrence in genomes : Power-law behaviour and evolutionary model. / Qian, Jiang; Luscombe, Nicholas M.; Gerstein, Mark.

In: Journal of Molecular Biology, Vol. 313, No. 4, 02.11.2001, p. 673-681.

Qian, Jiang ; Luscombe, Nicholas M. ; Gerstein, Mark. / Protein family and fold occurrence in genomes : Power-law behaviour and evolutionary model. In: Journal of Molecular Biology. 2001 ; Vol. 313, No. 4. pp. 673-681.
@article{58429ba6654241688f262f2eec10713b,
title = "Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model",
abstract = "Global surveys of genomes measure the usage of essential molecular parts, defined here as protein families, superfamilies or folds, in different organisms. Based on surveys of the first 20 completely sequenced genomes, we observe that the occurrence of these parts follows a power-law distribution. That is, the number of distinct parts (F) with a given genomic occurrence (V) decays as $F = aV^{-b}$, with a few parts occurring many times and most occurring infrequently. For a given organism, the distributions of families, superfamilies and folds are nearly identical, and this is reflected in the size of the decay exponent b. Moreover, the exponent varies between different organisms, with those of smaller genomes displaying a steeper decay (i.e. larger b). Clearly, the power law indicates a preference to duplicate genes that encode for molecular parts which are already common. Here, we present a minimal, but biologically meaningful model that accurately describes the observed power law. Although the model performs equally well for all three protein classes, we focus on the occurrence of folds in preference to families and superfamilies. This is because folds are comparatively insensitive to the effects of point mutations that can cause a family member to diverge beyond detectable similarity. In the model, genomes evolve through two basic operations: (i) duplication of existing genes; (ii) net flow of new genes. The flow term is closely related to the exponent b and can accommodate considerable gene loss; however, we demonstrate that the observed data is reproduced best with a net inflow, i.e. with more gene gain than loss. Moreover, we show that prokaryotes have much higher rates of gene acquisition than eukaryotes, probably reflecting lateral transfer. A further natural outcome from our model is an estimation of the fold composition of the initial genome, which potentially relates to the common ancestor for modern organisms. 
Supplementary material pertaining to this work is available from www.partslist.org/powerlaw.",
keywords = "Bioinformatics, Evolution, Genomics, Power law, Protein families, Protein folds, Protein superfamilies, Proteomics",
author = "Jiang Qian and Luscombe, {Nicholas M.} and Mark Gerstein",
year = "2001",
month = "11",
day = "2",
doi = "10.1006/jmbi.2001.5079",
language = "English (US)",
volume = "313",
pages = "673--681",
journal = "Journal of Molecular Biology",
issn = "0022-2836",
publisher = "Academic Press Inc.",
number = "4",

}

TY - JOUR

T1 - Protein family and fold occurrence in genomes

T2 - Power-law behaviour and evolutionary model

AU - Qian, Jiang

AU - Luscombe, Nicholas M.

AU - Gerstein, Mark

PY - 2001/11/2

Y1 - 2001/11/2

AB - Global surveys of genomes measure the usage of essential molecular parts, defined here as protein families, superfamilies or folds, in different organisms. Based on surveys of the first 20 completely sequenced genomes, we observe that the occurrence of these parts follows a power-law distribution. That is, the number of distinct parts (F) with a given genomic occurrence (V) decays as F = aV^{-b}, with a few parts occurring many times and most occurring infrequently. For a given organism, the distributions of families, superfamilies and folds are nearly identical, and this is reflected in the size of the decay exponent b. Moreover, the exponent varies between different organisms, with those of smaller genomes displaying a steeper decay (i.e. larger b). Clearly, the power law indicates a preference to duplicate genes that encode for molecular parts which are already common. Here, we present a minimal, but biologically meaningful model that accurately describes the observed power law. Although the model performs equally well for all three protein classes, we focus on the occurrence of folds in preference to families and superfamilies. This is because folds are comparatively insensitive to the effects of point mutations that can cause a family member to diverge beyond detectable similarity. In the model, genomes evolve through two basic operations: (i) duplication of existing genes; (ii) net flow of new genes. The flow term is closely related to the exponent b and can accommodate considerable gene loss; however, we demonstrate that the observed data is reproduced best with a net inflow, i.e. with more gene gain than loss. Moreover, we show that prokaryotes have much higher rates of gene acquisition than eukaryotes, probably reflecting lateral transfer. A further natural outcome from our model is an estimation of the fold composition of the initial genome, which potentially relates to the common ancestor for modern organisms.
Supplementary material pertaining to this work is available from www.partslist.org/powerlaw.

KW - Bioinformatics

KW - Evolution

KW - Genomics

KW - Power law

KW - Protein families

KW - Protein folds

KW - Protein superfamilies

KW - Proteomics

UR - http://www.scopus.com/inward/record.url?scp=0035798398&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0035798398&partnerID=8YFLogxK

U2 - 10.1006/jmbi.2001.5079

DO - 10.1006/jmbi.2001.5079

M3 - Article

VL - 313

SP - 673

EP - 681

JO - Journal of Molecular Biology

JF - Journal of Molecular Biology

SN - 0022-2836

IS - 4

ER -