Document clustering using small world communities

Brant W. Chee, Bruce Schatz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Words in natural language documents exhibit a small world network structure. Thus, the physics community provides us with an extensive supply of algorithms for extracting community structure. We present a novel method for semantically clustering a large collection of documents using small world communities. This method combines modified physics algorithms with traditional information retrieval techniques. A term network is generated from the document collection, the terms are clustered into small world communities, and the semantic term clusters are used to generate overlapping document clusters. The algorithm combines the speed of single-link clustering with the quality of complete-link clustering. Clustering takes place in nearly real time, and the results are judged to be coherent by expert users. Our algorithm occupies a middle ground between speed and quality of document clustering.
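
The three-stage pipeline named in the abstract (term network, term communities, overlapping document clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the modified physics algorithms, so connected components of a thresholded co-occurrence graph stand in for community detection here. The function names, the `min_cooccurrence` and `min_shared` thresholds, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def term_network(docs, min_cooccurrence=2):
    """Link two terms when they co-occur in at least min_cooccurrence documents."""
    counts = defaultdict(int)
    for terms in docs.values():
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def term_communities(graph):
    """Stand-in for community detection: connected components of the term graph."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        communities.append(component)
    return communities


def document_clusters(docs, communities, min_shared=2):
    """Put a document in every cluster whose term community shares >= min_shared terms with it."""
    clusters = []
    for community in communities:
        members = {doc_id for doc_id, terms in docs.items()
                   if len(community & set(terms)) >= min_shared}
        clusters.append((community, members))
    return clusters


# Hypothetical toy collection: document id -> extracted terms.
docs = {
    "d1": ["graph", "network", "community"],
    "d2": ["graph", "network", "cluster"],
    "d3": ["gene", "protein", "sequence"],
    "d4": ["gene", "protein", "network", "graph"],
}
graph = term_network(docs)
communities = term_communities(graph)
clusters = document_clusters(docs, communities)
for community, members in clusters:
    print(sorted(community), sorted(members))
```

In this toy run, document `d4` lands in both clusters because it shares two terms with each term community, which is what "overlapping" means in this sketch. A real system would replace `term_communities` with a proper community-detection algorithm and weight terms with standard information retrieval measures.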

Original language: English (US)
Title of host publication: Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007
Subtitle of host publication: Building and Sustaining the Digital Environment
Pages: 53-62
Number of pages: 10
DOIs: https://doi.org/10.1145/1255175.1255186
State: Published - Nov 29 2007
Externally published: Yes
Event: 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment - Vancouver, BC, Canada
Duration: Jun 18 2007 to Jun 23 2007

Fingerprint

community
Physics
Small-world networks
Information retrieval
Semantics
expert
supply
language
time

Keywords

  • Community structure
  • Document clustering
  • Scale-free networks
  • Semantic clustering
  • Small worlds

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Cite this

Chee, B. W., & Schatz, B. (2007). Document clustering using small world communities. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment (pp. 53-62) https://doi.org/10.1145/1255175.1255186

@inproceedings{4c7e48a8893443f28049a8bd74299705,
title = "Document clustering using small world communities",
abstract = "Words in natural language documents exhibit a small world network structure. Thus, the physics community provides us with an extensive supply of algorithms for extracting community structure. We present a novel method for semantically clustering a large collection of documents using small world communities. This method combines modified physics algorithms with traditional information retrieval techniques. A term network is generated from the document collection, the terms are clustered into small world communities, and the semantic term clusters are used to generate overlapping document clusters. The algorithm combines the speed of single-link clustering with the quality of complete-link clustering. Clustering takes place in nearly real time, and the results are judged to be coherent by expert users. Our algorithm occupies a middle ground between speed and quality of document clustering.",
keywords = "Community structure, Document clustering, Scale-free networks, Semantic clustering, Small worlds",
author = "Chee, Brant W. and Schatz, Bruce",
year = "2007",
month = "11",
day = "29",
doi = "10.1145/1255175.1255186",
language = "English (US)",
isbn = "1595936440",
pages = "53--62",
booktitle = "Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007",
}

TY - GEN
T1 - Document clustering using small world communities
AU - Chee, Brant W.
AU - Schatz, Bruce
PY - 2007/11/29
Y1 - 2007/11/29
AB - Words in natural language documents exhibit a small world network structure. Thus, the physics community provides us with an extensive supply of algorithms for extracting community structure. We present a novel method for semantically clustering a large collection of documents using small world communities. This method combines modified physics algorithms with traditional information retrieval techniques. A term network is generated from the document collection, the terms are clustered into small world communities, and the semantic term clusters are used to generate overlapping document clusters. The algorithm combines the speed of single-link clustering with the quality of complete-link clustering. Clustering takes place in nearly real time, and the results are judged to be coherent by expert users. Our algorithm occupies a middle ground between speed and quality of document clustering.
KW - Community structure
KW - Document clustering
KW - Scale-free networks
KW - Semantic clustering
KW - Small worlds
UR - http://www.scopus.com/inward/record.url?scp=36348957963&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=36348957963&partnerID=8YFLogxK
U2 - 10.1145/1255175.1255186
DO - 10.1145/1255175.1255186
M3 - Conference contribution
AN - SCOPUS:36348957963
SN - 1595936440
SN - 9781595936448
SP - 53
EP - 62
BT - Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007
ER -