Document clustering using small world communities

Brant W. Chee, Bruce Schatz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Words in natural language documents exhibit a small world network structure, so the extensive supply of algorithms that the physics community has developed for extracting community structure can be applied to them. We present a novel method for semantically clustering a large collection of documents using small world communities. This method combines modified physics algorithms with traditional information retrieval techniques. A term network is generated from the document collection, the terms are clustered into small world communities, and the resulting semantic term clusters are used to generate overlapping document clusters. The algorithm combines the speed of single-link clustering with the quality of complete-link clustering. Clustering takes place in near real time, and the results are judged to be coherent by expert users. Our algorithm occupies a middle ground between speed and quality of document clustering.
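The three stages named in the abstract (term network, term communities, overlapping document clusters) can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the modified physics algorithms, so connected components of a thresholded co-occurrence network stand in for community detection here. The function names, the whitespace tokenizer, the `min_weight` and `min_shared` parameters, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_term_network(docs):
    """Count, for each pair of terms, the number of documents containing both."""
    weights = defaultdict(int)
    for doc in docs:
        terms = sorted(set(doc.lower().split()))  # naive whitespace tokenizer (assumption)
        for a, b in combinations(terms, 2):
            weights[(a, b)] += 1
    return weights


def term_communities(weights, min_weight=2):
    """Stand-in for community detection (assumption): connected components
    of the network after dropping edges with weight below min_weight."""
    adjacency = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, communities = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        communities.append(component)
    return communities


def document_clusters(docs, communities, min_shared=2):
    """Assign each document to every community it shares at least min_shared
    terms with, so a document can land in more than one cluster."""
    clusters = []
    for community in communities:
        members = [i for i, doc in enumerate(docs)
                   if len(community & set(doc.lower().split())) >= min_shared]
        clusters.append(members)
    return clusters


# Toy collection (hypothetical data, not from the paper).
docs = [
    "gene protein expression",
    "gene protein sequence",
    "network graph cluster",
    "network graph protein gene",
]
communities = term_communities(build_term_network(docs))
clusters = document_clusters(docs, communities)
```

In this toy run the last document contains terms from both term communities, so it appears in both document clusters, which is the overlap the abstract refers to.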

Original language: English (US)
Title of host publication: Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007
Subtitle of host publication: Building and Sustaining the Digital Environment
Pages: 53-62
Number of pages: 10
State: Published - Nov 29 2007
Event: 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment - Vancouver, BC, Canada
Duration: Jun 18 2007 to Jun 23 2007

Publication series

Name: Proceedings of the ACM International Conference on Digital Libraries

Other

Other: 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment
Country: Canada
City: Vancouver, BC
Period: 6/18/07 to 6/23/07

Keywords

  • Community structure
  • Document clustering
  • Scale-free networks
  • Semantic clustering
  • Small worlds

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences
