Identity uncertainty and citation matching

Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching-the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers.

Original languageEnglish (US)
Title of host publicationAdvances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002
PublisherNeural information processing systems foundation
ISBN (Print)0262025507, 9780262025508
StatePublished - Jan 1 2003
Event16th Annual Neural Information Processing Systems Conference, NIPS 2002 - Vancouver, BC, Canada
Duration: Dec 9 2002Dec 14 2002

Publication series

NameAdvances in Neural Information Processing Systems
ISSN (Print)1049-5258

Other

Other16th Annual Neural Information Processing Systems Conference, NIPS 2002
CountryCanada
CityVancouver, BC
Period12/9/0212/14/02

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Fingerprint Dive into the research topics of 'Identity uncertainty and citation matching'. Together they form a unique fingerprint.

Cite this