Identity uncertainty and citation matching

Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching: the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers.
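
The inference idea in the abstract (sampling over which citations refer to the same publication) can be illustrated with a small sketch. This is not the paper's relational probability model: the pairwise string-similarity score, the `threshold` and `scale` parameters, and the function names `pair_weight` and `gibbs_citation_clusters` are all assumptions made for illustration, and the sampler is a plain Gibbs sampler rather than the specialized proposals the authors describe. The target distribution over partitions is assumed to be proportional to the exponentiated sum of pair weights for co-clustered citations.

```python
import math
import random
from difflib import SequenceMatcher


def pair_weight(a, b, threshold=0.6, scale=20.0):
    """Toy log-odds that two citation strings denote the same paper.

    Hypothetical stand-in for a real generative model: positive when the
    strings are more similar than `threshold`, negative otherwise.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return scale * (ratio - threshold)


def gibbs_citation_clusters(citations, sweeps=50, seed=0):
    """Draw one sample of a partition of citations into publications."""
    rng = random.Random(seed)
    n = len(citations)

    # Symmetric weight matrix: compute each pair once and mirror it,
    # because SequenceMatcher.ratio() can depend on argument order.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = pair_weight(citations[i], citations[j])

    labels = list(range(n))  # start with every citation in its own cluster
    for _ in range(sweeps):
        for i in range(n):
            # Options for citation i: join any cluster that the other
            # citations occupy, or sit alone in a fresh cluster.
            existing = sorted({labels[j] for j in range(n) if j != i})
            fresh = max(labels) + 1
            options = existing + [fresh]

            # Unnormalized log-probability of each option; the fresh
            # cluster has no members, so its score is the empty sum 0.
            log_scores = [
                sum(w[i][j] for j in range(n) if j != i and labels[j] == c)
                for c in options
            ]
            top = max(log_scores)
            weights = [math.exp(s - top) for s in log_scores]
            labels[i] = rng.choices(options, weights=weights)[0]

    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    return sorted(clusters.values())


if __name__ == "__main__":
    example = [
        "Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I. "
        "Identity uncertainty and citation matching. NIPS 2002.",
        "H. Pasula, B. Marthi, B. Milch, S. Russell, I. Shpitser. "
        "Identity uncertainty and citation matching. In NIPS 15, 2003.",
        "An unrelated citation about a completely different topic, 1999.",
    ]
    print(gibbs_citation_clusters(example))
```

Each returned cluster is a list of indices into the input. A single run yields one sample; estimating how likely two citations are to co-refer would require averaging over many samples, which this sketch leaves out.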

Original language: English (US)
Title of host publication: Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002
Publisher: Neural information processing systems foundation
ISBN (Print): 0262025507, 9780262025508
State: Published - Jan 1 2003
Externally published: Yes
Event: 16th Annual Neural Information Processing Systems Conference, NIPS 2002 - Vancouver, BC, Canada
Duration: Dec 9 2002 to Dec 14 2002

Fingerprint

Markov processes
Uncertainty

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

Pasula, H., Marthi, B., Milch, B., Russell, S., & Shpitser, I. (2003). Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002. Neural information processing systems foundation.

@inproceedings{742e8218924a42918cf025271e9c43c5,
title = "Identity uncertainty and citation matching",
author = "Hanna Pasula and Bhaskara Marthi and Brian Milch and Stuart Russell and Ilya Shpitser",
year = "2003",
month = "1",
day = "1",
language = "English (US)",
isbn = "0262025507",
booktitle = "Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002",
publisher = "Neural information processing systems foundation",

}

TY - GEN
T1 - Identity uncertainty and citation matching
AU - Pasula, Hanna
AU - Marthi, Bhaskara
AU - Milch, Brian
AU - Russell, Stuart
AU - Shpitser, Ilya
PY - 2003/1/1
Y1 - 2003/1/1
UR - http://www.scopus.com/inward/record.url?scp=84898987614&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84898987614&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84898987614
SN - 0262025507
SN - 9780262025508
BT - Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002
PB - Neural information processing systems foundation
ER -