Identity uncertainty and citation matching

Hanna Pasula; Bhaskara Marthi; Brian Milch; Stuart Russell; Ilya Shpitser

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching: the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers.

Original language	English (US)
Title of host publication	Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002
Publisher	Neural Information Processing Systems Foundation
ISBN (Print)	0262025507, 9780262025508
State	Published - 2003
Externally published	Yes
Event	16th Annual Neural Information Processing Systems Conference, NIPS 2002 - Vancouver, BC, Canada
Duration	Dec 9 2002 → Dec 14 2002

Publication series

Name	Advances in Neural Information Processing Systems
ISSN (Print)	1049-5258

Other

Other	16th Annual Neural Information Processing Systems Conference, NIPS 2002
Country/Territory	Canada
City	Vancouver, BC
Period	12/9/02 → 12/14/02

ASJC Scopus subject areas

Computer Networks and Communications
Information Systems
Signal Processing

Cite this

Identity uncertainty and citation matching. / Pasula, Hanna; Marthi, Bhaskara; Milch, Brian et al.
Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002. Neural Information Processing Systems Foundation, 2003. (Advances in Neural Information Processing Systems).

Pasula, H, Marthi, B, Milch, B, Russell, S & Shpitser, I 2003, Identity uncertainty and citation matching. in Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002. Advances in Neural Information Processing Systems, Neural Information Processing Systems Foundation, 16th Annual Neural Information Processing Systems Conference, NIPS 2002, Vancouver, BC, Canada, 12/9/02.

@inproceedings{742e8218924a42918cf025271e9c43c5,

title = "Identity uncertainty and citation matching",

abstract = "Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching: the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers.",

author = "Hanna Pasula and Bhaskara Marthi and Brian Milch and Stuart Russell and Ilya Shpitser",

year = "2003",

language = "English (US)",

isbn = "0262025507",

series = "Advances in Neural Information Processing Systems",

publisher = "Neural Information Processing Systems Foundation",

booktitle = "Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002",

note = "16th Annual Neural Information Processing Systems Conference, NIPS 2002; conference date: 09-12-2002 through 14-12-2002",

}

TY - GEN

T1 - Identity uncertainty and citation matching

AU - Pasula, Hanna

AU - Marthi, Bhaskara

AU - Milch, Brian

AU - Russell, Stuart

AU - Shpitser, Ilya

PY - 2003

Y1 - 2003

N2 - Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching: the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers.

UR - http://www.scopus.com/inward/record.url?scp=84898987614&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84898987614&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84898987614

SN - 0262025507

SN - 9780262025508

T3 - Advances in Neural Information Processing Systems

BT - Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002

PB - Neural Information Processing Systems Foundation

T2 - 16th Annual Neural Information Processing Systems Conference, NIPS 2002

Y2 - 9 December 2002 through 14 December 2002

ER -
