TY - GEN
T1 - Identity uncertainty and citation matching
AU - Pasula, Hanna
AU - Marthi, Bhaskara
AU - Milch, Brian
AU - Russell, Stuart
AU - Shpitser, Ilya
PY - 2003
Y1 - 2003
N2 - Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching, the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers.
AB - Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching, the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers.
UR - http://www.scopus.com/inward/record.url?scp=84898987614&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84898987614&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84898987614
SN - 0262025507
SN - 9780262025508
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 15 - Proceedings of the 2002 Conference, NIPS 2002
PB - Neural Information Processing Systems Foundation
T2 - 16th Annual Neural Information Processing Systems Conference, NIPS 2002
Y2 - 9 December 2002 through 14 December 2002
ER -