Measures of semantic similarity and relatedness in the biomedical domain

Ted Pedersen, Serguei V S Pakhomov, Siddharth Patwardhan, Christopher Chute

Research output: Contribution to journal › Article

Abstract

Measures of semantic similarity between concepts are widely used in Natural Language Processing. In this article, we show how six existing domain-independent measures can be adapted to the biomedical domain. These measures were originally based on WordNet, an English lexical database of concepts and relations. In this research, we adapt these measures to the SNOMED-CT® ontology of medical concepts. The measures include two path-based measures and three measures that augment path-based measures with information content statistics from corpora. We also derive a context vector measure based on medical corpora that can be used as a measure of semantic relatedness. These six measures are evaluated against a newly created test bed of 30 medical concept pairs scored by three physicians and nine medical coders. We find that the medical coders and physicians differ in their ratings, and that the context vector measure correlates most closely with the physicians, while the path-based measures and one of the information content measures correlate most closely with the medical coders. We conclude that there is a role both for more flexible measures of relatedness based on information derived from corpora and for measures that rely on existing ontological structures.
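
To make the two families of measures named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy is-a hierarchy with invented concept names (not real SNOMED-CT content), a path measure defined as the reciprocal of the number of nodes on the shortest is-a path, and context vectors compared by cosine. The names `PARENT`, `ancestors`, `path_similarity`, and `cosine` are hypothetical.

```python
import math

# Toy is-a hierarchy (child -> parent). Illustrative names only,
# not taken from SNOMED-CT or from the article.
PARENT = {
    "disease": None,
    "heart disease": "disease",
    "lung disease": "disease",
    "myocardial infarction": "heart disease",
    "angina": "heart disease",
    "pneumonia": "lung disease",
}


def ancestors(concept):
    """Return the chain from the concept up to the root, concept first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def path_similarity(a, b):
    """Assumed path measure: 1 / (nodes on the shortest is-a path)."""
    up_a = ancestors(a)
    steps_from_b = {c: i for i, c in enumerate(ancestors(b))}
    for steps_from_a, c in enumerate(up_a):
        if c in steps_from_b:
            # c is the lowest common ancestor; count edges on both sides.
            edges = steps_from_a + steps_from_b[c]
            return 1.0 / (edges + 1)
    raise ValueError("concepts share no ancestor")


def cosine(u, v):
    """Cosine of the angle between two equal-length context vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

In this sketch, siblings under the same parent score higher on the path measure than concepts that meet only at the root, while the cosine score depends only on co-occurrence counts and needs no hierarchy at all. That difference is one way to read the abstract's distinction between ontology-based similarity and corpus-based relatedness.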

Original language: English (US)
Pages (from-to): 288-299
Number of pages: 12
Journal: Journal of Biomedical Informatics
Volume: 40
Issue number: 3
DOI: 10.1016/j.jbi.2006.06.004
State: Published - Jun 2007
Externally published: Yes

Keywords

  • Context vectors
  • Information content
  • Path based measures
  • Semantic similarity
  • SNOMED-CT

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics
  • Computer Science (miscellaneous)
  • Catalysis

Cite this

Measures of semantic similarity and relatedness in the biomedical domain. / Pedersen, Ted; Pakhomov, Serguei V S; Patwardhan, Siddharth; Chute, Christopher.

In: Journal of Biomedical Informatics, Vol. 40, No. 3, 06.2007, p. 288-299.

Pedersen, Ted ; Pakhomov, Serguei V S ; Patwardhan, Siddharth ; Chute, Christopher. / Measures of semantic similarity and relatedness in the biomedical domain. In: Journal of Biomedical Informatics. 2007 ; Vol. 40, No. 3. pp. 288-299.
@article{468e3c9ce9bf498db055273610e68adf,
title = "Measures of semantic similarity and relatedness in the biomedical domain",
abstract = "Measures of semantic similarity between concepts are widely used in Natural Language Processing. In this article, we show how six existing domain-independent measures can be adapted to the biomedical domain. These measures were originally based on WordNet, an English lexical database of concepts and relations. In this research, we adapt these measures to the SNOMED-CT{\circledR} ontology of medical concepts. The measures include two path-based measures and three measures that augment path-based measures with information content statistics from corpora. We also derive a context vector measure based on medical corpora that can be used as a measure of semantic relatedness. These six measures are evaluated against a newly created test bed of 30 medical concept pairs scored by three physicians and nine medical coders. We find that the medical coders and physicians differ in their ratings, and that the context vector measure correlates most closely with the physicians, while the path-based measures and one of the information content measures correlate most closely with the medical coders. We conclude that there is a role both for more flexible measures of relatedness based on information derived from corpora and for measures that rely on existing ontological structures.",
keywords = "Context vectors, Information content, Path based measures, Semantic similarity, SNOMED-CT",
author = "Ted Pedersen and Pakhomov, {Serguei V S} and Siddharth Patwardhan and Christopher Chute",
year = "2007",
month = "6",
doi = "10.1016/j.jbi.2006.06.004",
language = "English (US)",
volume = "40",
pages = "288--299",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",
number = "3",

}

TY - JOUR

T1 - Measures of semantic similarity and relatedness in the biomedical domain

AU - Pedersen, Ted

AU - Pakhomov, Serguei V S

AU - Patwardhan, Siddharth

AU - Chute, Christopher

PY - 2007/6

Y1 - 2007/6

N2 - Measures of semantic similarity between concepts are widely used in Natural Language Processing. In this article, we show how six existing domain-independent measures can be adapted to the biomedical domain. These measures were originally based on WordNet, an English lexical database of concepts and relations. In this research, we adapt these measures to the SNOMED-CT® ontology of medical concepts. The measures include two path-based measures and three measures that augment path-based measures with information content statistics from corpora. We also derive a context vector measure based on medical corpora that can be used as a measure of semantic relatedness. These six measures are evaluated against a newly created test bed of 30 medical concept pairs scored by three physicians and nine medical coders. We find that the medical coders and physicians differ in their ratings, and that the context vector measure correlates most closely with the physicians, while the path-based measures and one of the information content measures correlate most closely with the medical coders. We conclude that there is a role both for more flexible measures of relatedness based on information derived from corpora and for measures that rely on existing ontological structures.

KW - Context vectors

KW - Information content

KW - Path based measures

KW - Semantic similarity

KW - SNOMED-CT

UR - http://www.scopus.com/inward/record.url?scp=34248172904&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34248172904&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2006.06.004

DO - 10.1016/j.jbi.2006.06.004

M3 - Article

C2 - 16875881

AN - SCOPUS:34248172904

VL - 40

SP - 288

EP - 299

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

SN - 1532-0464

IS - 3

ER -