Identification of key concepts in biomedical literature using a modified Markov heuristic

W. H. Majoros, G. M. Subramanian, M. D. Yandell

Research output: Contribution to journal › Article

Abstract

Motivation: The recent explosion of interest in mining the biomedical literature for associations between defined entities such as genes, diseases and drugs has made apparent the need for robust methods of identifying occurrences of these entities in biomedical text. Such concept-based indexing is strongly dependent on the availability of a comprehensive ontology or lexicon of biomedical terms. However, such ontologies are very difficult and expensive to construct, and often require extensive manual curation to render them suitable for use by automatic indexing programs. Furthermore, the use of statistically salient noun phrases as surrogates for curated terminology is not without difficulties, due to the lack of high-quality part-of-speech taggers specific to medical nomenclature.

Results: We describe a method of improving the quality of automatically extracted noun phrases by employing prior knowledge during the HMM training procedure for the tagger. This enhancement, when combined with appropriate training data, can greatly improve the quality and relevance of the extracted phrases, thereby enabling greater accuracy in downstream literature mining tasks.
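The abstract does not spell out how prior knowledge enters HMM training, so the following is only an illustrative sketch of one common way to do it, not the authors' procedure. It assumes a supervised (count-based) HMM part-of-speech tagger in which a small domain lexicon seeds the emission counts as pseudo-counts before the tagged corpus is counted. All names (`train_hmm`, `lexicon_prior`, `prior_weight`), the toy corpus and the tag set are hypothetical.

```python
import math
from collections import defaultdict

def train_hmm(tagged_sentences, lexicon_prior=None, prior_weight=5.0, smoothing=0.1):
    """Count-based HMM training; lexicon entries act as emission pseudo-counts."""
    trans = defaultdict(lambda: defaultdict(float))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(float))   # emit[tag][word]
    # Assumed form of "prior knowledge": a word -> tag lexicon seeds the counts.
    for word, tag in (lexicon_prior or {}).items():
        emit[tag][word] += prior_weight
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1.0
            emit[tag][word] += 1.0
            prev = tag
    tags = sorted(emit)
    vocab = {w for t in tags for w in emit[t]}
    return trans, emit, tags, vocab, smoothing

def log_prob(counts, key, n_outcomes, k):
    """Add-k smoothed log probability of `key` under a count table."""
    total = sum(counts.values())
    return math.log((counts.get(key, 0.0) + k) / (total + k * n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    trans, emit, tags, vocab, k = model
    n_words = len(vocab) + 1  # one extra outcome for unseen words

    def t(prev, tag):
        return log_prob(trans[prev], tag, len(tags), k)

    def e(tag, word):
        return log_prob(emit[tag], word, n_words, k)

    # best[tag] = (log score, tag path) of the best path ending in `tag`
    best = {tag: (t("<s>", tag) + e(tag, words[0]), [tag]) for tag in tags}
    for word in words[1:]:
        step = {}
        for tag in tags:
            score, path = max(
                ((best[p][0] + t(p, tag), best[p][1]) for p in tags),
                key=lambda pair: pair[0],
            )
            step[tag] = (score + e(tag, word), path + [tag])
        best = step
    return max(best.values(), key=lambda pair: pair[0])[1]

# Toy corpus and lexicon, invented for illustration only.
corpus = [
    [("the", "DET"), ("gene", "NOUN"), ("encodes", "VERB"), ("a", "DET"), ("protein", "NOUN")],
    [("the", "DET"), ("drug", "NOUN"), ("inhibits", "VERB"), ("the", "DET"), ("enzyme", "NOUN")],
    [("aspirin", "NOUN"), ("inhibits", "VERB"), ("cyclooxygenase", "NOUN")],
]
lexicon = {"BRCA1": "NOUN", "tamoxifen": "NOUN"}

model = train_hmm(corpus, lexicon_prior=lexicon)
print(viterbi(["tamoxifen", "inhibits", "BRCA1"], model))
```

Neither "tamoxifen" nor "BRCA1" occurs in the toy corpus, so the tagger can only label them as nouns because of the lexicon pseudo-counts. A noun-phrase extractor running on top of such tags would then be able to pick the terms up as candidate concepts.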

Original language: English (US)
Pages (from-to): 402-407
Number of pages: 6
Journal: Bioinformatics
Volume: 19
Issue number: 3
DOI: 10.1093/bioinformatics/btg010
State: Published - Feb 12 2003
Externally published: Yes

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Identification of key concepts in biomedical literature using a modified Markov heuristic. / Majoros, W. H.; Subramanian, G. M.; Yandell, M. D.

In: Bioinformatics, Vol. 19, No. 3, 12.02.2003, p. 402-407.

@article{d8f51979b8964ba380a96c22c6e6580e,
title = "Identification of key concepts in biomedical literature using a modified Markov heuristic",
author = "Majoros, {W. H.} and Subramanian, {G. M.} and Yandell, {M. D.}",
year = "2003",
month = "2",
day = "12",
doi = "10.1093/bioinformatics/btg010",
language = "English (US)",
volume = "19",
pages = "402--407",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "3",

}

TY - JOUR

T1 - Identification of key concepts in biomedical literature using a modified Markov heuristic

AU - Majoros, W. H.

AU - Subramanian, G. M.

AU - Yandell, M. D.

PY - 2003/2/12

Y1 - 2003/2/12

UR - http://www.scopus.com/inward/record.url?scp=0037433048&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0037433048&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btg010

DO - 10.1093/bioinformatics/btg010

M3 - Article

C2 - 12584127

AN - SCOPUS:0037433048

VL - 19

SP - 402

EP - 407

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 3

ER -