Domain-specific language models and lexicons for tagging

Anni R. Coden, Serguei V. Pakhomov, Rie K. Ando, Patrick H. Duffy, Christopher G. Chute

Research output: Contribution to journal › Article

Abstract

Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a quite small domain-specific corpus to a large general-English one boosts performance to over 92% accuracy from 87% in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy results on a new domain at a relatively small cost.
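The abstract does not say which characteristics the authors use to compare a training corpus with test data. As a purely illustrative sketch, one commonly used measure of lexical mismatch is the out-of-vocabulary (OOV) rate: the share of test tokens that never occur in the training corpus. The function name `oov_rate` and the toy sentences below are assumptions made for this example and are not taken from the article.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose lowercased form never occurs in training.

    Illustrative only: the article's own similarity characteristics
    are not specified in the abstract.
    """
    if not test_tokens:
        return 0.0
    vocab = {t.lower() for t in train_tokens}
    unseen = sum(1 for t in test_tokens if t.lower() not in vocab)
    return unseen / len(test_tokens)


# Toy data (hypothetical): a "general English" sample and a clinical sentence.
general = "the patient went to the store and bought some milk".split()
clinical = "the patient denies dyspnea and orthopnea".split()

# OOV rate with general-English training data only.
print(oov_rate(general, clinical))

# OOV rate after adding a few in-domain tokens to the training data.
in_domain = "patient denies dyspnea".split()
print(oov_rate(general + in_domain, clinical))
```

In this toy case the OOV rate falls once a handful of in-domain tokens join the training data, which mirrors in miniature the abstract's point that a small domain-specific corpus can narrow the gap between training and test data.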

Original language: English (US)
Pages (from-to): 422-430
Number of pages: 9
Journal: Journal of Biomedical Informatics
Volume: 38
Issue number: 6
State: Published - Dec 1 2005
Externally published: Yes

Keywords

  • Biomedical domain
  • Clinical information systems
  • Clinical report analysis
  • Corpus linguistics
  • Domain adaptation
  • Hidden Markov Model
  • Part-of-speech tagging accuracy
  • Statistical part-of-speech tagging

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics
