Domain-specific language models and lexicons for tagging

Anni R. Coden, Serguei V. Pakhomov, Rie K. Ando, Patrick H. Duffy, Christopher Chute

Research output: Contribution to journal (Article)

Abstract

Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a quite small domain-specific corpus to a large general-English one boosts performance to over 92% accuracy from 87% in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy results on a new domain at a relatively small cost.
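
As a rough illustration of the kind of characteristic that could quantify how similar a training corpus is to the test data, the sketch below computes an unknown-word rate (the share of test tokens never seen in training) alongside plain token-level tagging accuracy. The abstract does not say which characteristics the authors use, so this measure, the function names, and the toy token lists are all assumptions made for illustration only.

```python
def unknown_word_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose lowercased form never occurs in training.

    Hypothetical similarity characteristic; not taken from the paper.
    """
    vocabulary = {token.lower() for token in train_tokens}
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token.lower() not in vocabulary)
    return unseen / len(test_tokens)


def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold-standard tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tag sequences differ in length")
    if not gold_tags:
        return 0.0
    correct = sum(1 for gold, pred in zip(gold_tags, predicted_tags) if gold == pred)
    return correct / len(gold_tags)


# Toy data (invented): a general-English sample, a small clinical sample,
# and a clinical test sentence.
general_tokens = ["the", "man", "was", "seen", "in", "the", "park"]
clinical_tokens = ["patient", "denies", "dyspnea", "and", "chest", "pain"]
test_tokens = ["the", "patient", "denies", "chest", "pain", "or", "dyspnea"]

rate_general_only = unknown_word_rate(general_tokens, test_tokens)
rate_combined = unknown_word_rate(general_tokens + clinical_tokens, test_tokens)

print(f"unknown-word rate, general corpus only: {rate_general_only:.3f}")
print(f"unknown-word rate, general + clinical:  {rate_combined:.3f}")
```

On this toy data the unknown-word rate falls from 6/7 to 1/7 once the small clinical sample is added, which mirrors in miniature why a small in-domain corpus might help a tagger trained mostly on general English.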

Original language: English (US)
Pages (from-to): 422-430
Number of pages: 9
Journal: Journal of Biomedical Informatics
Volume: 38
Issue number: 6
DOI: 10.1016/j.jbi.2005.02.009
State: Published - Dec 2005
Externally published: Yes

Keywords

  • Biomedical domain
  • Clinical information systems
  • Clinical report analysis
  • Corpus linguistics
  • Domain adaptation
  • Hidden Markov Model
  • Part-of-speech tagging accuracy
  • Statistical part-of-speech tagging

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics
  • Computer Science (miscellaneous)
  • Catalysis

Cite this

Domain-specific language models and lexicons for tagging. / Coden, Anni R.; Pakhomov, Serguei V.; Ando, Rie K.; Duffy, Patrick H.; Chute, Christopher.

In: Journal of Biomedical Informatics, Vol. 38, No. 6, 12.2005, p. 422-430.

@article{614cfb2bbcda4511b3e7b67806602b98,
title = "Domain-specific language models and lexicons for tagging",
abstract = "Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a quite small domain-specific corpus to a large general-English one boosts performance to over 92{\%} accuracy from 87{\%} in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy results on a new domain at a relatively small cost.",
keywords = "Biomedical domain, Clinical information systems, Clinical report analysis, Corpus linguistics, Domain adaptation, Hidden Markov Model, Part-of-speech tagging accuracy, Statistical part-of-speech tagging",
author = "Coden, {Anni R.} and Pakhomov, {Serguei V.} and Ando, {Rie K.} and Duffy, {Patrick H.} and Chute, Christopher",
year = "2005",
month = "12",
doi = "10.1016/j.jbi.2005.02.009",
language = "English (US)",
volume = "38",
pages = "422--430",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",
number = "6",
}

TY  - JOUR
T1  - Domain-specific language models and lexicons for tagging
AU  - Coden, Anni R.
AU  - Pakhomov, Serguei V.
AU  - Ando, Rie K.
AU  - Duffy, Patrick H.
AU  - Chute, Christopher
PY  - 2005/12
Y1  - 2005/12
AB  - Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a quite small domain-specific corpus to a large general-English one boosts performance to over 92% accuracy from 87% in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy results on a new domain at a relatively small cost.
KW  - Biomedical domain
KW  - Clinical information systems
KW  - Clinical report analysis
KW  - Corpus linguistics
KW  - Domain adaptation
KW  - Hidden Markov Model
KW  - Part-of-speech tagging accuracy
KW  - Statistical part-of-speech tagging
UR  - http://www.scopus.com/inward/record.url?scp=28744437703&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=28744437703&partnerID=8YFLogxK
U2  - 10.1016/j.jbi.2005.02.009
DO  - 10.1016/j.jbi.2005.02.009
M3  - Article
VL  - 38
SP  - 422
EP  - 430
JO  - Journal of Biomedical Informatics
JF  - Journal of Biomedical Informatics
SN  - 1532-0464
IS  - 6
ER  -