Developing a corpus of clinical notes manually annotated for part-of-speech

Serguei V. Pakhomov; Anni Coden; Christopher G. Chute

doi:10.1016/j.ijmedinf.2005.08.006

Research output: Contribution to journal › Article › peer-review

28 Scopus citations

Abstract

Purpose: This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation.

Methods: Three domain experts were trained to perform manual annotation of a corpus of clinical notes. A part of this corpus was combined with the Penn Treebank corpus of general-purpose English text and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation.

Results: We used the Trigrams'n'Tags (TnT) [T. Brants, TnT - a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000] tagger trained on general English data to achieve 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69%. Furthermore, we find that discriminating between different types of discourse represented by different sections of clinical text may be very beneficial to improve correctness of POS tagging.

Conclusion: Our preliminary experimental results indicate the necessity for adapting state-of-the-art POS taggers to the sublanguage domain of clinical text.
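As a rough illustration of how the two reported figures relate, the sketch below assumes that "correctness" means token-level accuracy (the abstract does not define the term) and computes the relative error reduction implied by moving from 89.79% to 94.69%. The function names and the toy tag sequences are hypothetical and are not taken from the paper.

```python
def tagging_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def relative_error_reduction(baseline_acc, adapted_acc):
    """Return the share of baseline errors removed by the adapted tagger."""
    baseline_err = 1.0 - baseline_acc
    adapted_err = 1.0 - adapted_acc
    return (baseline_err - adapted_err) / baseline_err


# Hypothetical four-token example: one of four tags is wrong.
gold_tags = ["NN", "VBZ", "JJ", "NN"]
predicted_tags = ["NN", "VBZ", "NN", "NN"]
print(tagging_accuracy(gold_tags, predicted_tags))

# Figures reported in the abstract: 89.79% (general English) vs 94.69% (with clinical data).
print(round(relative_error_reduction(0.8979, 0.9469), 3))
```

Under that assumption, training on the clinical data removes roughly 48% of the errors made by the tagger trained on general English alone.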

Original language	English (US)
Pages (from-to)	418-429
Number of pages	12
Journal	International Journal of Medical Informatics
Volume	75
Issue number	6
DOIs	https://doi.org/10.1016/j.ijmedinf.2005.08.006
State	Published - Jun 2006
Externally published	Yes

Keywords

Domain adaptation
Manual text annotation
Medical domain
Natural language processing
Statistical part-of-speech tagging
Text analysis

ASJC Scopus subject areas

Health Informatics

Cite this

@article{36b8c664dbba46a588b3c3e6e99f6b02,
  title = "Developing a corpus of clinical notes manually annotated for part-of-speech",
  abstract = "Purpose: This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation. Methods: Three domain experts were trained to perform manual annotation of a corpus of clinical notes. A part of this corpus was combined with the Penn Treebank corpus of general-purpose English text and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation. Results: We used the Trigrams'n'Tags (TnT) [T. Brants, TnT - a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000] tagger trained on general English data to achieve 89.79\% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69\%. Furthermore, we find that discriminating between different types of discourse represented by different sections of clinical text may be very beneficial to improve correctness of POS tagging. Conclusion: Our preliminary experimental results indicate the necessity for adapting state-of-the-art POS taggers to the sublanguage domain of clinical text.",
  keywords = "Domain adaptation, Manual text annotation, Medical domain, Natural language processing, Statistical part-of-speech tagging, Text analysis",
  author = "Pakhomov, Serguei V. and Coden, Anni and Chute, Christopher G.",
  note = "Funding Information: We would like to thank our medical index experts Barbara Abbot, Pauline Funk and Debora Albrecht for their persistent efforts in the difficult task of corpus annotation. This work was done in part under the NLM Training grant (\# T15 LM07041-19).",
  year = "2006",
  month = jun,
  doi = "10.1016/j.ijmedinf.2005.08.006",
  language = "English (US)",
  volume = "75",
  pages = "418--429",
  journal = "International Journal of Medical Informatics",
  issn = "1386-5056",
  publisher = "Elsevier Ireland Ltd",
  number = "6",
}

TY  - JOUR
T1  - Developing a corpus of clinical notes manually annotated for part-of-speech
AU  - Pakhomov, Serguei V.
AU  - Coden, Anni
AU  - Chute, Christopher G.
N1  - Funding Information: We would like to thank our medical index experts Barbara Abbot, Pauline Funk and Debora Albrecht for their persistent efforts in the difficult task of corpus annotation. This work was done in part under the NLM Training grant (# T15 LM07041-19).
PY  - 2006/6
Y1  - 2006/6
AB  - Purpose: This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation. Methods: Three domain experts were trained to perform manual annotation of a corpus of clinical notes. A part of this corpus was combined with the Penn Treebank corpus of general-purpose English text and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation. Results: We used the Trigrams'n'Tags (TnT) [T. Brants, TnT - a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000] tagger trained on general English data to achieve 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69%. Furthermore, we find that discriminating between different types of discourse represented by different sections of clinical text may be very beneficial to improve correctness of POS tagging. Conclusion: Our preliminary experimental results indicate the necessity for adapting state-of-the-art POS taggers to the sublanguage domain of clinical text.
KW  - Domain adaptation
KW  - Manual text annotation
KW  - Medical domain
KW  - Natural language processing
KW  - Statistical part-of-speech tagging
KW  - Text analysis
UR  - http://www.scopus.com/inward/record.url?scp=33646159049&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=33646159049&partnerID=8YFLogxK
U2  - 10.1016/j.ijmedinf.2005.08.006
DO  - 10.1016/j.ijmedinf.2005.08.006
M3  - Article
C2  - 16169769
AN  - SCOPUS:33646159049
SN  - 1386-5056
VL  - 75
SP  - 418
EP  - 429
JO  - International Journal of Medical Informatics
JF  - International Journal of Medical Informatics
IS  - 6
ER  -
