Information extraction for clinical data mining: A mammography case study

Houssam Nassif, Ryan Woods, Elizabeth Burnside, Mehmet Ayvaci, Jude Shavlik, David Page

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

38 Scopus citations

Abstract

Breast cancer is the leading cause of cancer mortality in women between the ages of 15 and 54. During mammography screening, radiologists use a strict lexicon (BI-RADS) to describe and report their findings. Mammography records are then stored in a well-defined database format (NMD). Recently, researchers have applied data mining and machine learning techniques to these databases and built breast cancer classifiers that can help in the early detection of malignancy. However, the validity of these models depends on the quality of the underlying databases. Unfortunately, most databases suffer from inconsistencies, missing data, inter-observer variability and inappropriate term usage. In addition, many databases are not compliant with the NMD format, consist solely of text reports, or both. Extracting BI-RADS features from free text, and checking recorded predictive variables against the text reports for consistency, are crucial to addressing this problem. We describe a general scheme for concept information retrieval from free text given a lexicon, and present a BI-RADS feature extraction algorithm for clinical data mining. The algorithm consists of a syntax analyzer, a concept finder and a negation detector. The syntax analyzer preprocesses the input into individual sentences. The concept finder uses a semantic grammar, based on the BI-RADS lexicon and expert input, to parse each sentence and detect BI-RADS concepts. Once a concept is located, a lexical scanner checks for negation. Our method can handle multiple latent concepts within the text and filters out ultrasound concepts. On our dataset, our algorithm achieves 97.7% precision, 95.5% recall and an F1-score of 0.97. It outperforms manual feature extraction at the 5% statistical significance level.
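The three-stage pipeline named in the abstract (syntax analyzer, concept finder, negation detector) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the semantic grammar, the negation rules, or the ultrasound filter, so the tiny `LEXICON`, the `NEGATION_CUES` pattern, the keyword-based ultrasound check, and all function names below are hypothetical stand-ins built on plain regular expressions.

```python
import re

# Hypothetical mini-lexicon (concept name -> regex). The actual BI-RADS
# semantic grammar used in the paper is not given in the abstract.
LEXICON = {
    "mass": r"\bmass(?:es)?\b",
    "calcification": r"\b(?:micro)?calcifications?\b",
    "architectural_distortion": r"\barchitectural distortion\b",
}

# Assumed negation cues; the paper's lexical scanner may use different rules.
NEGATION_CUES = r"\b(?:no|not|without|negative for|absence of)\b"


def split_sentences(report):
    """Syntax analyzer stand-in: split the report after ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [p for p in parts if p]


def find_concepts(sentence):
    """Concept finder stand-in: return (concept, start_offset) pairs."""
    hits = []
    for concept, pattern in LEXICON.items():
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            hits.append((concept, match.start()))
    return hits


def is_negated(sentence, start):
    """Negation detector stand-in: a cue anywhere before the concept,
    within the same sentence, marks the concept as negated."""
    prefix = sentence[:start]
    return re.search(NEGATION_CUES, prefix, flags=re.IGNORECASE) is not None


def extract_features(report):
    """Map each detected concept to True (asserted) or False (only negated)."""
    features = {}
    for sentence in split_sentences(report):
        # Crude stand-in for filtering out ultrasound concepts:
        # skip any sentence that mentions ultrasound.
        if re.search(r"\bultrasound\b", sentence, flags=re.IGNORECASE):
            continue
        for concept, start in find_concepts(sentence):
            present = not is_negated(sentence, start)
            features[concept] = features.get(concept, False) or present
    return features


report = (
    "There is a spiculated mass in the left breast. "
    "No suspicious calcifications are seen. "
    "Ultrasound shows a mass."
)
print(extract_features(report))
```

Handling each sentence separately keeps a negation cue from leaking into a neighboring sentence, which is one plausible reason for running sentence splitting before concept detection. A real system would need a much richer grammar and negation scope rules than this sketch provides.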

Original language: English (US)
Title of host publication: ICDM Workshops 2009 - IEEE International Conference on Data Mining
Pages: 37-42
Number of pages: 6
State: Published - 2009
Externally published: Yes
Event: 2009 IEEE International Conference on Data Mining Workshops, ICDMW 2009 - Miami, FL, United States
Duration: Dec 6 2009 → Dec 6 2009

Publication series

Name: ICDM Workshops 2009 - IEEE International Conference on Data Mining

Other

Other: 2009 IEEE International Conference on Data Mining Workshops, ICDMW 2009
Country/Territory: United States
City: Miami, FL
Period: 12/6/09 → 12/6/09

Keywords

  • BI-RADS
  • Clinical data mining
  • Free text
  • Lexicon
  • Mammography

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Vision and Pattern Recognition
  • Software
