Identifying similar cases in document networks using cross-reference structures

Taxiarchis Botsis; John Scott; Emily Jane Woo; Robert Ball

doi:10.1109/JBHI.2014.2345873

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Our objective was to explore the creation of document networks based on different thresholds of shared information and different clustering algorithms on those networks to identify document clusters describing similar clinical cases. We created networks from vaccine adverse event report sets using seven approaches for linking reports. We then applied three clustering algorithms [visualization of similarities (VOS), Louvain, k-means] to these networks and evaluated their ability to identify known clusters. The report sets included one simulated set and three sets from the Vaccine Adverse Event Reporting System; each was split into training and testing subsets. Training subsets were used to estimate parameter values for the clustering algorithms and testing subsets to evaluate clusters. We created the networks by linking reports based on shared information in the form either of individual Medical Dictionary for Regulatory Activities Preferred Terms (PTs) or of dyads, triplets, quadruplets, quintuplets, and sextuplets of PTs; we created another network by weighting the single PT network connections by Lin's information theoretic approach to similarity. We then repeated this entire process using networks based on text mining output rather than structured data. We evaluated report clustering using recall, precision, and f-measure. The VOS algorithm outperformed Louvain and k-means in general. The best weighting scheme appeared to be related to the complexity of the known cluster. For example, singleton weighting performed best for an intussusception cluster driven by a single PT. We observed marginal differences between the code- and textual-based clustering. In conclusion, our approach supported identification of similar nodes in a document network.
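The linking and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The report IDs, the PT sets, the function names `build_network` and `precision_recall_f`, and the choice to use the number of shared PT combinations as the edge weight are all assumptions made for illustration. The clustering step (VOS, Louvain, k-means) and Lin's similarity weighting are omitted.

```python
from itertools import combinations
from math import comb


def build_network(reports, k=1):
    """Link each pair of reports that shares at least one k-tuple of PTs.

    reports: dict mapping a report ID to its set of PTs.
    k: size of the shared PT combination (1 = single PT, 2 = dyad, ...).
    Returns a dict mapping (id_a, id_b) to the number of shared k-tuples.
    Using that count as the edge weight is an assumption of this sketch.
    """
    edges = {}
    for a, b in combinations(sorted(reports), 2):
        shared = len(reports[a] & reports[b])
        weight = comb(shared, k)  # number of k-subsets of the shared PTs
        if weight > 0:
            edges[(a, b)] = weight
    return edges


def precision_recall_f(predicted, known):
    """Score one predicted cluster against one known cluster (sets of IDs)."""
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical reports; the PTs are illustrative, not taken from the study.
reports = {
    "r1": {"Intussusception", "Vomiting", "Diarrhoea"},
    "r2": {"Intussusception", "Vomiting", "Pyrexia"},
    "r3": {"Pyrexia", "Headache"},
    "r4": {"Intussusception", "Vomiting", "Diarrhoea", "Pyrexia"},
}

single_pt_network = build_network(reports, k=1)
dyad_network = build_network(reports, k=2)
scores = precision_recall_f({"r1", "r2", "r3"}, {"r1", "r2", "r4"})
```

In this toy data, raising `k` from 1 to 2 removes the links to `r3`, which shares only a single PT with the other reports. This mirrors the idea of varying the threshold of shared information.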

Original language	English (US)
Article number	6873230
Pages (from-to)	1906-1917
Number of pages	12
Journal	IEEE Journal of Biomedical and Health Informatics
Volume	19
Issue number	6
DOIs	https://doi.org/10.1109/JBHI.2014.2345873
State	Published - Nov 1 2015
Externally published	Yes

ASJC Scopus subject areas

Health Information Management
Health Informatics
Electrical and Electronic Engineering
Computer Science Applications

Cite this

@article{673c0c22013e4af08ac675154e0d874a,

title = "Identifying similar cases in document networks using cross-reference structures",

abstract = "Our objective was to explore the creation of document networks based on different thresholds of shared information and different clustering algorithms on those networks to identify document clusters describing similar clinical cases. We created networks from vaccine adverse event report sets using seven approaches for linking reports. We then applied three clustering algorithms [visualization of similarities (VOS), Louvain, k-means] to these networks and evaluated their ability to identify known clusters. The report sets included one simulated set and three sets from the Vaccine Adverse Event Reporting System; each was split into training and testing subsets. Training subsets were used to estimate parameter values for the clustering algorithms and testing subsets to evaluate clusters. We created the networks by linking reports based on shared information in the form either of individual Medical Dictionary for Regulatory Activities Preferred Terms (PTs) or of dyads, triplets, quadruplets, quintuplets, and sextuplets of PTs; we created another network by weighting the single PT network connections by Lin's information theoretic approach to similarity. We then repeated this entire process using networks based on text mining output rather than structured data. We evaluated report clustering using recall, precision, and f-measure. The VOS algorithm outperformed Louvain and k-means in general. The best weighting scheme appeared to be related to the complexity of the known cluster. For example, singleton weighting performed best for an intussusception cluster driven by a single PT. We observed marginal differences between the code- and textual-based clustering. In conclusion, our approach supported identification of similar nodes in a document network.",

author = "Taxiarchis Botsis and John Scott and Woo, {Emily Jane} and Robert Ball",

note = "Publisher Copyright: {\textcopyright} 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.",

year = "2015",

month = nov,

day = "1",

doi = "10.1109/JBHI.2014.2345873",

language = "English (US)",

volume = "19",

pages = "1906--1917",

journal = "IEEE Journal of Biomedical and Health Informatics",

issn = "2168-2194",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "6",

}

TY - JOUR

T1 - Identifying similar cases in document networks using cross-reference structures

AU - Botsis, Taxiarchis

AU - Scott, John

AU - Woo, Emily Jane

AU - Ball, Robert

PY - 2015/11/1

Y1 - 2015/11/1

N2 - Our objective was to explore the creation of document networks based on different thresholds of shared information and different clustering algorithms on those networks to identify document clusters describing similar clinical cases. We created networks from vaccine adverse event report sets using seven approaches for linking reports. We then applied three clustering algorithms [visualization of similarities (VOS), Louvain, k-means] to these networks and evaluated their ability to identify known clusters. The report sets included one simulated set and three sets from the Vaccine Adverse Event Reporting System; each was split into training and testing subsets. Training subsets were used to estimate parameter values for the clustering algorithms and testing subsets to evaluate clusters. We created the networks by linking reports based on shared information in the form either of individual Medical Dictionary for Regulatory Activities Preferred Terms (PTs) or of dyads, triplets, quadruplets, quintuplets, and sextuplets of PTs; we created another network by weighting the single PT network connections by Lin's information theoretic approach to similarity. We then repeated this entire process using networks based on text mining output rather than structured data. We evaluated report clustering using recall, precision, and f-measure. The VOS algorithm outperformed Louvain and k-means in general. The best weighting scheme appeared to be related to the complexity of the known cluster. For example, singleton weighting performed best for an intussusception cluster driven by a single PT. We observed marginal differences between the code- and textual-based clustering. In conclusion, our approach supported identification of similar nodes in a document network.

AB - Our objective was to explore the creation of document networks based on different thresholds of shared information and different clustering algorithms on those networks to identify document clusters describing similar clinical cases. We created networks from vaccine adverse event report sets using seven approaches for linking reports. We then applied three clustering algorithms [visualization of similarities (VOS), Louvain, k-means] to these networks and evaluated their ability to identify known clusters. The report sets included one simulated set and three sets from the Vaccine Adverse Event Reporting System; each was split into training and testing subsets. Training subsets were used to estimate parameter values for the clustering algorithms and testing subsets to evaluate clusters. We created the networks by linking reports based on shared information in the form either of individual Medical Dictionary for Regulatory Activities Preferred Terms (PTs) or of dyads, triplets, quadruplets, quintuplets, and sextuplets of PTs; we created another network by weighting the single PT network connections by Lin's information theoretic approach to similarity. We then repeated this entire process using networks based on text mining output rather than structured data. We evaluated report clustering using recall, precision, and f-measure. The VOS algorithm outperformed Louvain and k-means in general. The best weighting scheme appeared to be related to the complexity of the known cluster. For example, singleton weighting performed best for an intussusception cluster driven by a single PT. We observed marginal differences between the code- and textual-based clustering. In conclusion, our approach supported identification of similar nodes in a document network.

UR - http://www.scopus.com/inward/record.url?scp=84959223827&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959223827&partnerID=8YFLogxK

U2 - 10.1109/JBHI.2014.2345873

DO - 10.1109/JBHI.2014.2345873

M3 - Article

C2 - 25122604

AN - SCOPUS:84959223827

SN - 2168-2194

VL - 19

SP - 1906

EP - 1917

JO - IEEE Journal of Biomedical and Health Informatics

JF - IEEE Journal of Biomedical and Health Informatics

IS - 6

M1 - 6873230

ER -
