A study of crowdsourced segment-level surgical skill assessment using pairwise rankings

Anand Malpani; S. Swaroop Vedula; Chi Chiung Grace Chen; Gregory D. Hager

doi:10.1007/s11548-015-1238-6

Research output: Contribution to journal › Article › peer-review

23 Scopus citations

Abstract

Purpose: Currently available methods for surgical skills assessment are either subjective or only provide global evaluations for the overall task. Such global evaluations do not inform trainees about where in the task they need to perform better. In this study, we investigated the reliability and validity of a framework to generate objective skill assessments for segments within a task, and compared assessments from our framework using crowdsourced segment ratings from surgically untrained individuals and expert surgeons against manually assigned global rating scores.

Methods: Our framework includes (1) a binary classifier trained to generate preferences for pairs of task segments (i.e., given a pair of segments, specification of which one was performed better), (2) computing segment-level percentile scores based on the preferences, and (3) predicting task-level scores using the segment-level scores. We conducted a crowdsourcing user study to obtain manual preferences for segments within a suturing and knot-tying task from a crowd of surgically untrained individuals and a group of experts. We analyzed the inter-rater reliability of preferences obtained from the crowd and experts, and investigated the validity of task-level scores obtained using our framework. In addition, we compared accuracy of the crowd and expert preference classifiers, as well as the segment- and task-level scores obtained from the classifiers.

Results: We observed moderate inter-rater reliability within the crowd (Fleiss’ kappa, $\kappa = 0.41$) and experts ($\kappa = 0.55$). For both the crowd and experts, the accuracy of an automated classifier trained using all the task segments was above par as compared to the inter-rater agreement [crowd classifier 85 % (SE 2 %), expert classifier 89 % (SE 3 %)]. We predicted the overall global rating scores (GRS) for the task with a root-mean-squared error that was lower than one standard deviation of the ground-truth GRS. We observed a high correlation between segment-level scores ($\rho \ge 0.86$) obtained using the crowd and expert preference classifiers. The task-level scores obtained using the crowd and expert preference classifier were also highly correlated with each other ($\rho \ge 0.84$), and statistically equivalent within a margin of two points (for a score ranging from 6 to 30). Our analyses, however, did not demonstrate statistical significance in equivalence of accuracy between the crowd and expert classifiers within a 10 % margin.

Conclusions: Our framework implemented using crowdsourced pairwise comparisons leads to valid objective surgical skill assessment for segments within a task, and for the task overall. Crowdsourcing yields reliable pairwise comparisons of skill for segments within a task with high efficiency. Our framework may be deployed within surgical training programs for objective, automated, and standardized evaluation of technical skills.
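
Illustrative sketch (not the authors' implementation): the abstract does not say how the segment-level percentile scores are derived from the pairwise preferences, nor which software computed Fleiss' kappa. The Python below assumes the simplest reading, in which a segment's score is the percentage of its comparisons that it won, and pairs it with the standard Fleiss' kappa formula for a table of per-item rating counts. The function names and input layouts are hypothetical.

```python
from collections import defaultdict


def percentile_scores(preferences):
    """Score each segment by the percentage of its comparisons that it won.

    `preferences` is a list of (preferred, other) segment-id pairs.
    Assumption: win rate stands in for the paper's percentile score.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for preferred, other in preferences:
        wins[preferred] += 1
        total[preferred] += 1
        total[other] += 1
    return {seg: 100.0 * wins[seg] / total[seg] for seg in total}


def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    `counts[i][j]` is the number of raters who put item i in category j.
    Every item must have the same number of ratings (at least two), and
    the ratings must not all fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    n_total = n_items * n_raters

    # Share of all ratings that fell in each category.
    p_cat = [sum(row[j] for row in counts) / n_total
             for j in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1))
              for row in counts]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1.0 - p_expected)
```

For a pairwise-preference study, each item in `counts` would be one pair of segments and the two categories would be "first segment preferred" and "second segment preferred"; the values reported above ($\kappa = 0.41$ for the crowd, $\kappa = 0.55$ for experts) are the kind of output such a calculation produces.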

Original language	English (US)
Pages (from-to)	1435-1447
Number of pages	13
Journal	International Journal of Computer Assisted Radiology and Surgery
Volume	10
Issue number	9
DOIs	https://doi.org/10.1007/s11548-015-1238-6
State	Published - Sep 13 2015

Keywords

Activity segments
Crowdsourcing
Feedback
Pairwise comparisons
Robotic surgery
Skill assessment
Task decomposition
Task flow
Training

ASJC Scopus subject areas

Surgery
Biomedical Engineering
Radiology, Nuclear Medicine and Imaging
Computer Vision and Pattern Recognition
Computer Science Applications
Health Informatics
Computer Graphics and Computer-Aided Design

Access to Document

10.1007/s11548-015-1238-6

Cite this

@article{e4723746b86640b9980ea9d7b40f9911,

title = "A study of crowdsourced segment-level surgical skill assessment using pairwise rankings",

abstract = "Purpose: Currently available methods for surgical skills assessment are either subjective or only provide global evaluations for the overall task. Such global evaluations do not inform trainees about where in the task they need to perform better. In this study, we investigated the reliability and validity of a framework to generate objective skill assessments for segments within a task, and compared assessments from our framework using crowdsourced segment ratings from surgically untrained individuals and expert surgeons against manually assigned global rating scores. Methods: Our framework includes (1) a binary classifier trained to generate preferences for pairs of task segments (i.e., given a pair of segments, specification of which one was performed better), (2) computing segment-level percentile scores based on the preferences, and (3) predicting task-level scores using the segment-level scores. We conducted a crowdsourcing user study to obtain manual preferences for segments within a suturing and knot-tying task from a crowd of surgically untrained individuals and a group of experts. We analyzed the inter-rater reliability of preferences obtained from the crowd and experts, and investigated the validity of task-level scores obtained using our framework. In addition, we compared accuracy of the crowd and expert preference classifiers, as well as the segment- and task-level scores obtained from the classifiers. Results: We observed moderate inter-rater reliability within the crowd (Fleiss{\textquoteright} kappa, $\kappa = 0.41$) and experts ($\kappa = 0.55$). For both the crowd and experts, the accuracy of an automated classifier trained using all the task segments was above par as compared to the inter-rater agreement [crowd classifier 85 % (SE 2 %), expert classifier 89 % (SE 3 %)]. We predicted the overall global rating scores (GRS) for the task with a root-mean-squared error that was lower than one standard deviation of the ground-truth GRS. We observed a high correlation between segment-level scores ($\rho \ge 0.86$) obtained using the crowd and expert preference classifiers. The task-level scores obtained using the crowd and expert preference classifier were also highly correlated with each other ($\rho \ge 0.84$), and statistically equivalent within a margin of two points (for a score ranging from 6 to 30). Our analyses, however, did not demonstrate statistical significance in equivalence of accuracy between the crowd and expert classifiers within a 10 % margin. Conclusions: Our framework implemented using crowdsourced pairwise comparisons leads to valid objective surgical skill assessment for segments within a task, and for the task overall. Crowdsourcing yields reliable pairwise comparisons of skill for segments within a task with high efficiency. Our framework may be deployed within surgical training programs for objective, automated, and standardized evaluation of technical skills.",

keywords = "Activity segments, Crowdsourcing, Feedback, Pairwise comparisons, Robotic surgery, Skill assessment, Task decomposition, Task flow, Training",

author = "Anand Malpani and Vedula, {S. Swaroop} and Chen, {Chi Chiung Grace} and Hager, {Gregory D.}",

note = "Funding Information: We acknowledge all participants in our crowdsourcing user study, and Intuitive Surgical, Inc., for facilitating capture of data from the dVSS. A combined effort from the Language of Surgery project team led to the development of the manual task segmentation. The Johns Hopkins Science of Learning Institute and internal funding from the Johns Hopkins University supported this work. Publisher Copyright: {\textcopyright} 2015, CARS.",

year = "2015",

month = sep,

day = "13",

doi = "10.1007/s11548-015-1238-6",

language = "English (US)",

volume = "10",

pages = "1435--1447",

journal = "International Journal of Computer Assisted Radiology and Surgery",

issn = "1861-6410",

publisher = "Springer Verlag",

number = "9",

}

TY - JOUR

T1 - A study of crowdsourced segment-level surgical skill assessment using pairwise rankings

AU - Malpani, Anand

AU - Vedula, S. Swaroop

AU - Chen, Chi Chiung Grace

AU - Hager, Gregory D.

N1 - Funding Information: We acknowledge all participants in our crowdsourcing user study, and Intuitive Surgical, Inc., for facilitating capture of data from the dVSS. A combined effort from the Language of Surgery project team led to the development of the manual task segmentation. The Johns Hopkins Science of Learning Institute and internal funding from the Johns Hopkins University supported this work. Publisher Copyright: © 2015, CARS.

PY - 2015/9/13

Y1 - 2015/9/13

AB - Purpose: Currently available methods for surgical skills assessment are either subjective or only provide global evaluations for the overall task. Such global evaluations do not inform trainees about where in the task they need to perform better. In this study, we investigated the reliability and validity of a framework to generate objective skill assessments for segments within a task, and compared assessments from our framework using crowdsourced segment ratings from surgically untrained individuals and expert surgeons against manually assigned global rating scores. Methods: Our framework includes (1) a binary classifier trained to generate preferences for pairs of task segments (i.e., given a pair of segments, specification of which one was performed better), (2) computing segment-level percentile scores based on the preferences, and (3) predicting task-level scores using the segment-level scores. We conducted a crowdsourcing user study to obtain manual preferences for segments within a suturing and knot-tying task from a crowd of surgically untrained individuals and a group of experts. We analyzed the inter-rater reliability of preferences obtained from the crowd and experts, and investigated the validity of task-level scores obtained using our framework. In addition, we compared accuracy of the crowd and expert preference classifiers, as well as the segment- and task-level scores obtained from the classifiers. Results: We observed moderate inter-rater reliability within the crowd (Fleiss’ kappa, $\kappa = 0.41$) and experts ($\kappa = 0.55$). For both the crowd and experts, the accuracy of an automated classifier trained using all the task segments was above par as compared to the inter-rater agreement [crowd classifier 85 % (SE 2 %), expert classifier 89 % (SE 3 %)]. We predicted the overall global rating scores (GRS) for the task with a root-mean-squared error that was lower than one standard deviation of the ground-truth GRS. We observed a high correlation between segment-level scores ($\rho \ge 0.86$) obtained using the crowd and expert preference classifiers. The task-level scores obtained using the crowd and expert preference classifier were also highly correlated with each other ($\rho \ge 0.84$), and statistically equivalent within a margin of two points (for a score ranging from 6 to 30). Our analyses, however, did not demonstrate statistical significance in equivalence of accuracy between the crowd and expert classifiers within a 10 % margin. Conclusions: Our framework implemented using crowdsourced pairwise comparisons leads to valid objective surgical skill assessment for segments within a task, and for the task overall. Crowdsourcing yields reliable pairwise comparisons of skill for segments within a task with high efficiency. Our framework may be deployed within surgical training programs for objective, automated, and standardized evaluation of technical skills.

KW - Activity segments

KW - Crowdsourcing

KW - Feedback

KW - Pairwise comparisons

KW - Robotic surgery

KW - Skill assessment

KW - Task decomposition

KW - Task flow

KW - Training

UR - http://www.scopus.com/inward/record.url?scp=84941418149&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84941418149&partnerID=8YFLogxK

U2 - 10.1007/s11548-015-1238-6

DO - 10.1007/s11548-015-1238-6

M3 - Article

C2 - 26133652

AN - SCOPUS:84941418149

SN - 1861-6410

VL - 10

SP - 1435

EP - 1447

JO - International Journal of Computer Assisted Radiology and Surgery

JF - International Journal of Computer Assisted Radiology and Surgery

IS - 9

ER -
