A study of crowdsourced segment-level surgical skill assessment using pairwise rankings

Anand Malpani, S. Swaroop Vedula, Chi Chiung Grace Chen, Gregory D. Hager

Research output: Contribution to journal › Article › peer-review


Purpose: Currently available methods for surgical skills assessment are either subjective or only provide global evaluations for the overall task. Such global evaluations do not inform trainees about where in the task they need to perform better. In this study, we investigated the reliability and validity of a framework to generate objective skill assessments for segments within a task, and compared assessments from our framework using crowdsourced segment ratings from surgically untrained individuals and expert surgeons against manually assigned global rating scores.

Methods: Our framework includes (1) a binary classifier trained to generate preferences for pairs of task segments (i.e., given a pair of segments, specification of which one was performed better), (2) computing segment-level percentile scores based on the preferences, and (3) predicting task-level scores using the segment-level scores. We conducted a crowdsourcing user study to obtain manual preferences for segments within a suturing and knot-tying task from a crowd of surgically untrained individuals and a group of experts. We analyzed the inter-rater reliability of preferences obtained from the crowd and experts, and investigated the validity of task-level scores obtained using our framework. In addition, we compared accuracy of the crowd and expert preference classifiers, as well as the segment- and task-level scores obtained from the classifiers.

Results: We observed moderate inter-rater reliability within the crowd (Fleiss’ kappa, κ = 0.41) and experts (κ = 0.55). For both the crowd and experts, the accuracy of an automated classifier trained using all the task segments was above par as compared to the inter-rater agreement [crowd classifier 85 % (SE 2 %), expert classifier 89 % (SE 3 %)]. We predicted the overall global rating scores (GRS) for the task with a root-mean-squared error that was lower than one standard deviation of the ground-truth GRS. We observed a high correlation between segment-level scores (ρ ≥ 0.86) obtained using the crowd and expert preference classifiers. The task-level scores obtained using the crowd and expert preference classifiers were also highly correlated with each other (ρ ≥ 0.84), and statistically equivalent within a margin of two points (for a score ranging from 6 to 30). Our analyses, however, did not demonstrate statistical significance in equivalence of accuracy between the crowd and expert classifiers within a 10 % margin.

Conclusions: Our framework implemented using crowdsourced pairwise comparisons leads to valid objective surgical skill assessment for segments within a task, and for the task overall. Crowdsourcing yields reliable pairwise comparisons of skill for segments within a task with high efficiency. Our framework may be deployed within surgical training programs for objective, automated, and standardized evaluation of technical skills.
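Step (2) of the framework, turning pairwise preferences into segment-level percentile scores, can be illustrated with a minimal Python sketch. The abstract does not give the scoring formula, so this assumes a simple one: a segment's score is the percentage of its comparisons in which it was preferred. The function name `percentile_scores`, the `(winner, loser)` input format, and the segment labels are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def percentile_scores(preferences):
    """Score each segment by the share of its comparisons that it won.

    preferences: iterable of (winner, loser) pairs of segment ids
    (an assumed input format, not the paper's).
    Returns a dict mapping segment id to a score between 0 and 100.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for winner, loser in preferences:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    # Segments that never won still appear in `total`, so they score 0.
    return {seg: 100.0 * wins[seg] / total[seg] for seg in total}


# Hypothetical preferences: segment "a" beats "b" and "c", and "b" beats "c".
prefs = [("a", "b"), ("a", "c"), ("b", "c")]
scores = percentile_scores(prefs)
print(scores)
```

In the framework described above, such preferences would come from the trained binary classifier (or directly from crowd or expert raters), and the resulting segment-level scores would then feed the task-level prediction in step (3).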

Original language: English (US)
Pages (from-to): 1435-1447
Number of pages: 13
Journal: International Journal of Computer Assisted Radiology and Surgery
Issue number: 9
State: Published - Sep 13 2015


Keywords

  • Activity segments
  • Crowdsourcing
  • Feedback
  • Pairwise comparisons
  • Robotic surgery
  • Skill assessment
  • Task decomposition
  • Task flow
  • Training

ASJC Scopus subject areas

  • Surgery
  • Biomedical Engineering
  • Radiology, Nuclear Medicine and Imaging
  • Computer Vision and Pattern Recognition
  • Computer Science Applications
  • Health Informatics
  • Computer Graphics and Computer-Aided Design

