Studies of the size and morphology of anatomical structures rely on accurate and reproducible delineations of those structures, obtained from either human raters or automatic segmentation algorithms. Measures of reproducibility and variability are vital to such studies and are usually estimated from repeated scans or, in the case of human raters, repeated delineations. Methods exist for simultaneously estimating the true structure and rater performance parameters from multiple segmentations, and they have been demonstrated on volumetric images. In this work, we extend these methods to two-dimensional surfaces parameterized as triangle meshes. Label homogeneity is enforced using a Markov random field formulated with an energy that addresses the challenges introduced by the surface parameterization. The method was tested using both simulated raters and cortical gyral labels. Simulated raters were generated using a global error model as well as a novel and more realistic boundary error model. We study the impact of raters and their accuracy under both models, and show how effectively the method estimates the true segmentation on simulated surfaces. The Markov random field formulation is shown to effectively enforce homogeneity for raters suffering from label noise. Our method provides substantial improvements in accuracy over single-atlas methods under all experimental conditions.
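As a rough illustration of jointly estimating a true labeling and rater performance from multiple segmentations, the sketch below fuses per-vertex labels with a generic expectation-maximization scheme built on per-rater confusion matrices. It is not the implementation described here. The function names `simulate_raters` and `fuse_labels` are hypothetical. The global error model is assumed to mean that each vertex label is corrupted independently with a fixed per-rater probability. Vertices are treated as independent, so the mesh connectivity and the Markov random field term are omitted entirely.

```python
import numpy as np

def simulate_raters(truth, n_labels, accuracies, rng):
    """Assumed global error model: each vertex keeps its true label with
    probability p, otherwise it receives a different label chosen uniformly."""
    n = truth.shape[0]
    out = np.empty((len(accuracies), n), dtype=int)
    for j, p in enumerate(accuracies):
        wrong = rng.random(n) >= p
        # An offset in 1..n_labels-1 guarantees the corrupted label differs.
        offsets = rng.integers(1, n_labels, size=n)
        out[j] = np.where(wrong, (truth + offsets) % n_labels, truth)
    return out

def fuse_labels(decisions, n_labels, n_iter=20):
    """EM estimate of per-vertex label posteriors and per-rater confusion
    matrices theta[j, observed, true]. No spatial (MRF) prior is used."""
    n_raters, n = decisions.shape
    # Initialize posteriors W[i, s] from the fraction of raters voting s.
    W = np.zeros((n, n_labels))
    for j in range(n_raters):
        W[np.arange(n), decisions[j]] += 1.0
    W /= n_raters
    for _ in range(n_iter):
        # M-step: label prior and confusion matrices from soft counts.
        prior = W.mean(axis=0)
        theta = np.full((n_raters, n_labels, n_labels), 1e-6)
        for j in range(n_raters):
            np.add.at(theta[j], decisions[j], W)
        theta /= theta.sum(axis=1, keepdims=True)
        # E-step: posterior over the true label at every vertex.
        logW = np.tile(np.log(prior + 1e-12), (n, 1))
        for j in range(n_raters):
            logW += np.log(theta[j][decisions[j], :])
        logW -= logW.max(axis=1, keepdims=True)
        W = np.exp(logW)
        W /= W.sum(axis=1, keepdims=True)
    return W.argmax(axis=1), theta

# Toy example: random labels on 2000 "vertices" (no real mesh is involved).
rng = np.random.default_rng(0)
n_labels = 3
truth = rng.integers(0, n_labels, size=2000)
accuracies = [0.70, 0.75, 0.80, 0.85, 0.90]
raters = simulate_raters(truth, n_labels, accuracies, rng)
fused, theta = fuse_labels(raters, n_labels)

rater_acc = (raters == truth).mean(axis=1)
fused_acc = (fused == truth).mean()
```

Comparing `fused_acc` with the entries of `rater_acc` gives a quick sense of how much the fused estimate gains over individual raters in this toy setting. A boundary error model or a spatial prior would require the mesh adjacency, which this sketch deliberately leaves out.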