Completing the results of the 2013 Boston marathon

Dorit Hammerling; Matthew Cefalu; Jessi Cisewski; Francesca Dominici; Giovanni Parmigiani; Charles Paulson; Richard L. Smith

doi:10.1371/journal.pone.0093800

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

The 2013 Boston marathon was disrupted by two bombs placed near the finish line. The bombs resulted in three deaths and several hundred injuries. Of lesser concern, in the immediate aftermath, was the fact that nearly 6,000 runners failed to finish the race. We were approached by the marathon's organizers, the Boston Athletic Association (BAA), and asked to recommend a procedure for projecting finish times for the runners who could not complete the race. With assistance from the BAA, we created a dataset consisting of all the runners in the 2013 race who reached the halfway point but failed to finish, as well as all runners from the 2010 and 2011 Boston marathons. The data consist of split times from each of the 5 km sections of the course, as well as the final 2.2 km (from 40 km to the finish). The statistical objective is to predict the missing split times for the runners who failed to finish in 2013. We set this problem in the context of the matrix completion problem, examples of which include imputing missing data in DNA microarray experiments, and the Netflix prize problem. We propose five prediction methods and create a validation dataset to measure their performance by mean squared error and other measures. The best method used local regression based on a K-nearest-neighbors algorithm (KNN method), though several other methods produced results of similar quality. We show how the results were used to create projected times for the 2013 runners and discuss potential for future application of the same methodology. We present the whole project as an example of reproducible research, in that we are able to make the full data and all the algorithms we have used publicly available, which may facilitate future research extending the methods or proposing completely different approaches.
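The abstract frames the task as matrix completion: each runner is a row of split times, and the trailing entries are missing for runners who were stopped. As a rough illustration only, the sketch below fills in missing splits by averaging the remaining splits of the k most similar complete runners. This is plain nearest-neighbor averaging, a simpler stand-in for the local regression on K nearest neighbors that the abstract reports as the best method. The function name, the four-split layout, and all split times are made up for the example and are not taken from the paper or its data.

```python
import math


def knn_predict_splits(observed, reference, k=3):
    """Fill in a runner's missing splits from the k most similar complete runners.

    observed  -- split times recorded before the runner stopped (hypothetical data)
    reference -- complete rows of split times, e.g. from earlier races
    """
    m = len(observed)
    if k < 1 or k > len(reference):
        raise ValueError("k must be between 1 and the number of reference runners")

    # Distance uses only the splits the partial runner actually recorded.
    def distance(row):
        return math.dist(observed, row[:m])

    neighbors = sorted(reference, key=distance)[:k]
    n_splits = len(reference[0])

    # Plain averaging over neighbors (an assumption; the paper uses local regression).
    predicted = [sum(row[j] for row in neighbors) / k for j in range(m, n_splits)]
    return list(observed) + predicted


# Made-up split times in minutes, four splits per runner.
reference = [
    [25.0, 25.0, 26.0, 27.0],
    [24.0, 25.0, 25.0, 26.0],
    [30.0, 31.0, 33.0, 35.0],
    [31.0, 32.0, 34.0, 36.0],
]

completed = knn_predict_splits([25.0, 25.0], reference, k=2)
projected_finish = sum(completed)
print(completed, projected_finish)
```

Here the runner's two recorded splits are closest to the first two reference rows, so the missing splits are the averages of those rows' last two entries, and the projected finish time is the sum of recorded and predicted splits.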

Original language	English (US)
Article number	e93800
Journal	PLoS One
Volume	9
Issue number	4
DOIs	https://doi.org/10.1371/journal.pone.0093800
State	Published - Apr 11 2014
Externally published	Yes

ASJC Scopus subject areas

General Agricultural and Biological Sciences
General Biochemistry, Genetics and Molecular Biology
General Medicine

Cite this

@article{d938bbeea6554dd2955c7700964e9de0,
title = "Completing the results of the 2013 Boston marathon",
abstract = "The 2013 Boston marathon was disrupted by two bombs placed near the finish line. The bombs resulted in three deaths and several hundred injuries. Of lesser concern, in the immediate aftermath, was the fact that nearly 6,000 runners failed to finish the race. We were approached by the marathon's organizers, the Boston Athletic Association (BAA), and asked to recommend a procedure for projecting finish times for the runners who could not complete the race. With assistance from the BAA, we created a dataset consisting of all the runners in the 2013 race who reached the halfway point but failed to finish, as well as all runners from the 2010 and 2011 Boston marathons. The data consist of split times from each of the 5 km sections of the course, as well as the final 2.2 km (from 40 km to the finish). The statistical objective is to predict the missing split times for the runners who failed to finish in 2013. We set this problem in the context of the matrix completion problem, examples of which include imputing missing data in DNA microarray experiments, and the Netflix prize problem. We propose five prediction methods and create a validation dataset to measure their performance by mean squared error and other measures. The best method used local regression based on a K-nearest-neighbors algorithm (KNN method), though several other methods produced results of similar quality. We show how the results were used to create projected times for the 2013 runners and discuss potential for future application of the same methodology. We present the whole project as an example of reproducible research, in that we are able to make the full data and all the algorithms we have used publicly available, which may facilitate future research extending the methods or proposing completely different approaches.",

author = "Dorit Hammerling and Matthew Cefalu and Jessi Cisewski and Francesca Dominici and Giovanni Parmigiani and Charles Paulson and Smith, {Richard L.}",
year = "2014",
month = apr,
day = "11",
doi = "10.1371/journal.pone.0093800",
language = "English (US)",
volume = "9",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "4",
}

TY - JOUR
T1 - Completing the results of the 2013 Boston marathon
AU - Hammerling, Dorit
AU - Cefalu, Matthew
AU - Cisewski, Jessi
AU - Dominici, Francesca
AU - Parmigiani, Giovanni
AU - Paulson, Charles
AU - Smith, Richard L.
PY - 2014/4/11
Y1 - 2014/4/11

AB - The 2013 Boston marathon was disrupted by two bombs placed near the finish line. The bombs resulted in three deaths and several hundred injuries. Of lesser concern, in the immediate aftermath, was the fact that nearly 6,000 runners failed to finish the race. We were approached by the marathon's organizers, the Boston Athletic Association (BAA), and asked to recommend a procedure for projecting finish times for the runners who could not complete the race. With assistance from the BAA, we created a dataset consisting of all the runners in the 2013 race who reached the halfway point but failed to finish, as well as all runners from the 2010 and 2011 Boston marathons. The data consist of split times from each of the 5 km sections of the course, as well as the final 2.2 km (from 40 km to the finish). The statistical objective is to predict the missing split times for the runners who failed to finish in 2013. We set this problem in the context of the matrix completion problem, examples of which include imputing missing data in DNA microarray experiments, and the Netflix prize problem. We propose five prediction methods and create a validation dataset to measure their performance by mean squared error and other measures. The best method used local regression based on a K-nearest-neighbors algorithm (KNN method), though several other methods produced results of similar quality. We show how the results were used to create projected times for the 2013 runners and discuss potential for future application of the same methodology. We present the whole project as an example of reproducible research, in that we are able to make the full data and all the algorithms we have used publicly available, which may facilitate future research extending the methods or proposing completely different approaches.

UR - http://www.scopus.com/inward/record.url?scp=84899626354&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84899626354&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0093800
DO - 10.1371/journal.pone.0093800
M3 - Article
C2 - 24727904
AN - SCOPUS:84899626354
SN - 1932-6203
VL - 9
JO - PLoS One
JF - PLoS One
IS - 4
M1 - e93800
ER -
