Using machine learning to optimize the quality of survey data: Protocol for a use case in India

Neha Shah; Diwakar Mohan; Jean Juste Harisson Bashingwa; Osama Ummer; Arpita Chakraborty; Amnesty E. LeFevre

doi:10.2196/17619

Bloomberg School of Public Health

Research output: Contribution to journal › Review article › peer-review

Abstract

Background: Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally used quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly in light of the increasing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability coefficients for test-retest interviews. Machine learning is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in data collection that could compromise quality.

Objective: This study aimed to describe methods for the quality assessment of a household survey using both traditional methods and machine learning analytics.

Methods: In the Kilkari impact evaluation’s end-line survey among postpartum women (n=5095) in Madhya Pradesh, India, we plan to use both traditional and machine learning–based quality assurance procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The quality assurance strategy aims to identify biases and other impediments to data quality and includes seven main components: (1) tool development, (2) enumerator recruitment and training, (3) field coordination, (4) field monitoring, (5) data analytics, (6) feedback loops for decision making, and (7) outcomes assessment. Analyses will include basic descriptive and outlier analyses using machine learning algorithms, which will involve creating features from time-stamps, “don’t know” rates, and skip rates. We will also obtain labeled data from self-filled surveys and build models using k-fold cross-validation on a training data set using both supervised and unsupervised learning algorithms. Based on these models, results will be fed back to the field through various feedback loops.

Results: Data collection began in late October 2019 and will run through March 2020. We expect to submit quality assurance results by August 2020.

Conclusions: Machine learning is underutilized as a tool to improve survey data quality in low-resource settings. Study findings are anticipated to improve the overall quality of Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed quality assurance approach has implications for data capture applications used for special surveys as well as in the routine collection of health information by health workers.
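
The feature-creation and cross-validation steps named in the Methods could look roughly like the sketch below. It is an illustration only, not the study's code: the record layout (`start`, `end`, `answers`), the `"dont_know"` answer code, the use of `None` for a skipped question, and both function names are assumptions made for this example.

```python
from datetime import datetime

# Hypothetical answer codes; the real survey tool's coding is not given in the abstract.
DONT_KNOW = "dont_know"
SKIPPED = None


def interview_features(record):
    """Derive per-interview features from time-stamps and answers.

    `record` is an assumed layout: ISO-format "start" and "end" strings
    plus a list of "answers".
    """
    start = datetime.fromisoformat(record["start"])
    end = datetime.fromisoformat(record["end"])
    answers = record["answers"]
    n = len(answers)
    return {
        # Interview length in minutes, from the two time-stamps.
        "duration_min": (end - start).total_seconds() / 60,
        # Share of questions answered "don't know".
        "dont_know_rate": sum(a == DONT_KNOW for a in answers) / n if n else 0.0,
        # Share of questions that were skipped.
        "skip_rate": sum(a is SKIPPED for a in answers) / n if n else 0.0,
    }


def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_items) if i < start or i >= stop]
        yield train, test
        start = stop


if __name__ == "__main__":
    # Made-up interviews, for demonstration only.
    records = [
        {"start": "2019-11-01T10:00:00", "end": "2019-11-01T10:45:00",
         "answers": ["yes", "dont_know", None, "no"]},
        {"start": "2019-11-01T11:00:00", "end": "2019-11-01T11:08:00",
         "answers": ["dont_know", "dont_know", None, None]},
    ]
    for rec in records:
        print(interview_features(rec))
    for train, test in k_fold_indices(5, 2):
        print(train, test)
```

In practice the folds would usually be shuffled, and the resulting features would feed whichever supervised or unsupervised models the team selects; the abstract does not name them, so none is shown here.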

Original language	English (US)
Article number	e17619
Journal	JMIR Research Protocols
Volume	9
Issue number	8
DOIs	https://doi.org/10.2196/17619
State	Published - Aug 2020

ASJC Scopus subject areas

General Medicine

Cite this

@article{0f4830c1b047421089d549b7ca4bc6eb,

title = "Using machine learning to optimize the quality of survey data: Protocol for a use case in India",

abstract = "Background: Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally used quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly in light of the increasing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability of coefficients for test-retest interviews. Machine learning is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in the data collection, which could compromise quality. Objective: This study aimed to describe methods for the quality assessment of a household survey using both traditional methods as well as machine learning analytics. Methods: In the Kilkari impact evaluation{\textquoteright}s end-line survey amongst postpartum women (n=5095) in Madhya Pradesh, India, we plan to use both traditional and machine learning–based quality assurance procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The quality assurance strategy aims to identify biases and other impediments to data quality and includes seven main components: (1) tool development, (2) enumerator recruitment and training, (3) field coordination, (4) field monitoring, (5) data analytics, (6) feedback loops for decision making, and (7) outcomes assessment. Analyses will include basic descriptive and outlier analyses using machine learning algorithms, which will involve creating features from time-stamps, “don{\textquoteright}t know” rates, and skip rates. 
We will also obtain labeled data from self-filled surveys, and build models using k-folds cross-validation on a training data set using both supervised and unsupervised learning algorithms. Based on these models, results will be fed back to the field through various feedback loops. Results: Data collection began in late October 2019 and will span through March 2020. We expect to submit quality assurance results by August 2020. Conclusions: Machine learning is underutilized as a tool to improve survey data quality in low resource settings. Study findings are anticipated to improve the overall quality of Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed quality assurance approach has implications for data capture applications used for special surveys as well as in the routine collection of health information by health workers.",

author = "Neha Shah and Diwakar Mohan and Bashingwa, {Jean Juste Harisson} and Osama Ummer and Arpita Chakraborty and LeFevre, {Amnesty E.}",

note = "Publisher Copyright: {\textcopyright}Neha Shah, Diwakar Mohan, Jean Juste Harisson Bashingwa, Osama Ummer, Arpita Chakraborty, Amnesty E.",

year = "2020",

month = aug,

doi = "10.2196/17619",

language = "English (US)",

volume = "9",

journal = "JMIR Research Protocols",

issn = "1929-0748",

publisher = "JMIR Publications Inc.",

number = "8",

}

TY - JOUR

T1 - Using machine learning to optimize the quality of survey data

T2 - Protocol for a use case in India

AU - Shah, Neha

AU - Mohan, Diwakar

AU - Bashingwa, Jean Juste Harisson

AU - Ummer, Osama

AU - Chakraborty, Arpita

AU - LeFevre, Amnesty E.

N1 - Publisher Copyright: ©Neha Shah, Diwakar Mohan, Jean Juste Harisson Bashingwa, Osama Ummer, Arpita Chakraborty, Amnesty E.

PY - 2020/8

Y1 - 2020/8

AB - Background: Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally used quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly in light of the increasing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability of coefficients for test-retest interviews. Machine learning is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in the data collection, which could compromise quality. Objective: This study aimed to describe methods for the quality assessment of a household survey using both traditional methods as well as machine learning analytics. Methods: In the Kilkari impact evaluation’s end-line survey amongst postpartum women (n=5095) in Madhya Pradesh, India, we plan to use both traditional and machine learning–based quality assurance procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The quality assurance strategy aims to identify biases and other impediments to data quality and includes seven main components: (1) tool development, (2) enumerator recruitment and training, (3) field coordination, (4) field monitoring, (5) data analytics, (6) feedback loops for decision making, and (7) outcomes assessment. Analyses will include basic descriptive and outlier analyses using machine learning algorithms, which will involve creating features from time-stamps, “don’t know” rates, and skip rates. We will also obtain labeled data from self-filled surveys, and build models using k-folds cross-validation on a training data set using both supervised and unsupervised learning algorithms. 
Based on these models, results will be fed back to the field through various feedback loops. Results: Data collection began in late October 2019 and will span through March 2020. We expect to submit quality assurance results by August 2020. Conclusions: Machine learning is underutilized as a tool to improve survey data quality in low resource settings. Study findings are anticipated to improve the overall quality of Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed quality assurance approach has implications for data capture applications used for special surveys as well as in the routine collection of health information by health workers.

UR - http://www.scopus.com/inward/record.url?scp=85091516574&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85091516574&partnerID=8YFLogxK

U2 - 10.2196/17619

DO - 10.2196/17619

M3 - Review article

C2 - 32755886

AN - SCOPUS:85091516574

SN - 1929-0748

VL - 9

JO - JMIR Research Protocols

JF - JMIR Research Protocols

IS - 8

M1 - e17619

ER -
