A highly efficient design strategy for regression with outcome pooling

Emily M. Mitchell, Robert H. Lyles, Amita K. Manatunga, Neil J. Perkins, Enrique F. Schisterman

Research output: Contribution to journal › Article

Abstract

The potential for research involving biospecimens can be hindered by the prohibitive cost of performing laboratory assays on individual samples. To mitigate this cost, strategies such as randomly selecting a portion of specimens for analysis or randomly pooling specimens prior to performing laboratory assays may be employed. These techniques, while effective in reducing cost, are often accompanied by a considerable loss of statistical efficiency. We propose a novel pooling strategy based on the k-means clustering algorithm to reduce laboratory costs while maintaining a high level of statistical efficiency when predictor variables are measured on all subjects, but the outcome of interest is assessed in pools. We perform simulations motivated by the BioCycle study to compare this k-means pooling strategy with current pooling and selection techniques under simple and multiple linear regression models. While all of the methods considered produce unbiased estimates and confidence intervals with appropriate coverage, pooling under k-means clustering provides the most precise estimates, closely approximating results from the full data and losing minimal precision as the total number of pools decreases. The benefits of k-means clustering evident in the simulation study are then applied to an analysis of the BioCycle dataset. In conclusion, when the number of lab tests is limited by budget, pooling specimens based on k-means clustering prior to performing lab assays can be an effective way to save money with minimal information loss in a regression setting.
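
As a rough illustration of the design described above, the sketch below clusters subjects on a single predictor with a basic k-means (Lloyd's) routine, forms one pool per cluster, and fits a regression to the pool-level data. It is not the authors' implementation: the function names, the simulated data, the choice to treat a pooled assay as the average of the individual outcomes, and the use of least squares weighted by pool size are all assumptions made for illustration.

```python
import random


def kmeans_1d(x, k, iters=100):
    """Basic Lloyd's algorithm on one predictor; returns a cluster label per subject."""
    xs = sorted(x)
    # Deterministic start: centres at evenly spaced order statistics.
    centres = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: (xi - centres[j]) ** 2) for xi in x]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [xi for xi, lab in zip(x, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = sum(members) / len(members)
    return labels


def pool(x, y, labels):
    """One pool per label: (pool size, mean predictor, mean outcome).

    Assumption: the lab value for a pool equals the average of the
    individual outcomes, so it is simulated here by averaging y.
    """
    sums = {}
    for xi, yi, lab in zip(x, y, labels):
        n, sx, sy = sums.get(lab, (0, 0.0, 0.0))
        sums[lab] = (n + 1, sx + xi, sy + yi)
    return [(n, sx / n, sy / n) for n, sx, sy in sums.values()]


def wls_line(pools):
    """Least squares on pool means, weighted by pool size; returns (intercept, slope)."""
    w = sum(n for n, _, _ in pools)
    xbar = sum(n * px for n, px, _ in pools) / w
    ybar = sum(n * py for n, _, py in pools) / w
    sxx = sum(n * (px - xbar) ** 2 for n, px, _ in pools)
    sxy = sum(n * (px - xbar) * (py - ybar) for n, px, py in pools)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical simulated data: true intercept 1, true slope 2.
random.seed(1)
n_subjects, n_pools = 200, 20
x = [random.gauss(0, 1) for _ in range(n_subjects)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

# k-means pooling: subjects with similar predictor values share a pool.
kmeans_pools = pool(x, y, kmeans_1d(x, n_pools))

# Random pooling: x is i.i.d., so cycling through pool labels is a random assignment.
random_labels = [i % n_pools for i in range(n_subjects)]
random_pools = pool(x, y, random_labels)

print("k-means pooling (intercept, slope):", wls_line(kmeans_pools))
print("random pooling  (intercept, slope):", wls_line(random_pools))
```

Under these assumptions, the precision of the slope depends on how much the pool-level predictor means vary. Grouping subjects with similar predictor values preserves most of that spread, whereas random pooling averages it away, which is consistent with the efficiency gain the abstract reports.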

Original language: English (US)
Pages (from-to): 5028-5040
Number of pages: 13
Journal: Statistics in Medicine
Volume: 33
Issue number: 28
DOI: https://doi.org/10.1002/sim.6305
State: Published - Dec 10 2014
Externally published: Yes

Keywords

  • k-means clustering
  • Biomarkers
  • Design
  • Pooling
  • Regression analysis

ASJC Scopus subject areas

  • Epidemiology
  • Statistics and Probability

Cite this

Mitchell, E. M., Lyles, R. H., Manatunga, A. K., Perkins, N. J., & Schisterman, E. F. (2014). A highly efficient design strategy for regression with outcome pooling. Statistics in Medicine, 33(28), 5028-5040. https://doi.org/10.1002/sim.6305

@article{80e99349b98c4a76aa61274d76b430da,
title = "A highly efficient design strategy for regression with outcome pooling",
keywords = "k-means clustering, Biomarkers, Design, Pooling, Regression analysis",
author = "Mitchell, {Emily M.} and Lyles, {Robert H.} and Manatunga, {Amita K.} and Perkins, {Neil J.} and Schisterman, {Enrique F.}",
year = "2014",
month = "12",
day = "10",
doi = "10.1002/sim.6305",
language = "English (US)",
volume = "33",
pages = "5028--5040",
journal = "Statistics in Medicine",
issn = "0277-6715",
publisher = "John Wiley and Sons Ltd",
number = "28",

}

TY - JOUR

T1 - A highly efficient design strategy for regression with outcome pooling

AU - Mitchell, Emily M.

AU - Lyles, Robert H.

AU - Manatunga, Amita K.

AU - Perkins, Neil J.

AU - Schisterman, Enrique F.

PY - 2014/12/10

Y1 - 2014/12/10

KW - k-means clustering

KW - Biomarkers

KW - Design

KW - Pooling

KW - Regression analysis

UR - http://www.scopus.com/inward/record.url?scp=84908886689&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84908886689&partnerID=8YFLogxK

U2 - 10.1002/sim.6305

DO - 10.1002/sim.6305

M3 - Article

C2 - 25220822

AN - SCOPUS:84908886689

VL - 33

SP - 5028

EP - 5040

JO - Statistics in Medicine

JF - Statistics in Medicine

SN - 0277-6715

IS - 28

ER -