Fast, Exact Bootstrap Principal Component Analysis for p > 1 Million

Aaron Fisher; Brian Caffo; Brian Schwartz; Vadim Zipunnikov

doi:10.1080/01621459.2015.1062383

Bloomberg School of Public Health

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components (PCs) from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap PCs, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap PCs are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first three PCs based on 1000 bootstrap samples to be calculated on a standard laptop in 47 min, as opposed to approximately 4 days with standard methods. Supplementary materials for this article are available online.
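The subspace argument in the abstract can be illustrated with a short NumPy sketch. This is not the authors' code: the function names, the column-per-subject layout of `Y`, the assumption that p ≥ n, and the sign-alignment rule are all illustrative choices made here. The idea is to take one SVD of the centered p × n data matrix, Y = V D Uᵀ, and then run each bootstrap PCA on the small n × n matrix D Uᵀ with resampled columns. The p-dimensional bootstrap PCs are V times the resulting n-dimensional coordinates, and element-wise standard errors follow from the covariance of those coordinates without forming any p-dimensional bootstrap component.

```python
import numpy as np


def svd_basis(Y):
    """Center each row of the p x n matrix Y (columns = subjects, p >= n assumed).

    Returns V (p x n, orthonormal columns) and the n x n matrix D U^T,
    so that the centered data equal V @ (D U^T).
    """
    Yc = Y - Y.mean(axis=1, keepdims=True)
    V, d, Ut = np.linalg.svd(Yc, full_matrices=False)
    return V, d[:, None] * Ut


def lowdim_bootstrap_pcs(DUt, idx, k):
    """PCA of one bootstrap sample, computed in n-dimensional coordinates.

    idx holds the resampled subject indices. Returns the n x k coordinates A
    (the p-dimensional bootstrap PCs would be V @ A) and the k leading
    singular values.
    """
    M = DUt[:, idx]                           # resample subjects (columns)
    M = M - M.mean(axis=1, keepdims=True)     # recenter the bootstrap sample
    A, s, _ = np.linalg.svd(M, full_matrices=False)
    A = A[:, :k]
    # Illustrative sign rule: original PC j has coordinates e_j, so flip each
    # bootstrap PC to have a non-negative inner product with it.
    signs = np.where(np.diag(A) < 0, -1.0, 1.0)
    return A * signs, s[:k]


def bootstrap_pc_standard_errors(Y, k=3, n_boot=200, seed=0):
    """Element-wise bootstrap standard errors of the first k PCs."""
    rng = np.random.default_rng(seed)
    p, n = Y.shape
    V, DUt = svd_basis(Y)
    coords = np.empty((n_boot, n, k))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        coords[b], _ = lowdim_bootstrap_pcs(DUt, idx, k)
    se = np.empty((p, k))
    for j in range(k):
        C = np.cov(coords[:, :, j], rowvar=False)      # n x n covariance
        var = np.einsum("ij,jk,ik->i", V, C, V)        # diag(V C V^T)
        se[:, j] = np.sqrt(np.maximum(var, 0.0))       # guard against rounding
    return V, coords, se
```

Under these assumptions each bootstrap iteration costs an n × n SVD rather than a p × n one, and only the n × k coordinate arrays are stored per sample.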

Original language	English (US)
Pages (from-to)	846-860
Number of pages	15
Journal	Journal of the American Statistical Association
Volume	111
Issue number	514
DOIs	https://doi.org/10.1080/01621459.2015.1062383
State	Published - Apr 2 2016

Keywords

Functional data analysis
Image analysis
PCA
SVD
Singular value decomposition

ASJC Scopus subject areas

Statistics and Probability
Statistics, Probability and Uncertainty

Access to Document

10.1080/01621459.2015.1062383

Cite this

@article{0475072fc6e14bd98d788af572a4036d,

title = "Fast, Exact Bootstrap Principal Component Analysis for p > 1 Million",

abstract = "Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components (PCs) from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap PCs, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap PCs are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first three PCs based on 1000 bootstrap samples to be calculated on a standard laptop in 47 min, as opposed to approximately 4 days with standard methods. Supplementary materials for this article are available online.",

keywords = "Functional data analysis, Image analysis, PCA, SVD, Singular value decomposition",

author = "Aaron Fisher and Brian Caffo and Brian Schwartz and Vadim Zipunnikov",

note = "Publisher Copyright: {\textcopyright} 2016 American Statistical Association.",

year = "2016",

month = apr,

day = "2",

doi = "10.1080/01621459.2015.1062383",

language = "English (US)",

volume = "111",

pages = "846--860",

journal = "Journal of the American Statistical Association",

issn = "0162-1459",

publisher = "Taylor and Francis Ltd.",

number = "514",

}

TY - JOUR

T1 - Fast, Exact Bootstrap Principal Component Analysis for p > 1 Million

AU - Fisher, Aaron

AU - Caffo, Brian

AU - Schwartz, Brian

AU - Zipunnikov, Vadim

PY - 2016/4/2

Y1 - 2016/4/2

N2 - Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components (PCs) from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap PCs, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap PCs are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first three PCs based on 1000 bootstrap samples to be calculated on a standard laptop in 47 min, as opposed to approximately 4 days with standard methods. Supplementary materials for this article are available online.

KW - Functional data analysis

KW - Image analysis

KW - PCA

KW - SVD

KW - Singular value decomposition

UR - http://www.scopus.com/inward/record.url?scp=84983331305&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84983331305&partnerID=8YFLogxK

U2 - 10.1080/01621459.2015.1062383

DO - 10.1080/01621459.2015.1062383

M3 - Article

C2 - 27616801

AN - SCOPUS:84983331305

SN - 0162-1459

VL - 111

SP - 846

EP - 860

JO - Journal of the American Statistical Association

JF - Journal of the American Statistical Association

IS - 514

ER -
