Fast, Exact Bootstrap Principal Component Analysis for p > 1 Million

Research output: Contribution to journal › Article

Abstract

Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components (PCs) from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap PCs, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap PCs are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first three PCs based on 1000 bootstrap samples to be calculated on a standard laptop in 47 min, as opposed to approximately 4 days with standard methods. Supplementary materials for this article are available online.
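
The computational idea summarized above can be illustrated in a few lines of code: decompose the original p × n data matrix once, resample subjects within the resulting n-dimensional coordinate system, and only map back to p dimensions when a summary is needed. The NumPy sketch below is an illustrative reading of the abstract, not the authors' implementation; the variable names, the sign-alignment rule, and the way standard errors are computed from the covariance of the low-dimensional coordinates are assumptions made for this example.

```python
# Minimal NumPy sketch of the low-dimensional bootstrap PCA idea described in
# the abstract (p >> n). Variable names, the sign-alignment rule, and the
# standard-error formula below are illustrative assumptions, not the authors'
# implementation.
import numpy as np

rng = np.random.default_rng(0)
p, n, B, K = 5000, 50, 200, 3          # variables, subjects, bootstrap reps, PCs kept

X = rng.standard_normal((p, n))        # columns are subjects
X -= X.mean(axis=1, keepdims=True)     # center each variable across subjects

# Thin SVD of the original sample: X = V @ diag(d) @ Ut, with V of size p x n.
V, d, Ut = np.linalg.svd(X, full_matrices=False)
DUt = np.diag(d) @ Ut                  # n x n factor; X = V @ DUt exactly

A_boot = np.empty((B, n, K))           # low-dim coordinates of bootstrap PCs in basis V
for b in range(B):
    idx = rng.integers(0, n, size=n)   # resample subjects with replacement
    # Bootstrap sample X[:, idx] = V @ DUt[:, idx]; only the small n x n factor
    # needs a new SVD, so no p-dimensional decomposition is recomputed.
    A, s, Wt = np.linalg.svd(DUt[:, idx], full_matrices=False)
    # Align signs with the original PCs (each PC is only defined up to sign).
    signs = np.where(np.diagonal(A)[:K] < 0, -1.0, 1.0)
    A_boot[b] = A[:, :K] * signs       # k-th bootstrap PC = V @ A_boot[b, :, k]

# Elementwise bootstrap standard errors of the first K PCs, obtained from the
# n x n covariance of the coordinates; the B p-dimensional PCs are never stored.
se = np.empty((p, K))
for k in range(K):
    cov_k = np.cov(A_boot[:, :, k], rowvar=False)               # n x n
    se[:, k] = np.sqrt(np.einsum('jr,rs,js->j', V, cov_k, V))   # diag(V cov V')

print(se.shape)                        # (p, K): one standard error per PC element
```

Under this sketch, the expensive p-dimensional SVD is performed only once; each bootstrap replicate requires only an n × n SVD, which is what makes dimensions like those quoted above (p ≈ 3 million, n = 352) tractable on a laptop.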

Original language: English (US)
Pages (from-to): 846-860
Number of pages: 15
Journal: Journal of the American Statistical Association
Volume: 111
Issue number: 514
DOI: 10.1080/01621459.2015.1062383
State: Published - Apr 2 2016

Keywords

  • Functional data analysis
  • Image analysis
  • PCA
  • Singular value decomposition
  • SVD

ASJC Scopus subject areas

  • Statistics and Probability
  • Statistics, Probability and Uncertainty

Cite this

Fast, Exact Bootstrap Principal Component Analysis for p > 1 Million. / Fisher, Aaron; Caffo, Brian S; Schwartz, Brian S; Zipunnikov, Vadim.

In: Journal of the American Statistical Association, Vol. 111, No. 514, 02.04.2016, p. 846-860.

@article{0475072fc6e14bd98d788af572a4036d,
title = "Fast, Exact Bootstrap Principal Component Analysis for p > 1 Million",
abstract = "Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components (PCs) from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap PCs, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap PCs are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first three PCs based on 1000 bootstrap samples to be calculated on a standard laptop in 47 min, as opposed to approximately 4 days with standard methods. Supplementary materials for this article are available online.",
keywords = "Functional data analysis, Image analysis, PCA, Singular value decomposition, SVD",
author = "Aaron Fisher and Caffo, {Brian S} and Schwartz, {Brian S} and Vadim Zipunnikov",
year = "2016",
month = "4",
day = "2",
doi = "10.1080/01621459.2015.1062383",
language = "English (US)",
volume = "111",
pages = "846--860",
journal = "Journal of the American Statistical Association",
issn = "0162-1459",
publisher = "Taylor and Francis Ltd.",
number = "514",

}
