Exploration, normalization, and summaries of high density oligonucleotide array probe level data.

Rafael A. Irizarry, Bridget Hobbs, Francois Collin, Yasmin D. Beazer-Barclay, Kristen J. Antonellis, Uwe Scherf, Terence P. Speed

Research output: Contribution to journal › Article

Abstract

In this paper we report exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip system with the objective of improving upon currently used measures of gene expression. Our analyses make use of three data sets: a small experimental study consisting of five MGU74A mouse GeneChip arrays; part of the data from an extensive spike-in study conducted by Gene Logic and Wyeth's Genetics Institute involving 95 HG-U95A human GeneChip arrays; and part of a dilution study conducted by Gene Logic involving 75 HG-U95A GeneChip arrays. We display some familiar features of the perfect match and mismatch probe (PM and MM) values of these data, and examine the variance-mean relationship with probe-level data from probes believed to be defective, and so delivering noise only. We explain why we need to normalize the arrays to one another using probe-level intensities. We then examine the behavior of the PM and MM using spike-in data and assess three commonly used summary measures: Affymetrix's (i) average difference (AvDiff) and (ii) MAS 5.0 signal, and (iii) the Li and Wong multiplicative model-based expression index (MBEI). The exploratory data analyses of the probe-level data motivate a new summary measure that is a robust multi-array average (RMA) of background-adjusted, normalized, and log-transformed PM values. We evaluate the four expression summary measures using the dilution study data, assessing their behavior in terms of bias, variance and (for MBEI and RMA) model fit. Finally, we evaluate the algorithms in terms of their ability to detect known levels of differential expression using the spike-in data. We conclude that there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities.
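The RMA pipeline named in the abstract (normalize, log-transform, robustly average PM values across arrays) can be illustrated with a minimal sketch. The abstract does not say which normalization or robust fitting procedure is used, so quantile normalization and a median-polish fit are assumptions here, background adjustment is omitted entirely, and all function and variable names are hypothetical.

```python
import numpy as np

def quantile_normalize(x):
    """Give every column (array) the same empirical distribution."""
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each value in its column
    reference = np.sort(x, axis=0).mean(axis=1)  # mean of the sorted columns
    return reference[ranks]

def median_polish(y, n_iter=10):
    """Fit y[i, j] ~ overall + row[i] + col[j] by alternating median sweeps."""
    resid = np.array(y, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])   # probe-specific effects (affinities)
    col = np.zeros(resid.shape[1])   # array-specific effects
    for _ in range(n_iter):
        r = np.median(resid, axis=1)
        resid -= r[:, None]
        row += r
        d = np.median(col)
        col -= d
        overall += d
        c = np.median(resid, axis=0)
        resid -= c[None, :]
        col += c
        d = np.median(row)
        row -= d
        overall += d
    return overall, row, col, resid

def rma_like(pm, probe_set_ids):
    """pm: probes x arrays matrix of positive PM intensities.

    Background adjustment is NOT performed here (assumed done upstream).
    Returns a dict mapping probe set id -> log2 expression value per array.
    """
    logged = np.log2(quantile_normalize(np.asarray(pm, dtype=float)))
    ids = np.asarray(probe_set_ids)
    summaries = {}
    for ps in np.unique(ids):
        overall, _, col, _ = median_polish(logged[ids == ps])
        summaries[str(ps)] = overall + col
    return summaries

# Toy input: the second array is the first scaled by 2, a pure array effect.
pm = np.array([[100.0, 200.0],
               [400.0, 800.0],
               [50.0, 100.0],
               [300.0, 600.0]])
ids = np.array(["a", "a", "b", "b"])
expr = rma_like(pm, ids)
```

In this toy input the two arrays differ only by a constant scale factor, so after normalization each probe set receives the same summary on both arrays. The row effects returned by `median_polish` play the role of the probe-specific affinities that the abstract says the linear model removes.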

Original language: English (US)
Pages (from-to): 249-264
Number of pages: 16
Journal: Biostatistics
Volume: 4
Issue number: 2
State: Published - Apr 2003

ASJC Scopus subject areas

  • Medicine (all)
  • Statistics and Probability
  • Statistics, Probability and Uncertainty

Cite this

Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., & Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2), 249-264.

@article{461f1e01eec24447a62b609fcc22d96b,
title = "Exploration, normalization, and summaries of high density oligonucleotide array probe level data.",
author = "Irizarry, {Rafael A.} and Bridget Hobbs and Francois Collin and Beazer-Barclay, {Yasmin D.} and Antonellis, {Kristen J.} and Uwe Scherf and Speed, {Terence P.}",
year = "2003",
month = "4",
language = "English (US)",
volume = "4",
pages = "249--264",
journal = "Biostatistics",
issn = "1465-4644",
publisher = "Oxford University Press",
number = "2",

}

TY - JOUR

T1 - Exploration, normalization, and summaries of high density oligonucleotide array probe level data.

AU - Irizarry, Rafael A.

AU - Hobbs, Bridget

AU - Collin, Francois

AU - Beazer-Barclay, Yasmin D.

AU - Antonellis, Kristen J.

AU - Scherf, Uwe

AU - Speed, Terence P.

PY - 2003/4

Y1 - 2003/4

UR - http://www.scopus.com/inward/record.url?scp=0142121516&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0142121516&partnerID=8YFLogxK

M3 - Article

C2 - 12925520

AN - SCOPUS:0142121516

VL - 4

SP - 249

EP - 264

JO - Biostatistics

JF - Biostatistics

SN - 1465-4644

IS - 2

ER -