Strong feature sets from small samples

Seungchan Kim, Edward R. Dougherty, Junior Barrera, Yidong Chen, Michael L. Bittner, Jeffrey M. Trent

Research output: Contribution to journal › Article

Abstract

For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to a large variance, often gives very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution that results from spreading the mass of the sample points, making classification more difficult while maintaining the sample geometry. The algorithm is parameterized by the variance of the spreading distribution. By increasing the spread, the algorithm finds gene sets whose classification accuracy remains strong relative to greater spreading of the sample. The error gives a measure of the strength of the feature set as a function of the spread. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but for distributions spread beyond the sample data. For linear classifiers, the topic of the present paper, the classifiers are derived analytically from the model, thereby providing enormous savings in computation time. The algorithm is applied to cancer classification via cDNA microarrays. In particular, the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the algorithm is used to find gene sets whose expressions can be used to classify BRCA1 and BRCA2 tumors.
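The spreading idea in the abstract can be sketched concretely. The following is a hypothetical illustration, not the authors' exact algorithm: if each sample point is replaced by an isotropic Gaussian of variance sigma², the error of a fixed linear classifier w·x + b under the spread distribution has a closed form, since the probability that a spread point lands on the wrong side of the hyperplane is the standard normal tail of its signed margin divided by sigma. This shows why the error can be computed analytically as a function of the spread.

```python
# Hypothetical sketch of the "spreading" idea: each sample point is spread
# into N(x, sigma^2 I), and the error of the hyperplane w.x + b = 0 under
# the spread distribution is computed in closed form (no Monte Carlo).
import math

def spread_error(points, labels, w, b, sigma):
    """Analytic error of a linear classifier when each sample point is
    spread into an isotropic Gaussian of standard deviation sigma.
    labels are +1 / -1; a spread point is misclassified with probability
    Phi(-y * (w.x + b) / (sigma * ||w||))."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    err = 0.0
    for x, y in zip(points, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        # Standard normal CDF Phi(z) expressed via erf.
        z = -margin / (sigma * norm_w)
        err += 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return err / len(points)

# Toy one-gene example: two separable classes; the analytic error grows
# as the spread sigma increases, measuring the feature's "strength".
pts = [(0.0,), (0.2,), (1.0,), (1.2,)]
lbl = [-1, -1, +1, +1]
for sigma in (0.1, 0.5, 2.0):
    print(sigma, round(spread_error(pts, lbl, (1.0,), -0.6, sigma), 4))
```

A strong feature set is then one whose analytic error stays low even as sigma grows, which is the hedge against overly optimistic small-sample error estimates.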

Original language: English (US)
Pages (from-to): 127-146
Number of pages: 20
Journal: Journal of Computational Biology
Volume: 9
Issue number: 1
ISSN: 1066-5277
Publisher: Mary Ann Liebert Inc.
DOI: 10.1089/10665270252833226
PubMed ID: 11911798
State: Published - 2002
Externally published: Yes


Keywords

  • Cancer
  • Classification
  • Gene expression
  • Perceptron

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics

Cite this

Kim, S., Dougherty, E. R., Barrera, J., Chen, Y., Bittner, M. L., & Trent, J. M. (2002). Strong feature sets from small samples. Journal of Computational Biology, 9(1), 127-146. https://doi.org/10.1089/10665270252833226

