Efficient selection of feature sets possessing high coefficients of determination based on incremental determinations

Ronaldo F. Hashimoto, Edward R. Dougherty, Marcel Brun, Zheng Zheng Zhou, Michael L. Bittner, Jeffrey M. Trent

Research output: Contribution to journal › Article

Abstract

Feature selection is problematic when the number of potential features is very large. Absent distribution knowledge, to select a best feature set of a certain size requires that all feature sets of that size be examined. This paper considers the question in the context of variable selection for prediction based on the coefficient of determination (CoD). The CoD varies between 0 and 1, and measures the degree to which prediction is improved by using the features relative to prediction in the absence of the features. It examines the following heuristic: if we wish to find feature sets of size m with CoD exceeding δ, what is the effect of only considering a feature set if it contains a subset with CoD exceeding λ < δ? This means that if the subsets do not possess sufficiently high CoD, then it is assumed that the feature set itself cannot possess the required CoD. As it stands, the heuristic cannot be applied since one would have to know the CoDs beforehand. It is meaningfully posed by assuming a prior distribution on the CoDs. Then one can pose the question in a Bayesian framework by considering the probability P(θ > δ | max{θ1, θ2, ⋯, θv} < λ), where θ is the CoD of the feature set and θ1, θ2, ⋯, θv are the CoDs of the subsets. Such probabilities allow a rigorous analysis of the following decision procedure: the feature set is examined if max{θ1, θ2, ⋯, θv} ≥ λ. Computational saving increases as λ increases, but the probability of missing desirable feature sets increases as the increment δ - λ decreases; conversely, computational saving goes down as λ decreases, but the probability of missing desirable feature sets decreases as δ - λ increases. The paper considers various loss measures pertaining to omitting feature sets based on the criteria. After specializing the matter to binary features, it considers a simulation model, and then applies the theory in the context of microarray-based genomic CoD analysis. It also provides optimal computational algorithms.
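
A minimal sketch of the decision procedure described in the abstract appears below. It is illustrative only and is not the authors' implementation or their optimal algorithms: it assumes binary features and a binary target, estimates each CoD by resubstitution as the relative reduction in prediction error over the best constant predictor, with the feature-based predictor taken to be the majority label per observed feature pattern, and the simulated data, function names, and threshold values (δ = 0.9, λ = 0.1) are invented for the example.

    from collections import Counter, defaultdict
    from itertools import combinations
    from math import comb

    import numpy as np


    def empirical_cod(X, y, features):
        """Resubstitution estimate of the CoD of `features` for predicting binary y."""
        n = len(y)
        # Error of the best constant (featureless) predictor of y.
        e0 = min(np.mean(y == 0), np.mean(y == 1))
        if e0 == 0.0:
            return 0.0  # y is constant; nothing to improve (convention used in this sketch)
        # Predictor over observed patterns: majority label per feature pattern.
        counts = defaultdict(Counter)
        for row, label in zip(X[:, list(features)], y):
            counts[tuple(row)][int(label)] += 1
        misclassified = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        return (e0 - misclassified / n) / e0


    def select_feature_sets(X, y, m, delta, lam):
        """Size-m feature sets with CoD > delta, screened by the lambda-threshold heuristic."""
        d = X.shape[1]
        # CoDs of all (m-1)-subsets, computed once up front and reused below.
        sub_cod = {s: empirical_cod(X, y, s) for s in combinations(range(d), m - 1)}
        selected, evaluated = [], 0
        for fs in combinations(range(d), m):
            # Skip fs unless at least one of its (m-1)-subsets already reaches lambda.
            if max(sub_cod[s] for s in combinations(fs, m - 1)) < lam:
                continue
            evaluated += 1
            cod = empirical_cod(X, y, fs)
            if cod > delta:
                selected.append((fs, round(float(cod), 3)))
        return selected, evaluated


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.integers(0, 2, size=(200, 8))    # eight hypothetical binary features
        y = X[:, 0] | (X[:, 1] & X[:, 2])        # target driven by features 0, 1 and 2
        sets, evaluated = select_feature_sets(X, y, m=3, delta=0.9, lam=0.1)
        print(f"evaluated {evaluated} of {comb(8, 3)} candidate sets; selected: {sets}")

With eight features and m = 3, the sketch computes the 28 two-feature CoDs once and reuses them to screen the 56 three-feature candidates, so a candidate's full CoD is evaluated only when at least one of its subsets already reaches λ.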

Original language: English (US)
Pages (from-to): 695-712
Number of pages: 18
Journal: Signal Processing
Volume: 83
Issue number: 4
DOIs: 10.1016/S0165-1684(02)00468-1
State: Published - Apr 2003
Externally published: Yes

Fingerprint

  • Microarrays
  • Set theory
  • Feature extraction

Keywords

  • Coefficient of determination
  • Feature selection
  • Gene microarray
  • Optimal classifier

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Efficient selection of feature sets possessing high coefficients of determination based on incremental determinations. / Hashimoto, Ronaldo F.; Dougherty, Edward R.; Brun, Marcel; Zhou, Zheng Zheng; Bittner, Michael L.; Trent, Jeffrey M.

In: Signal Processing, Vol. 83, No. 4, 04.2003, p. 695-712.

@article{539fcbcf9120451db02e6a704622e82e,
title = "Efficient selection of feature sets possessing high coefficients of determination based on incremental determinations",
keywords = "Coefficient of determination, Feature selection, Gene microarray, Optimal classifier",
author = "Hashimoto, {Ronaldo F.} and Dougherty, {Edward R.} and Marcel Brun and Zhou, {Zheng Zheng} and Bittner, {Michael L.} and Trent, {Jeffrey M.}",
year = "2003",
month = "4",
doi = "10.1016/S0165-1684(02)00468-1",
language = "English (US)",
volume = "83",
pages = "695--712",
journal = "Signal Processing",
issn = "0165-1684",
publisher = "Elsevier",
number = "4",

}

TY - JOUR

T1 - Efficient selection of feature sets possessing high coefficients of determination based on incremental determinations

AU - Hashimoto, Ronaldo F.

AU - Dougherty, Edward R.

AU - Brun, Marcel

AU - Zhou, Zheng Zheng

AU - Bittner, Michael L.

AU - Trent, Jeffrey M.

PY - 2003/4

Y1 - 2003/4

KW - Coefficient of determination

KW - Feature selection

KW - Gene microarray

KW - Optimal classifier

UR - http://www.scopus.com/inward/record.url?scp=0037399780&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0037399780&partnerID=8YFLogxK

U2 - 10.1016/S0165-1684(02)00468-1

DO - 10.1016/S0165-1684(02)00468-1

M3 - Article

AN - SCOPUS:0037399780

VL - 83

SP - 695

EP - 712

JO - Signal Processing

JF - Signal Processing

SN - 0165-1684

IS - 4

ER -