### Abstract

Several figures of merit (FOMs) have been proposed to describe multiclass classification performance. Among the earliest and most widely known of these is the three-class Hotelling trace (3-HT). The goal of this paper is to present theoretical and empirical data demonstrating the failure of the 3-HT as a measure of three-class task performance. To this end, we contrast it with a newly proposed three-class FOM, the volume under the three-class receiver operating characteristic (ROC) surface (VUS). The VUS is obtained from a decision-theory-based three-class ROC analysis method that has been proved to extend the decision theoretic, linear discriminant analysis (LDA), and psychophysical foundations of binary ROC analysis to a three-class paradigm. We demonstrate empirically that the VUS and 3-HT do not, in general, have a monotonic relationship when describing three-class task performance. Numerical experiments demonstrated that the VUS provided reasonable results, while the 3-HT failed to distinguish the case where all objects could be perfectly classified from the case where only one pair of the classes could be perfectly classified. We provide theoretical explanations of this failure of the 3-HT. The significance of this work goes beyond merely demonstrating the problems of the 3-HT: it demonstrates that a FOM that is mathematically correct and has a strong theoretical basis can provide results that violate a common-sense understanding of three-class task performance. This fact raises the question of "how to evaluate a classification performance evaluation method?" We believe the answer to this question lies in the theoretical foundations of binary ROC analysis. We have thus contrasted the two FOMs in terms of three fundamental theories underlying binary ROC analysis: decision theory, binary linear discriminant analysis, and the equivalence of two psychophysical classification procedures. These theoretical investigations demonstrated the importance of extending and unifying all the fundamental theories of binary classification in the development of a three-class FOM; violating one of these fundamental binary classification theories may, as it did for the L-HT, provide predictions of three-class task performance that do not agree with a common-sense understanding of three-class task performance.
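The kind of disagreement described in the abstract can be illustrated with a small numerical sketch. This is not the authors' experiment. It assumes one-dimensional Gaussian classes with unit variance and equal priors, computes the Hotelling trace in its common form tr(S_w^{-1} S_b), and replaces the paper's decision-theoretic VUS with a simpler ordering-based stand-in: the fraction of triples, one sample per class, whose scalar scores fall in the correct order. The function names `hotelling_trace` and `ordering_vus` and all parameter values are illustrative assumptions.

```python
import numpy as np


def hotelling_trace(means, covs, priors):
    """Hotelling trace tr(S_w^{-1} S_b) from population class parameters."""
    means = np.asarray(means, dtype=float)    # shape (classes, d)
    covs = np.asarray(covs, dtype=float)      # shape (classes, d, d)
    priors = np.asarray(priors, dtype=float)  # shape (classes,)
    grand_mean = priors @ means
    s_w = np.einsum("i,ijk->jk", priors, covs)            # within-class scatter
    diffs = means - grand_mean
    s_b = np.einsum("i,ij,ik->jk", priors, diffs, diffs)  # between-class scatter
    return float(np.trace(np.linalg.solve(s_w, s_b)))


def ordering_vus(x1, x2, x3):
    """Fraction of triples with x1 < x2 < x3 (ordering-based stand-in for VUS)."""
    x1, x3 = np.sort(x1), np.sort(x3)
    below = np.searchsorted(x1, x2, side="left")             # x1 values below each x2
    above = x3.size - np.searchsorted(x3, x2, side="right")  # x3 values above each x2
    return float(np.sum(below * above) / (x1.size * x2.size * x3.size))


rng = np.random.default_rng(0)
priors = [1 / 3, 1 / 3, 1 / 3]
unit_covs = np.ones((3, 1, 1))  # assumed: unit variance for every class

# Case A: classes 1 and 2 are identical; only class 3 is separable.
means_a = [0.0, 0.0, 30.0]
# Case B: all three classes are far apart; spacing chosen to match case A's trace.
d = np.sqrt(300.0)
means_b = [-d, 0.0, d]

j_a = hotelling_trace(np.reshape(means_a, (3, 1)), unit_covs, priors)
j_b = hotelling_trace(np.reshape(means_b, (3, 1)), unit_covs, priors)

vus_a = ordering_vus(*[rng.normal(m, 1.0, 500) for m in means_a])
vus_b = ordering_vus(*[rng.normal(m, 1.0, 500) for m in means_b])

print(f"case A: Hotelling trace = {j_a:.3f}, ordering VUS = {vus_a:.3f}")
print(f"case B: Hotelling trace = {j_b:.3f}, ordering VUS = {vus_b:.3f}")
```

Under these assumptions the two cases have the same Hotelling trace by construction, while the ordering-based volume separates them: in case A the first two classes are indistinguishable, so only about half of the triples can be ordered correctly, whereas in case B essentially all of them can. For reference, random guessing gives 1/6 for this ordering-based quantity.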

| Original language | English (US) |
|---|---|
| Article number | 4580126 |
| Pages (from-to) | 185-193 |
| Number of pages | 9 |
| Journal | IEEE Transactions on Medical Imaging |
| Volume | 28 |
| Issue number | 2 |
| DOIs | https://doi.org/10.1109/TMI.2008.928919 |
| State | Published - Feb 2009 |

### Keywords

- L-class Hotelling trace
- L-class linear discriminant analysis
- Receiver operating characteristic (ROC) analysis
- Three-class classification

### ASJC Scopus subject areas

- Electrical and Electronic Engineering
- Computer Science Applications
- Radiological and Ultrasound Technology
- Software

### Cite this

**The validity of three-class Hotelling trace (3-HT) in describing three-class task performance: Comparison of three-class volume under ROC surface (VUS) and 3-HT.** / He, Xin; Frey, Eric.

Research output: Contribution to journal › Article

*IEEE Transactions on Medical Imaging*, vol. 28, no. 2, 4580126, pp. 185-193. https://doi.org/10.1109/TMI.2008.928919


TY - JOUR

T1 - The validity of three-class Hotelling trace (3-HT) in describing three-class task performance

T2 - Comparison of three-class volume under ROC surface (VUS) and 3-HT

AU - He, Xin

AU - Frey, Eric

PY - 2009/2

Y1 - 2009/2

AB - Several figures of merit (FOMs) have been proposed to describe multiclass classification performance. Among the earliest and most widely known of these is the three-class Hotelling trace (3-HT). The goal of this paper is to present theoretical and empirical data demonstrating the failure of the 3-HT as a measure of three-class task performance. To this end, we contrast it with a newly proposed three-class FOM, the volume under the three-class receiver operating characteristic (ROC) surface (VUS). The VUS is obtained from a decision-theory-based three-class ROC analysis method that has been proved to extend the decision theoretic, linear discriminant analysis (LDA), and psychophysical foundations of binary ROC analysis to a three-class paradigm. We demonstrate empirically that the VUS and 3-HT do not, in general, have a monotonic relationship when describing three-class task performance. Numerical experiments demonstrated that the VUS provided reasonable results, while the 3-HT failed to distinguish the case where all objects could be perfectly classified from the case where only one pair of the classes could be perfectly classified. We provide theoretical explanations of this failure of the 3-HT. The significance of this work goes beyond merely demonstrating the problems of the 3-HT: it demonstrates that a FOM that is mathematically correct and has a strong theoretical basis can provide results that violate a common-sense understanding of three-class task performance. This fact raises the question of "how to evaluate a classification performance evaluation method?" We believe the answer to this question lies in the theoretical foundations of binary ROC analysis. We have thus contrasted the two FOMs in terms of three fundamental theories underlying binary ROC analysis: decision theory, binary linear discriminant analysis, and the equivalence of two psychophysical classification procedures. These theoretical investigations demonstrated the importance of extending and unifying all the fundamental theories of binary classification in the development of a three-class FOM; violating one of these fundamental binary classification theories may, as it did for the L-HT, provide predictions of three-class task performance that do not agree with a common-sense understanding of three-class task performance.

KW - L-class Hotelling trace

KW - L-class linear discriminant analysis

KW - Receiver operating characteristic (ROC) analysis

KW - Three-class classification

UR - http://www.scopus.com/inward/record.url?scp=59449095907&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=59449095907&partnerID=8YFLogxK

U2 - 10.1109/TMI.2008.928919

DO - 10.1109/TMI.2008.928919

M3 - Article

VL - 28

SP - 185

EP - 193

JO - IEEE Transactions on Medical Imaging

JF - IEEE Transactions on Medical Imaging

SN - 0278-0062

IS - 2

M1 - 4580126

ER -