Abstract
Gene expression data from microarrays have been successfully applied to class prediction, where the purpose is to classify and predict the diagnostic category of a sample from its gene expression profile. A typical microarray dataset consists of expression levels for a large number of genes measured on a relatively small number of samples. As a consequence, a basic and important question in class prediction is how to identify a small subset of informative genes that contribute the most to the classification task. Many methods have been proposed, but most focus on two-class problems, such as discrimination between normal and disease samples. This paper addresses the selection of informative genes for multi-class prediction problems by considering all the classes jointly. Our approach is based on the power of the genes to discriminate among the different classes (e.g., tumor types) and on the correlation between genes. We model the expression levels of a given gene with a one-way analysis of variance model with heterogeneous variances, and determine the discriminatory power of the gene by a test statistic designed to test the equality of the class means. In other words, the discriminatory power of a gene is associated with a Behrens-Fisher problem. Informative genes are chosen such that each selected gene has high discriminatory power and the correlation between any pair of selected genes is low. The test statistics considered in this paper are the ANOVA F test statistic, the Brown-Forsythe test statistic, the Cochran test statistic, and the Welch test statistic. Their performance is evaluated with several classification methods applied to two publicly available microarray datasets. The results show that the Brown-Forsythe test statistic achieves the best performance.
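As a rough illustration of the selection idea described above, the following Python sketch scores each gene with the Brown-Forsythe statistic for equality of class means, written here in its commonly cited form F* = Σ n_i (ȳ_i − ȳ)² / Σ (1 − n_i/N) s_i², and then keeps high-scoring genes whose pairwise correlation with the genes already kept is low. The abstract does not give the exact selection rule, so the greedy pass, the function names `brown_forsythe_stat` and `select_genes`, and the parameters `n_genes` and `max_abs_corr` are assumptions made for illustration only, not the authors' procedure.

```python
import numpy as np


def brown_forsythe_stat(X, y):
    """Per-gene Brown-Forsythe statistic for equality of class means.

    X: array of shape (n_samples, n_genes); y: class labels of length n_samples.
    Assumes every class has at least two samples and nonzero variance.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    N = X.shape[0]
    grand_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        # Between-class part: n_i * (class mean - grand mean)^2
        num += n_c * (Xc.mean(axis=0) - grand_mean) ** 2
        # Heteroscedastic denominator: (1 - n_i / N) * s_i^2
        den += (1.0 - n_c / N) * Xc.var(axis=0, ddof=1)
    return num / den


def select_genes(X, y, n_genes=10, max_abs_corr=0.5):
    """Greedy pass (an assumption, not the paper's stated rule): visit genes in
    decreasing score order and keep a gene only if its absolute Pearson
    correlation with every gene already kept is at most max_abs_corr."""
    X = np.asarray(X, dtype=float)
    scores = brown_forsythe_stat(X, y)
    # Standardize columns so that a mean of products is a Pearson correlation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    selected = []
    for g in np.argsort(scores)[::-1]:
        if all(abs(np.mean(Z[:, g] * Z[:, s])) <= max_abs_corr for s in selected):
            selected.append(int(g))
        if len(selected) == n_genes:
            break
    return selected


if __name__ == "__main__":
    # Synthetic data for demonstration only: 3 classes, 20 samples each, 6 genes.
    rng = np.random.default_rng(0)
    y = np.repeat([0, 1, 2], 20)
    X = rng.normal(size=(60, 6))
    X[:, 0] += 3 * y                                   # informative gene
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=60)     # near-duplicate of gene 0
    X[:, 2] += 3 * (y == 1)                            # second informative gene
    print(select_genes(X, y, n_genes=3))
```

In this toy setup the near-duplicate gene is skipped because it is almost perfectly correlated with the gene it copies, which is the behavior the low-correlation criterion is meant to produce. Swapping in the ANOVA F, Cochran, or Welch statistic would only require replacing the scoring function.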