Abstract
Identifying cancer-related genes is important for understanding complex genetic diseases. Although many machine learning algorithms have been proposed to identify disease-related genes, they often either perform poorly on cancer-related genes that exhibit locus heterogeneity or cannot be applied to predict genes related to an individual disease because too few positive instances are available (the imbalanced classification problem). To overcome these two issues, this study proposes a two-step logistic regression (LR) based algorithm for identifying individual-cancer-related genes. Step 1 generates a set of high-potential cancer-class-related genes; step 2 then applies a second round of the LR-based algorithm to this smaller dataset to identify individual-cancer-related genes. Numerical experiments show that the proposed two-step LR-based algorithm not only works well on locus heterogeneity data but also handles the imbalanced classification problem well. The individual-cancer-related gene identification experiments achieve AUC values of around 0.85 when the posterior probability threshold is chosen between 0.3 and 0.6. All evaluations are conducted using leave-one-out cross-validation.
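
The two-step procedure summarized above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the study's implementation. The gradient-descent LR fit, the L2 penalty, the use of leave-one-out posteriors for the step 1 filter, the function names (fit_lr, loo_posteriors, two_step, demo), and the synthetic two-feature toy data are all assumptions made for illustration. Only the two-step structure, the posterior probability threshold, the AUC metric, and leave-one-out evaluation are taken from the text.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_lr(X, y, lr=0.5, epochs=500, l2=0.01):
    """Fit logistic regression by batch gradient descent; returns (weights, bias).
    Hyperparameters are illustrative assumptions, not values from the study."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (gwj / n + l2 * wj) for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def posterior(model, x):
    # Posterior probability of the positive class under a fitted model.
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def loo_posteriors(X, y):
    """Leave-one-out: score each instance with a model trained on all the others."""
    scores = []
    for i in range(len(X)):
        model = fit_lr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        scores.append(posterior(model, X[i]))
    return scores

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else (0.5 if p == q else 0.0) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def two_step(X, y_class, y_individual, threshold=0.5):
    # Step 1: keep genes whose posterior of being cancer-class-related
    # reaches the threshold (assumed here to be a leave-one-out posterior).
    step1 = loo_posteriors(X, y_class)
    keep = [i for i, s in enumerate(step1) if s >= threshold]
    # Step 2: rerun LR on the smaller candidate set for one individual cancer.
    X2 = [X[i] for i in keep]
    y2 = [y_individual[i] for i in keep]
    return keep, loo_posteriors(X2, y2), y2

def demo():
    # Synthetic toy data (hypothetical): feature 0 separates cancer-class-related
    # genes from unrelated ones; feature 1 separates the individual cancer.
    unrelated = [[-1.0, 0.1], [-0.9, -0.1], [-1.1, 0.0],
                 [-0.8, 0.05], [-1.0, -0.05], [-0.95, 0.1]]
    rel_pos = [[0.9, 0.9], [1.0, 0.8], [0.8, 1.0]]
    rel_neg = [[0.9, -0.9], [1.0, -0.8], [0.85, -1.0]]
    X = unrelated + rel_pos + rel_neg
    y_class = [0] * 6 + [1] * 6
    y_individual = [0] * 6 + [1] * 3 + [0] * 3
    keep, scores, labels = two_step(X, y_class, y_individual, threshold=0.5)
    return keep, auc(scores, labels)

if __name__ == "__main__":
    kept, value = demo()
    print("genes kept after step 1:", kept)
    print("step 2 leave-one-out AUC:", value)
```

Filtering first means the second model sees a much smaller candidate set in which positives for an individual cancer are less heavily outnumbered, which is one plausible reading of how the two-step design eases the class imbalance. The toy data are cleanly separable by construction, so the printed AUC says nothing about the results reported in the study.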