Identifying Differential Gene Sets using the Linear Combination of Genes with Maximum AUC

Zhanfeng Wang; Chen-An Tsai; Yuan-chin I Chang

doi:10.4172/jpb.1000216

Abstract

Gene Set Enrichment Analysis (GSEA) uses the gene expression profiles of functionally related gene sets, defined by Gene Ontology (GO) categories or other previously defined biological classes, to assess the significance of gene sets associated with clinical outcomes or phenotypes, and is the most widely used method for gene analysis. However, little attention has been given to the problem from a classification perspective. In this paper, we use gene expression data together with prior biological knowledge to identify differential gene sets, that is, gene sets strongly associated with the ability to distinguish phenotypic classes. We propose two non-parametric methods to identify differential gene sets using the area under the receiver operating characteristic (ROC) curve (AUC) of linear risk scores of gene sets, which are obtained through a parsimonious threshold-independent gene selection method within gene sets. The AUC-based statistics and the AUC values obtained from cross-validation of the linear risk scores are calculated and used as indexes to identify differential gene sets. The discrimination abilities of gene sets are summarized, and gene sets that possess discrimination ability are selected via a prescribed AUC statistic threshold or a predefined cross-validation AUC threshold. Moreover, we further distinguish the impacts of individual gene sets in terms of discrimination ability based on the absolute values of the linear combination coefficients. The proposed methods allow investigators to identify enriched gene sets with high discrimination ability and to discover the contributions of genes within a gene set via the corresponding linear combination coefficients. Numerical studies using both synthesized data and a series of gene expression data sets are conducted to evaluate the performance of the proposed methods, and the results are compared to the random forests classification method and other hypothesis-testing-based approaches. The results show that our proposed methods are reliable and satisfactory in detecting enrichment and can provide an insightful alternative to gene set testing. The R script and supplementary information are available at http://idv.sinica.edu.tw/ ycchang/software.html.
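To make the two indexes mentioned above concrete, the following Python sketch computes the empirical AUC of a linear risk score and a cross-validated AUC for one gene set. It is an illustration under stated assumptions, not the authors' R script: the function names are hypothetical, and the coefficients come from a simple class-mean difference that stands in for the parsimonious threshold-independent gene selection method described in the abstract. Held-out scores from different folds are pooled before the AUC is computed, which is one common convention among several.

```python
import random


def empirical_auc(scores, labels):
    """Mann-Whitney estimate of the AUC: the fraction of (case, control)
    pairs in which the case scores higher, counting ties as 1/2."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def linear_score(x, coefs):
    """Linear risk score: weighted sum of the expression values of one sample."""
    return sum(c * v for c, v in zip(coefs, x))


def mean_difference_coefs(X, y):
    """Stand-in coefficient rule (an assumption, NOT the paper's method):
    per-gene difference between the class-1 mean and the class-0 mean."""
    pos = [row for row, lab in zip(X, y) if lab == 1]
    neg = [row for row, lab in zip(X, y) if lab == 0]
    n_genes = len(X[0])
    return [
        sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        for j in range(n_genes)
    ]


def cross_validated_auc(X, y, k=5, seed=0):
    """Fit coefficients on k-1 folds, score the held-out fold, then compute
    one AUC over the pooled held-out scores. Folds are stratified by class,
    so each class needs at least two samples."""
    rng = random.Random(seed)
    pos_idx = [i for i, lab in enumerate(y) if lab == 1]
    neg_idx = [i for i, lab in enumerate(y) if lab == 0]
    rng.shuffle(pos_idx)
    rng.shuffle(neg_idx)
    ordered = pos_idx + neg_idx
    folds = [ordered[f::k] for f in range(k)]
    scores = [0.0] * len(y)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in ordered if i not in held_out]
        coefs = mean_difference_coefs([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            scores[i] = linear_score(X[i], coefs)
    return empirical_auc(scores, y)


# Toy check: of the four (case, control) pairs, three are ordered correctly.
print(empirical_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

In this sketch a gene set would be flagged as differential when `cross_validated_auc` exceeds a chosen threshold, and the absolute values of the fitted coefficients would give a rough ranking of the genes within the set.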

Journal of Proteomics & Bioinformatics
Open Access
