Journal of Proteomics & Bioinformatics

Open Access

ISSN: 0974-276X

Abstract

Selection of Significant Clusters of Genes based on Ensemble Clustering and Recursive Cluster Elimination (RCE)

Loai Abdallah, Waleed Khalifa, Louise C Showe and Malik Yousef

Background: Advances in technology have facilitated the generation of gene expression data from large numbers of samples and the development of “Big Data” approaches to analysing gene expression in basic and biomedical systems. Even so, these data sets still typically comprise relatively few samples and tens of thousands of variables (gene expression values). A variety of approaches have been developed for searching these gene spaces to select the most informative variables, those that can accurately distinguish one class of subjects or samples from another. However, there is still a need for new approaches that can accurately distinguish biologically different classes of subjects with similar gene expression profiles. We describe a new and promising approach to this problem: a method for identifying significant differentially expressed clusters of genes using Recursive Cluster Elimination (RCE) built on an ensemble clustering approach. We refer to this approach as SVM-RCE-EC (Ensemble Clustering). We show that SVM-RCE-EC improves gene selection and classification accuracy compared with other methods, including the traditional SVM-RCE approach, and that the improvement is particularly evident on difficult data sets that are poorly resolved by other approaches.

Methods: To implement SVM-RCE-EC, we first applied an ensemble clustering method to identify robust gene clusters. We then applied Support Vector Machines (SVMs) with cross-validation to score (rank) those clusters according to their contribution to classification accuracy. The RCE procedure progressively removes the least significant clusters and retains the most significant ones until the most robust, significantly differentially expressed genes between the two classes remain. We compared the classification performance of SVM-RCE-EC with that of a variety of published classification algorithms.
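
The steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' Matlab implementation: the abstract does not say which ensemble method was used, so the co-association consensus over repeated k-means runs is an assumption, as are all function names, parameter values (n_clusters, n_runs, keep, drop_frac) and the synthetic data. For brevity the sketch clusters the genes once and does not re-cluster the survivors after each elimination round.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def ensemble_gene_clusters(X, n_clusters=6, n_runs=10, seed=0):
    """Consensus-cluster genes (columns of X) via a co-association matrix.

    Assumption: the ensemble is built from k-means runs with different seeds.
    """
    genes = X.T  # one row per gene, one column per sample
    n_genes = genes.shape[0]
    coassoc = np.zeros((n_genes, n_genes))
    for run in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=seed + run).fit_predict(genes)
        # Count how often each pair of genes lands in the same base cluster.
        coassoc += labels[:, None] == labels[None, :]
    coassoc /= n_runs
    # Average-linkage clustering on (1 - co-association) gives the consensus.
    dist = 1.0 - coassoc
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def score_genes(X, y, gene_idx, cv):
    """Mean cross-validated accuracy of a linear SVM using only gene_idx."""
    return cross_val_score(SVC(kernel="linear"), X[:, gene_idx], y, cv=cv).mean()


def svm_rce_ec(X, y, n_clusters=6, keep=2, drop_frac=0.3, seed=0):
    """Drop the lowest-scoring gene clusters until `keep` clusters remain."""
    labels = ensemble_gene_clusters(X, n_clusters=n_clusters, seed=seed)
    clusters = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    while len(clusters) > keep:
        scores = [score_genes(X, y, idx, cv) for idx in clusters]
        n_drop = max(1, int(drop_frac * len(clusters)))
        n_drop = min(n_drop, len(clusters) - keep)
        order = np.argsort(scores)  # ascending, so the worst clusters come first
        clusters = [clusters[i] for i in sorted(order[n_drop:])]
    selected = np.sort(np.concatenate(clusters))
    return selected, score_genes(X, y, selected, cv)


# Synthetic two-class data (hypothetical): 40 samples, 60 genes,
# with the first 10 genes shifted upwards in class 1.
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 60
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_genes))
X[y == 1, :10] += 2.0

selected, acc = svm_rce_ec(X, y)
print(len(selected), round(acc, 2))
```

Note that the accuracy returned here is estimated on the same samples that guided the cluster elimination, so it is optimistically biased; an unbiased estimate would need the whole selection procedure nested inside an outer cross-validation loop.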

Results and Conclusion: Using gene clusters selected by the ensemble method enhances classification performance compared with other methods and identifies sets of significant genes that appear to be more biologically meaningful for the system being analyzed. SVM-RCE-EC outperforms several other methods on data representing highly similar sample classes that are difficult to distinguish, and is comparable to other methods on data where the classes are more easily separated. The improved performance of SVM-RCE-EC on difficult data sets is likely because the significant clusters, as determined by the ensemble approach, capture the native structure of the data, whereas SVM-RCE leaves that determination to the user. This hypothesis is supported by the observation that the performance of the clusters generated by SVM-RCE-EC is more robust.

Availability: The Matlab version of SVM-RCE-EC is available on request from the first author and on GitHub (https://github.com/malikyousef/svm-rce-ec).