Biclustering Heatmap

From FarsightWiki
Jump to: navigation, search

Contents

Introduction

Bi-clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix(Here rows represent samples and columns represent features).

We use harmonic analysis for each direction which is suited to the geometrical structure or high dimensional Euclidean settings.

The heatmap generalized is a useful visualization tool to visualize the result of the clustered data.


Hierachical Tree representation

When treating the rows as points of the data set, we build a hierachical partition tree base on it.

The data set could be represent as a set which contains all single elements of it. We apply k-means algorithm on it to obtain a higher level set ,the element of each higher set is the centroid of the cluster. By doing this iteratively, finally we get a single node which is the highest level of the partition tree, its only element is the centroid of the whole data set.

A typical partition tree is shown as the figure on the right

K-means Clustering

K-means is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids shoud be placed in a cunning way because of different location causes different result. So, the better choice is to place them as much as possible far away from each other. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early groupage is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes are done. In other words centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function.

Coupled Geometry

Based on the MSTs, Earth Mover's Distance(EMD) method is used for MSTs and modules to see how each MST fits each module. If the value is large, it means the MST can well present the module progression to some extent. For this step, just click on EMD button when available, it will match all MSTs and modules. The running state is updated in command window.

Heatmap

Once the matching among MSTs and modules has finished, a similarity matrix for modules is built. PSM threshold, progression similarity matrix threhold, is set to determine whether two modules are similar. The similarity is one if the module is matching with its own MST. It approaches zero if the MST doesn't fit the module at all. "Selecting percentage" is the percentage of the values higher than the threshold in the similarity matrix. This param depends on the way you trust the similarity judgement. If you trust the similarity value more, just focus on the threhold; If you trust the percentage more, set the threhold so as to let the percentage fall in the range you would like it to. Click Show PSM, a heatmap would pop out for you to select the similar modules. The heatmap is colored from red to blue. Dark red represents high values and dark blue low values. It's symetric so it's recommended to select along the diagonal by left button down clicking on the left-up corner starting block and left button up clicking on the right-down corner ending block. All the blocks in this symetric square will be chosen and their corresponding module IDs will be filled in the "Input hand-picked modules".

Recommended param setting for :

Set the threshold so that the percentage is 0.2 to 0.3.

Progression Tree

Beyond the auto-filling module IDs by right clicking on the heatmap, you can also edit them yourself. If you want to add IDs, make sure they are seperated by comma. After these are all set, click View Progression, the final progression tree is built. You can select the nodes and view the corresponding items in other views. If the samples have been clustered, a vertex represents a cluster with its size number shown near the vertex.

Progression Heatmap

Progression Heatmap is built from the progression tree. Its row order is the tree node order, and its column order is the feature cluster order, the corresponding hierachical clustering dendrogram is shown above the heatmap. The heatmap is colored based on the normalized feature values. When you click vertex in the progression tree, the corresponding rows and selected feature columns will be selected.