STrenD: Subspace Trend Discovery

Latest revision as of 15:13, 24 February 2015

Description

The goal of this project is to develop unsupervised algorithms for discovering previously unknown subspace trends in high-dimensional data sets without the benefit of prior information. A subspace trend is a sustained pattern of gradual, progressive change within an unknown subset of feature dimensions. Subspace trend discovery is fundamentally challenged by irrelevant data dimensions, noise, outliers, and confusion among multiple subspace trends that are driven by independent factors and mixed in with each other. These factors can obscure the trends in traditional visualizations based on dimension reduction and projection. We aim to efficiently select trend-relevant features and derive meaningful 2-D or 3-D visualizations. The proposed algorithm is broadly applicable to exploratory analysis of high-dimensional data, including visualization, hypothesis generation, knowledge discovery, and prediction in diverse applications.

Please find more details in the following paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7015603

Software Interface

Fig. 1 Software interface

1. Load a tab-delimited text file. If columns are features and rows are samples, use File/Load Table; if columns are samples and rows are features, use File/Load Rotated Table.

2. Click Calculate to run feature clustering and compute the pair-wise neighborhood similarity (NS) matrix.

3. Auto selection: click select to threshold the NS matrix automatically and obtain a list of non-overlapping feature subsets (size >= 3). The largest subset, at the top of the list, is selected by default.

4. Manual selection: click select to display the co-clustered NS matrix, then select a group of features with high NS values by pressing the left mouse button on the top-left starting square and releasing it on the bottom-right ending square. Feature cluster indices can also be typed into the editor, separated by commas. When Continuous is checked, the previous selection is kept; otherwise it is erased.

5. Click Visualize to produce a 2-D or 3-D visualization using t-SNE (if "dimension" is set higher than 3, the result is shown in 2-D using a selected pair of dimensions).

6. Click MST-ordered Heatmap to display a heatmap with rows arranged by the depth-first order of the MST built on the selected data and columns arranged by hierarchical clustering of the features. The selected features are separated from the rest by a red line.
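
The row ordering in step 6 can be illustrated with a minimal Python sketch. It is not the STrenD implementation (which is written in C++): the Euclidean distance, the choice of row 0 as the root, and the function name mst_depth_first_order are all assumptions made for illustration. The sketch builds a minimum spanning tree with Prim's algorithm and returns the row indices in depth-first (preorder) order.

```python
import math


def mst_depth_first_order(rows, root=0):
    """Return row indices in depth-first order of a Euclidean MST.

    Assumptions (not taken from STrenD): Euclidean distance between rows,
    Prim's algorithm for the MST, and `root` as the traversal start.
    """
    n = len(rows)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each row to the tree
    parent = [-1] * n
    children = [[] for _ in range(n)]
    best[root] = 0.0
    for _ in range(n):
        # Pick the row outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            children[parent[u]].append(u)
        # Relax the edges from the newly added row.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(rows[u], rows[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    # Iterative preorder traversal of the tree.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(reversed(children[u]))
    return order


if __name__ == "__main__":
    samples = [[0.0], [10.0], [1.0], [11.0]]
    print(mst_depth_first_order(samples))
```

Reordering the heatmap rows by the returned indices places samples that are adjacent in the tree next to each other, which is what makes a gradual trend visible as a smooth gradient.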


Download

STrenD-v1.0 (implemented in C++) is available for download:

Windows 64 bit:

Release: https://github.com/YanXuHappygela/STrenD-release-1.0

Source code: https://github.com/YanXuHappygela/STrenD-source-1.0

A MATLAB wrapper is coming soon.

If you have any problems with the software, please report them to yansoftwareus@gmail.com. Thank you.

Test on Cell Cycle Microarray data

For the test dataset "cellCycleMicroarray.txt" with default parameter settings:

File/Load Rotated Table (cellCycleMicroarray.txt) -> Auto selection: select -> Visualize -> MST-ordered Heatmap
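
The difference between a regular and a rotated table, as used in the first step of this workflow, can be sketched in Python. The exact file layout STrenD expects is not documented here, so the sketch assumes a header row and a leading label column. The function name load_table and its rotated parameter are hypothetical.

```python
import csv
import io


def load_table(text, rotated=False):
    """Parse tab-delimited text into (sample_names, feature_names, matrix).

    Assumed layout (hypothetical): the first row holds column labels and the
    first column holds row labels. With rotated=True the input has features
    in rows and samples in columns, so it is transposed.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    col_labels = rows[0][1:]
    row_labels = [r[0] for r in rows[1:]]
    matrix = [[float(x) for x in r[1:]] for r in rows[1:]]
    if rotated:
        # Transpose so that rows are samples and columns are features.
        matrix = [list(col) for col in zip(*matrix)]
        row_labels, col_labels = col_labels, row_labels
    return row_labels, col_labels, matrix


if __name__ == "__main__":
    text = "id\ts1\ts2\nf1\t1\t2\nf2\t3\t4\nf3\t5\t6\n"
    print(load_table(text, rotated=True))
```

Either way the result has samples in rows and features in columns, which is the orientation the remaining steps are described in.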

Actively-linked Visualization

Fig. 2 2D projection of the data with selected features. Selection in the table and 2D scatter plot are synchronized.


Fig. 3 3D projection of the data with selected features and the MST-ordered Heatmap.

Output Files

For test dataset "cellCycleMicroarray.txt" with 17 samples of 3196 dimensions, clustering sigma = 0.8, k = 4:

1. 3196_17_0.8_clustering.txt: agglomerative clustering result, containing index and feature names;

2. 3196_17_0.8_4_NS.txt: pair-wise neighborhood similarity matrix of feature clusters;

3. Shanbhag.txt: intermediate outputs for Shanbhag thresholding;

4. 3196_17_0.8_4_AutoSelFeatures.txt: selected feature index and names;

5. data_selected_vis.txt: table of normalized data with selected features for visualization;

6. vis_coordinates.txt: output coordinates for visualization after dimension reduction by t-SNE.
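
The file names above appear to encode the number of features, the number of samples, the clustering sigma, and k, in that order. The Python sketch below reproduces that pattern for locating outputs in a script. The pattern is inferred from this one example, not from the STrenD source, and the function name output_file_names is hypothetical.

```python
def output_file_names(n_features, n_samples, sigma, k):
    """Build the expected output file names for one run.

    The naming pattern is inferred from the cell cycle example only
    (3196 features, 17 samples, sigma = 0.8, k = 4); it is an assumption.
    """
    prefix = f"{n_features}_{n_samples}_{sigma}"
    return {
        "clustering": f"{prefix}_clustering.txt",
        "ns_matrix": f"{prefix}_{k}_NS.txt",
        "shanbhag": "Shanbhag.txt",
        "auto_selected": f"{prefix}_{k}_AutoSelFeatures.txt",
        "selected_data": "data_selected_vis.txt",
        "coordinates": "vis_coordinates.txt",
    }


if __name__ == "__main__":
    for key, name in output_file_names(3196, 17, 0.8, 4).items():
        print(key, name)
```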