STrenD: Subspace Trend Discovery

Latest revision as of 15:13, 24 February 2015


Description

The goal of this project is to develop unsupervised algorithms for discovering previously unknown subspace trends in high-dimensional data sets without the benefit of prior information. A subspace trend is a sustained pattern of gradual, progressive change within an unknown subset of feature dimensions. A fundamental challenge in subspace trend discovery is the presence of irrelevant data dimensions, noise, outliers, and confusion from multiple subspace trends that are driven by independent factors and mixed together. These factors can obscure the trends in traditional visualizations based on dimension reduction and projection. We aim to efficiently select trend-relevant features and derive meaningful 2-D or 3-D visualizations. The proposed algorithm is broadly applicable to exploratory analysis of high-dimensional data, including visualization, hypothesis generation, knowledge discovery, and prediction, across diverse applications.

Please find more details in the following paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7015603

Software Interface

Fig. 1 Software interface

1. Load a tab-delimited text file. If columns are features and rows are samples, use File/Load Table; if columns are samples and rows are features, use File/Load Rotated Table.

2. Click Calculate to run feature clustering and compute the pair-wise neighborhood similarity (NS).

3. Auto selection: click select to apply automatic thresholding to the NS matrix, which produces a list of non-overlapping feature subsets (size >= 3). The largest subset, at the top of the list, is selected by default.

4. Manual selection: click select to display the co-clustered NS matrix, then select a group of features with high NS values by pressing the left mouse button on the top-left starting square and releasing it on the bottom-right ending square. Feature cluster indices can also be typed into the editor, separated by commas. When Continuous is checked, the previous selection is kept; otherwise it is erased.

5. Click Visualize to produce a 2-D or 3-D visualization using t-SNE (if "dimension" is set higher than 3, the result is shown in 2-D using a selected pair of dimensions).

6. Click MST-ordered Heatmap to display a heatmap whose rows are arranged in the depth-first order of the minimum spanning tree (MST) built on the selected data and whose columns are arranged by a hierarchical clustering of the features. The selected ones are separated from the rest by a red line.
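
The row ordering described in step 6 can be illustrated with a minimal Python sketch. It is not the STrenD implementation (which is written in C++): the function name `mst_depth_first_order`, the use of Euclidean distance, the choice of sample 0 as the root, and the order in which child nodes are visited are all assumptions made for illustration.

```python
import math

def mst_depth_first_order(points, root=0):
    """Return sample indices in depth-first order over a Euclidean MST.

    Hypothetical helper: the distance metric, root and child order are
    assumptions, not taken from the STrenD source code.
    """
    n = len(points)
    if n == 0:
        return []
    # Prim's algorithm: grow the tree from `root`, tracking the cheapest
    # known link from the tree to every node that is still outside it.
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    children = [[] for _ in range(n)]
    best_dist[root] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    # Iterative depth-first traversal; children are visited in the order
    # in which they were attached to the tree.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(reversed(children[u]))
    return order
```

Applied to a samples-by-features table restricted to the selected features, the returned index list gives one possible top-to-bottom row order for the heatmap.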


Download

STrenD-v1.0 (implemented in C++) is available for download:

Windows 64 bit:

Release: https://github.com/YanXuHappygela/STrenD-release-1.0

Source code: https://github.com/YanXuHappygela/STrenD-source-1.0

A MATLAB wrapper is coming soon.

If you have any problems with the software, please report them to yansoftwareus@gmail.com. Thank you.

Test on Cell Cycle Microarray data

For the test dataset "cellCycleMicroarray.txt" with default parameter settings:

File/Load Rotated Table (cellCycleMicroarray.txt) -> Auto selection: select -> Visualize -> MST-ordered Heatmap
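
The difference between File/Load Table and File/Load Rotated Table can be sketched in Python as a transpose applied at load time. This is only an illustration: the function `load_table`, and the assumption that the first row and first column hold names, are hypothetical and not taken from the STrenD source or from the layout of cellCycleMicroarray.txt.

```python
import csv
import io

def load_table(text, rotated=False):
    """Parse tab-delimited text into (sample_names, feature_names, values).

    Assumes (hypothetically) that the first row and the first column
    contain names. With rotated=True the input has features as rows and
    samples as columns, so it is transposed to samples-by-features.
    """
    grid = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if rotated:
        # Swap rows and columns so that every row becomes one sample.
        grid = [list(col) for col in zip(*grid)]
    feature_names = grid[0][1:]
    sample_names = [row[0] for row in grid[1:]]
    values = [[float(x) for x in row[1:]] for row in grid[1:]]
    return sample_names, feature_names, values
```

For a microarray file with one gene per row and one sample per column, `load_table(text, rotated=True)` would return one row of values per sample.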

Actively-linked Visualization

Fig. 2 2D projection of the data with selected features. Selection in the table and 2D scatter plot are synchronized.


Fig. 3 3D projection of the data with selected features and the MST-ordered Heatmap.

Output Files

For the test dataset "cellCycleMicroarray.txt" with 17 samples of 3196 dimensions, clustering sigma = 0.8 and k = 4:

1. 3196_17_0.8_clustering.txt: agglomerative clustering result, containing the index and feature names;

2. 3196_17_0.8_4_NS.txt: pair-wise neighborhood similarity matrix of the feature clusters;

3. Shanbhag.txt: intermediate outputs of Shanbhag thresholding;

4. 3196_17_0.8_4_AutoSelFeatures.txt: indices and names of the selected features;

5. data_selected_vis.txt: table of normalized data restricted to the selected features, used for visualization;

6. vis_coordinates.txt: output coordinates for visualization after dimension reduction by t-SNE.
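
The file names above appear to encode the number of dimensions, the number of samples, sigma and k. A small Python sketch of that naming pattern follows; the pattern is inferred only from this single example, and the function `output_file_names` is hypothetical rather than part of STrenD.

```python
def output_file_names(n_features, n_samples, sigma, k):
    """Build the expected output file names (pattern inferred from the example)."""
    prefix = f"{n_features}_{n_samples}_{sigma}"
    return [
        f"{prefix}_clustering.txt",            # agglomerative clustering result
        f"{prefix}_{k}_NS.txt",                # neighborhood similarity matrix
        "Shanbhag.txt",                        # thresholding intermediates
        f"{prefix}_{k}_AutoSelFeatures.txt",   # automatically selected features
        "data_selected_vis.txt",               # normalized data for visualization
        "vis_coordinates.txt",                 # t-SNE output coordinates
    ]
```

Calling `output_file_names(3196, 17, 0.8, 4)` reproduces the six names listed for the test dataset.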
