Algorithmic Information Theoretic Prediction & Discovery
Mcayouette (Talk | contribs) |
(→Usage) |
||
(23 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | Our approach uses algorithmic information theory to analyze results from | + | Our approach uses algorithmic information theory to analyze results from segmenting and tracking objects in the image sequences. Multi-target tracking is used to establish temporal correspondences between the segmented objects. A multi-resolution algorithmic information theoretic distance measure computes pairwise comparisons between time courses of object feature values. This results in an ensemble of distance matrices that are further analyzed using a class of semi-supervised spectral learning techniques. Our approach is semi-supervised. If final outcomes are unknown, it functions as a "tool for discovery," and uses a clustering model based on the notion of randomness deficiency from algorithmic statistics to capture as much meaningful information from image sequence data as possible. If final outcomes are known it becomes semi-supervised, incorporating both labeled and unlabeled data for the pair-wise comparisons and then using cross validation together with supervised spectral learning for the most accurate predictions based on behavior. |
− | + | ||
+ | [http://www.ircm.qc.ca/microsites/cayouette/en/ Link to our our biological collaborators in McGill Canada] | ||
[[Algorithmic Information Theoretic Prediction example]] | [[Algorithmic Information Theoretic Prediction example]] | ||
+ | |||
+ | ==== Download ==== | ||
+ | |||
+ | The algorithmic information theoretic prediction and discovery tools have not been integrated into the FARSIGHT build environment. As a temporary measure, source code and Win 32 binaries can be downloaded from [https://pantherfile.uwm.edu/cohena/www/aitpd%20main.html Andrew Cohen - UWM] | ||
+ | |||
+ | ==== Usage ==== | ||
+ | |||
+ | '''I. Segment and Track''' (folder: .\segment and track) | ||
+ | |||
+ | The first step in the process is to segment and track the image sequence data. This step is typically application-specific; however there are some standard software packages that can help ([http://farsight-toolkit.org/wiki/Main_Page FARSIGHT]). This release includes segmentation and tracking for vertebrate retinal progenitor cells (folder "segment and track") and also for a simulated data set (folder "phantom 2"). | ||
+ | |||
+ | The results of the segmentation and tracking form the input to the algorithmic information theoretic prediction and discovery tools. The Matlab file GenerateDistanceMatrices\ExportTS.m illustrates how to export Matlab data into the text format used by the algorithmic information theoretic prediction and discovery tools. All the objects being analyzed must be exported to a single file (except in the case of the real-time solution described below). | ||
+ | |||
+ | '''II. Generate distance matrices''' (folder: .\GenerateDistanceMatrices) | ||
+ | |||
+ | The next step is to generate an ensemble of distance matrices. One distance matrix is generated for each feature subset and level of quantization. The folder "GenerateDistanceMatrices” contains software to accomplish this. | ||
+ | |||
+ | Usage: | ||
+ | |||
+ | GenerateDistanceMatrices <input file (exported in previous step)>,<output folder> | ||
+ | |||
+ | '''III. Algorithmic information theoretic prediction and discovery''' | ||
+ | |||
+ | The remainder of the processing analyzes the distance matrices generated in the previous step. There are two options on how we proceed. In the unsupervised case [1] we find the most concise and meaningful description of the input multidimensional time sequence data by using gap spectral clustering (our approximation to randomness deficiency). We refer to this unsupervised approach as algorithmic information theoretic discovery. In the semi-supervised case [2] we have known outcomes and we wish to find a model capable of most accurately predicting outcomes, as measured by cross validation. We refer to the semi-supervised approach as algorithmic information theoretic prediction. These two steps are detailed below. | ||
+ | |||
+ | '''a. Algorithmic information theoretic discovery''' (folder: .\GapSpectral) | ||
+ | |||
+ | This unsupervised approach is based on using the gap statistic to estimate the amount of meaningful information captured by a given model setting (feature subset ''F'', quantization level ''N'', number of clusters in the data ''k''). Model selection is accomplished by searching over (''F,N,k'') for the settings that maximize the gap statistic and thus the meaningful information. | ||
+ | |||
+ | Usage: | ||
+ | |||
+ | GapSpectral <distance matrices path> | ||
+ | |||
+ | The output is the values of (''F,N,k'') that maximize the gap statistic, and the list of cluster assignments for each input object. | ||
+ | |||
+ | A useful helper file is DrawSpectral.m. This file allows rendering the input time sequence data as points in a k-dimensional space, using the most meaningful (''F,N,k'') discovered above. | ||
+ | |||
+ | |||
+ | '''b. Algorithmic information theoretic prediction''' (folder: .\aitpPC, .\aitpMPI) | ||
+ | |||
+ | When there are known outcomes for some, or all, of the objects being analyzed semi-supervised algorithmic information theoretic prediction can be used. This approach uses a semi-supervised learning algorithm in the spectral domain rather than gap spectral clustering. Cross validation is used to estimate the best model setting (feature subset ''F'', quantization level ''N'', dimensionality of the eigenspace ''k''). | ||
+ | |||
+ | Usage: | ||
+ | |||
+ | aitpPC <distance matrices path> | ||
+ | |||
+ | The file idxTrue.txt must be in the same folder as the distance matrices. The output is the values of (''F,N,k'') that minimize the prediction error as measured by cross validation. | ||
+ | |||
+ | |||
+ | '''IV. Frame rate real-time''' (folder: .\Seg and Track RT, .\aitpRT) | ||
+ | |||
+ | Frame rate real-time refers to the ability to segment and track and generate a prediction before the next image in the sequence is available. | ||
+ | |||
+ | Usage: | ||
+ | |||
+ | GoSegAndTrack.m | ||
+ | commands: | ||
+ | s: start monitoring a new folder | ||
+ | t: finalize the tracking of a given folder | ||
+ | x: export time sequence data for prediction analysis | ||
+ | |||
+ | This Matlab utility starts the segmentation and tracking code. It launches multiple threads of the program SnT.exe. SnT.exe must be compiled using the Matlab compiler (mcc) - the compilation command is in a comment in the file. | ||
+ | GoSegAndTrack is currently designed to start segmenting and tracking a fixed number of all the sub-folders of the initial directory. Additional folders can be added at the command prompt. Whenever a new image arrives in one of the folders, it is automatically segmented and the tracking is updated. | ||
+ | |||
+ | aitpRT.exe <training file> | ||
+ | |||
+ | This program is run in conjunction with GoSegAndTrack.m. It repeatedly prompts for data set ID and tracking ID and then generates predictions. | ||
+ | |||
+ | ---- | ||
+ | |||
+ | == Reference == | ||
+ | |||
+ | [1] A. R. Cohen, C. Bjornsson, S. Temple, G. Banker, and B. Roysam, "Automatic Summarization of Changes in Biological Image Sequences using Algorithmic Information Theory," IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, 2008. | ||
+ | |||
+ | [2] A. R. Cohen, F. Gomes, B. Roysam, and M. Cayouette, "Computational prediction of retinal progenitor cell fate outcomes from time-lapse microscopy data," Nature Methods, in review, 2009. |
Latest revision as of 22:18, 30 May 2009
Our approach uses algorithmic information theory to analyze results from segmenting and tracking objects in the image sequences. Multi-target tracking is used to establish temporal correspondences between the segmented objects. A multi-resolution algorithmic information theoretic distance measure computes pairwise comparisons between time courses of object feature values. This results in an ensemble of distance matrices that are further analyzed using a class of semi-supervised spectral learning techniques. Our approach is semi-supervised. If final outcomes are unknown, it functions as a "tool for discovery," and uses a clustering model based on the notion of randomness deficiency from algorithmic statistics to capture as much meaningful information from image sequence data as possible. If final outcomes are known it becomes semi-supervised, incorporating both labeled and unlabeled data for the pair-wise comparisons and then using cross validation together with supervised spectral learning for the most accurate predictions based on behavior.
Link to our our biological collaborators in McGill Canada
Algorithmic Information Theoretic Prediction example
Download
The algorithmic information theoretic prediction and discovery tools have not been integrated into the FARSIGHT build environment. As a temporary measure, source code and Win 32 binaries can be downloaded from Andrew Cohen - UWM
Usage
I. Segment and Track (folder: .\segment and track)
The first step in the process is to segment and track the image sequence data. This step is typically application-specific; however there are some standard software packages that can help (FARSIGHT). This release includes segmentation and tracking for vertebrate retinal progenitor cells (folder "segment and track") and also for a simulated data set (folder "phantom 2").
The results of the segmentation and tracking form the input to the algorithmic information theoretic prediction and discovery tools. The Matlab file GenerateDistanceMatrices\ExportTS.m illustrates how to export Matlab data into the text format used by the algorithmic information theoretic prediction and discovery tools. All the objects being analyzed must be exported to a single file (except in the case of the real-time solution described below).
II. Generate distance matrices (folder: .\GenerateDistanceMatrices)
The next step is to generate an ensemble of distance matrices. One distance matrix is generated for each feature subset and level of quantization. The folder "GenerateDistanceMatrices” contains software to accomplish this.
Usage:
GenerateDistanceMatrices <input file (exported in previous step)>,<output folder>
III. Algorithmic information theoretic prediction and discovery
The remainder of the processing analyzes the distance matrices generated in the previous step. There are two options on how we proceed. In the unsupervised case [1] we find the most concise and meaningful description of the input multidimensional time sequence data by using gap spectral clustering (our approximation to randomness deficiency). We refer to this unsupervised approach as algorithmic information theoretic discovery. In the semi-supervised case [2] we have known outcomes and we wish to find a model capable of most accurately predicting outcomes, as measured by cross validation. We refer to the semi-supervised approach as algorithmic information theoretic prediction. These two steps are detailed below.
a. Algorithmic information theoretic discovery (folder: .\GapSpectral)
This unsupervised approach is based on using the gap statistic to estimate the amount of meaningful information captured by a given model setting (feature subset F, quantization level N, number of clusters in the data k). Model selection is accomplished by searching over (F,N,k) for the settings that maximize the gap statistic and thus the meaningful information.
Usage:
GapSpectral <distance matrices path>
The output is the values of (F,N,k) that maximize the gap statistic, and the list of cluster assignments for each input object.
A useful helper file is DrawSpectral.m. This file allows rendering the input time sequence data as points in a k-dimensional space, using the most meaningful (F,N,k) discovered above.
b. Algorithmic information theoretic prediction (folder: .\aitpPC, .\aitpMPI)
When there are known outcomes for some, or all, of the objects being analyzed semi-supervised algorithmic information theoretic prediction can be used. This approach uses a semi-supervised learning algorithm in the spectral domain rather than gap spectral clustering. Cross validation is used to estimate the best model setting (feature subset F, quantization level N, dimensionality of the eigenspace k).
Usage:
aitpPC <distance matrices path>
The file idxTrue.txt must be in the same folder as the distance matrices. The output is the values of (F,N,k) that minimize the prediction error as measured by cross validation.
IV. Frame rate real-time (folder: .\Seg and Track RT, .\aitpRT)
Frame rate real-time refers to the ability to segment and track and generate a prediction before the next image in the sequence is available.
Usage:
GoSegAndTrack.m commands: s: start monitoring a new folder t: finalize the tracking of a given folder x: export time sequence data for prediction analysis
This Matlab utility starts the segmentation and tracking code. It launches multiple threads of the program SnT.exe. SnT.exe must be compiled using the Matlab compiler (mcc) - the compilation command is in a comment in the file. GoSegAndTrack is currently designed to start segmenting and tracking a fixed number of all the sub-folders of the initial directory. Additional folders can be added at the command prompt. Whenever a new image arrives in one of the folders, it is automatically segmented and the tracking is updated.
aitpRT.exe <training file>
This program is run in conjunction with GoSegAndTrack.m. It repeatedly prompts for data set ID and tracking ID and then generates predictions.
Reference
[1] A. R. Cohen, C. Bjornsson, S. Temple, G. Banker, and B. Roysam, "Automatic Summarization of Changes in Biological Image Sequences using Algorithmic Information Theory," IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, 2008.
[2] A. R. Cohen, F. Gomes, B. Roysam, and M. Cayouette, "Computational prediction of retinal progenitor cell fate outcomes from time-lapse microscopy data," Nature Methods, in review, 2009.