Workshop "Principal Manifolds-2006"
August 24-26, 2006
Leicester, UK


Test datasets (microarray data)


Free exploration of microarray data

Before coming to the workshop, we invite you to have a look at three datasets containing results obtained with a high-throughput experimental technology in molecular biology (microarray data). Principal component analysis and principal manifolds are in high demand for the analysis of this kind of data. It would be very interesting if you could apply your own methods to all or some of these datasets and present the results at the workshop, so that we can compare them.

The task could be free exploratory data analysis: any observation about the structure of the point distribution, and about the relation of this structure to the proposed ab initio gene and sample classifications, is of interest.

Challenging tasks

Following our discussion in Leicester, we have decided to formulate an additional list of problems that participants can try to solve using their own methods. Here is a possible list of problems, which is by no means complete:

  1. Characterize cluster structure of the datasets and evaluate label enrichment in every cluster.
  2. Compare sample classifiers constructed in the full gene space and after projection onto the manifolds. Try to demonstrate that principal manifolds may help to regularize classification problems. Compare with regularization by linear PCA.
  3. Demonstrate that projection onto principal manifolds/curves can be used to recover missing data values. Compare performance with linear methods (regression, PCA).
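
For the first task, one common way to quantify label enrichment in a cluster is the upper tail of the hypergeometric distribution. The block below is only a minimal sketch of that idea; the function name and the toy numbers are assumptions made for illustration and are not part of the workshop materials.

```python
from math import comb

def enrichment_p_value(total, labelled, cluster_size, labelled_in_cluster):
    """Upper-tail hypergeometric probability P(X >= labelled_in_cluster).

    total               -- number of samples in the dataset
    labelled            -- number of samples carrying the label of interest
    cluster_size        -- number of samples in the cluster
    labelled_in_cluster -- samples in the cluster that carry the label
    """
    denominator = comb(total, cluster_size)
    upper = min(labelled, cluster_size)
    tail = sum(
        comb(labelled, k) * comb(total - labelled, cluster_size - k)
        for k in range(labelled_in_cluster, upper + 1)
    )
    return tail / denominator

# Toy numbers (hypothetical): 10 samples, 5 carry the label,
# and a cluster of 5 samples contains all 5 labelled ones.
p = enrichment_p_value(10, 5, 5, 5)
```

A small value of p suggests that the label is over-represented in the cluster; when many clusters and labels are tested, a multiple-testing correction would normally be applied as well.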

Some additional information to facilitate analysis and understanding of the datasets is given at the end of every dataset description.


Data description

The three datasets are similar in nature. Each is a "samples vs genes" table containing logarithms of the expression levels of thousands (n) of genes in tens to hundreds (m) of samples. The first two datasets contain data on tumor samples from breast cancer (Dataset I) and bladder cancer (Dataset II). The third dataset contains data on samples of various normal (healthy) tissues.

One of the basic features of these datasets is that n>>m. A second feature is incompleteness of the data: Dataset III contains a number of gaps (values that were not reliably measured). Datasets I and II do not contain missing values (in Dataset I the missing values were recovered, and in Dataset II all gaps were filtered out). Dataset III is filtered in such a way that every row (gene) has at least 75% complete expression values and every column (sample) has at least 80% complete values.
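
The completeness thresholds quoted above can be checked with a few lines of code. The following is a minimal sketch, assuming the table has already been loaded as a list of rows (genes) with missing entries stored as None; the function names and the toy table are hypothetical.

```python
def completeness(table):
    """Return (row_fractions, column_fractions) of non-missing entries.

    `table` is a list of rows (genes); missing values are None.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_frac = [sum(v is not None for v in row) / n_cols for row in table]
    col_frac = [
        sum(row[j] is not None for row in table) / n_rows for j in range(n_cols)
    ]
    return row_frac, col_frac

def passes_filter(table, min_row=0.75, min_col=0.80):
    """True if every row and every column meets its completeness threshold."""
    row_frac, col_frac = completeness(table)
    return min(row_frac) >= min_row and min(col_frac) >= min_col

# Toy table invented for illustration: 5 genes, 4 samples, one gap.
toy = [
    [1.0, None, 2.0, 3.0],
    [0.5, 0.1, 0.2, 0.3],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
]
ok = passes_filter(toy)
```

The default thresholds mirror the 75% (rows) and 80% (columns) figures stated for Dataset III.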

The datasets can be analysed in two spaces, which we call "gene space" and "sample space". In the gene space every point (a vector of dimension m) is a gene, characterized by its expression in m samples. In the sample space every point (a vector of dimension n) is a sample, characterized by the expression profile of n genes. In the tables given below every row corresponds to a gene; thus, the default space is the gene space. To perform analysis in the sample space, one has to transpose the table (we have done this and provide the transposed tables in the appendix as well).
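
Switching between the two spaces is just a matrix transposition. As a minimal sketch, assuming the numerical table is held as a list of rows (the helper name and the toy values are hypothetical):

```python
def transpose(table):
    """Turn a genes-by-samples table into a samples-by-genes table."""
    return [list(column) for column in zip(*table)]

# Toy values invented for illustration.
gene_space = [
    [0.1, 0.2, 0.3],   # gene 1: a point of dimension m = 3
    [1.0, 2.0, 3.0],   # gene 2
]
sample_space = transpose(gene_space)  # 3 samples, each of dimension n = 2
```

Transposing twice returns the original table, so either orientation can serve as the starting point.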

The rows (genes) in Dataset I and Dataset II are preprocessed such that they have zero mean. This is a standard procedure for exploratory microarray data analysis because only variation of the expression between samples and not the absolute average expression level has biological meaning.
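
The centering step described above amounts to subtracting each gene's mean across samples. A minimal sketch, assuming a table without missing values stored as a list of rows (the function name is hypothetical):

```python
def center_rows(table):
    """Subtract each row's mean so every gene has zero mean across samples."""
    centered = []
    for row in table:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered

# Toy values invented for illustration.
centered = center_rows([[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]])
```

After this step only the variation between samples remains in each row; the absolute expression level of the gene is removed.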


Data format

All three datasets are provided in one standard tab-delimited text format that most applications can read. The first row gives the column names; the first column is always a gene identifier and the remaining columns are the sample names. Table rows correspond to genes, but the transposed table is also given in the appendix files for convenience. Missing values in the table are denoted by the string "NULL". For every table we provide a file with the ab initio (biological) classification of samples and a list of gene names.
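
As an illustration of the layout just described, here is a minimal reader sketch using only the Python standard library. The function name and the toy input are assumptions made for this example; the real files would first have to be unzipped.

```python
import csv
import io

def read_expression_table(handle):
    """Parse a tab-delimited table: a header row, then one gene per row.

    Returns (sample_names, gene_ids, values); "NULL" cells become None.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_names = header[1:]      # first column holds the gene identifiers
    gene_ids, values = [], []
    for row in reader:
        if not row:                # skip blank lines
            continue
        gene_ids.append(row[0])
        values.append(
            [None if cell == "NULL" else float(cell) for cell in row[1:]]
        )
    return sample_names, gene_ids, values

# Toy input invented for illustration; not taken from the real files.
toy = "GENE\tS1\tS2\ng1\t0.5\tNULL\ng2\t-1.0\t2.0\n"
samples, genes, table = read_expression_table(io.StringIO(toy))
```

For a real file, an opened text file object can be passed in place of the io.StringIO wrapper.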

We also provide the same datasets in .dat format, which is suitable for use with the VidaExpert and ViMiDa software. The format is very simple: it consists of a short header specifying sample class information and a body which is simply a tab- or space-delimited text table. The format of the header is the following:

<number_of_columns> <number_of_rows> <Field1_name> <Field1_type> [<Field1_class1>] [<Field1_class2>] ... [<Field1_classm>] <Field2_name> <Field2_type> [<Field2_class1>] [<Field2_class2>] ... [<Field2_classm>] ... <Fieldn_name> <Fieldn_type> [<Fieldn_class1>] [<Fieldn_class2>] ... [<Fieldn_classm>]

The field names should not contain spaces. The Field_type can take only two values: STRING (string of text) and FLOAT (numerical). The Field_classX entries are optional; they also should not contain spaces.

For your convenience, we also provide tab-delimited numerical tables which can be easily imported into Matlab or R if your methods are implemented in those environments. With these tables you need not concern yourself with the origin of the data.


Dataset I - "Five types of breast cancer"

Number of samples: 286

Number of genes: 17816

3 ab initio sample classifications:

Zipped tab-delimited file: d1.txt.zip

Table with ab initio sample classification: d1_sample_classes.txt

Table with gene names and symbols: d1_gene_info.txt

Appendix:

Zipped dat-file (suitable for VidaExpert and ViMiDa): d1.dat.zip

Transposed table for performing analysis in the sample space (every row corresponds to a sample): d1_t.txt.zip

Purely numerical table (to forget about the data origin) for Matlab, R calculations: d1n.txt.zip

Reference:
Wang, Y., Klijn, J.G., Zhang, Y., Sieuwerts, A.M., Look, M.P., Yang, F., Talantov, D., Timmermans, M., Meijer-van Gelder, M.E., Yu, J. et al. (2005) Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365, 671-679.

Additional information

Wang et al. paper PDF

A meaningful classifier should try to distinguish label A from label B. This would make it possible to predict whether a patient will develop metastases within 5 years after surgery.

Short paper synopsis: 76 genes were selected as a predictive "signature" to be used in classification. Prediction was done separately for the ER+ group (60 genes were used) and the ER- group (16 genes were used). The authors report 93% sensitivity (truly aggressive cancers which were predicted to be aggressive) and 48% specificity (truly non-aggressive cancers which were predicted to be non-aggressive) for the classifier. ATTENTION: these classification results were obtained after gene preselection, i.e. after selecting genes which individually already show some separation of A vs B. One should be cautious in comparing these results with "blind" classification in multidimensional space without gene selection.
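
To compare your own classifier with the figures quoted above, sensitivity and specificity can be computed directly from true and predicted labels. This is a minimal sketch; the function name and the toy labels are hypothetical, and which of the dataset labels plays the role of the "positive" (aggressive) class has to be taken from the sample classification file.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive):
    """Return (sensitivity, specificity) for a two-class problem.

    sensitivity = fraction of truly positive samples predicted positive
    specificity = fraction of truly negative samples predicted negative
    """
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels invented for illustration.
truth = ["pos", "pos", "neg", "neg", "neg", "neg"]
guess = ["pos", "neg", "neg", "neg", "neg", "pos"]
sens, spec = sensitivity_specificity(truth, guess, positive="pos")
```

Both quantities should be estimated on samples that were not used for training or gene selection; otherwise they will be optimistic.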


Dataset II - "Three types of bladder cancer"

Number of samples: 40

Number of genes: 3036

2 ab initio sample classifications:

Zipped tab-delimited file: d2f.txt.zip

Table with ab initio sample classification: d2_sample_classes.txt

Table with gene names and symbols: d2_gene_info.txt

Appendix:

Zipped dat-file (suitable for VidaExpert and ViMiDa): d2f.dat.zip

Transposed table for performing analysis in the sample space (every row corresponds to a sample): d2f_t.txt.zip

Purely numerical table (to forget about the data origin) for Matlab, R calculations: d2fn.txt.zip

Reference:
Dyrskjot, L. et al. (2003) Identifying distinct classes of bladder carcinoma using microarrays. Nat. Genet., 33(1), 90-96.

Additional information

L.Dyrskjot et al. paper PDF

Short paper synopsis: hierarchical clustering corresponds well to the three known clinical classes (stages). Stage Ta can be subdivided into two natural sub-classes. Within class Ta one can construct a classifier which distinguishes recurring (bad) Ta tumors from non-recurring Ta tumors, with 75% of samples classified correctly.


Dataset III - "Healthy tissues"

Number of samples: 103

Number of genes: 10383

1 ab initio sample classification:

Zipped tab-delimited file: d3ef.txt.zip

Table with ab initio sample classification: d3_sample_classes.txt

Table with gene names and symbols: d3_gene_info.txt

Appendix:

Zipped dat-file (suitable for VidaExpert and ViMiDa): d3ef.dat.zip

Transposed table for performing analysis in the sample space (every row corresponds to a sample): d3ef_t.txt.zip

Purely numerical table (to forget about the data origin) for Matlab, R calculations: d3efn.txt.zip

Reference:
Shyamsundar R, Kim YH, Higgins JP, Montgomery K, Jorden M, Sethuraman A, van de Rijn M, Botstein D, Brown PO, Pollack JR. (2005) A DNA microarray survey of gene expression in normal human tissues. Genome Biol. 6(3):R22

Additional information

Shyamsundar et al. paper PDF

Short paper synopsis: Unsupervised classification (hierarchical clustering) showed that groups of samples correspond well to different tissues. In a supervised approach (SAM), genes specific to each tissue were identified.


Contacts

Prof. Alexander Gorban
University of Leicester, UK
ag153_at_leicester.ac.uk
http://www.math.le.ac.uk/people/ag153/homepage

Dr. Andrei Zinovyev
Bioinformatics Service of Institut Curie, Paris
andrei.zinovyev_at_curie.fr
http://www.ihes.fr/~zinovyev