TileMap Readme


/* ------------------------ */
/* README                   */
/* ------------------------ */

TileMap is a tool designed for tiling array analysis. It can be used to identify genomic loci that show transcriptional activities and transcription factor binding patterns of interest.
 

I.   Introduction

II.  Usage

      II.1 General

      II.2 tilemap_importaffy

      II.3 tilemap_norm

      II.4 tilemap

      II.5 tilemap_extract

      II.6 tilemap_plot for visualization

III. Examples

      III.1 Example 1

      III.2 Example 2

      III.3 Example 3

      III.4 Example 4

IV.  Input File Format

      IV.1  Raw Data File (Standard TileMap Data Format)

      IV.2  *.cmpinfo File

      IV.3  import_affy_parameter file

      IV.4  norm_parameter file

      IV.5  tilemap_parameter file

      IV.6  extract_parameter file

V.   Output File Format
      V.1   *.refmask

      V.2   *_pb.sum

      V.3   *_hmm.sum

      V.4   *_ma.sum

      V.5   *.bed

      V.6   *.reg

      V.7   *_transp.txt

      V.8   *_emissp.txt

      V.9   file exported by tilemap_extract


/* ------------------------ */
/* Introduction             */
/* ------------------------ */
TileMap consists of the following parts:

(1) tilemap_importaffy
(2) tilemap_norm
(3) tilemap
(4) tilemap_extract
(5) sample R/Matlab code tilemap_plot.R (or .m) for visualizing tilemap results

tilemap_importaffy can be used to import data from affymetrix arrays and preprocess the data (normalization & create local repeat filters). It converts *.CEL (version 3, ASCII file) and *.BPMAP files into the standard tilemap data format (see file format section).

tilemap_norm can be used to do quantile normalization (Bolstad et al., 2003). If users want to analyze non-affymetrix data, they can provide their own data in the standard tilemap data format, and use tilemap_norm to do normalization. Users do not need to use tilemap_norm if tilemap_importaffy was used to import raw data. tilemap_importaffy already provides options to do normalization.

tilemap is the central part of TileMap. It (i) computes probe-level test-statistics according to the transcriptional or protein binding patterns specified by users; (ii) filters local repeats; and (iii) infers whether a region is of interest by applying HMM or Moving Window Average (MA). The output of tilemap includes *.sum files which provide summary statistics for each probe, a *.bed file which reports regions of interest, and a *.reg file which sorts the reported regions from high to low significance level. The *.bed file can be uploaded directly to the UCSC genome browser to visualize the reported regions.

tilemap_extract can be used to retrieve probes and their summary statistics in user-specified regions. The retrieved data will be saved in tab-delimited ASCII files. These files can be easily loaded into R, Matlab etc. for visualization.

Since TileMap does not provide an integrated GUI at the current stage, we provide tilemap_extract and R/Matlab sample code to facilitate visual checking of the data and tilemap results. We plan to develop an integrated visualization system in the future which will integrate TileMap with downstream analysis programs.


/* ------------------------ */
/* Usage                    */
/* ------------------------ */

#########################
# General               #
#########################

(1) Windows Users
Choose Windows "Start" menu, click "Run", type "cmd", press Enter. A window will be opened for you to input commands. Enter the directory where TileMap is installed using "cd [path]" command, and run TileMap as in examples below.

(2) Unix Users
Enter the directory where TileMap is installed, and run TileMap as in examples below.

(3) File Formats are defined in "Input File Format" and "Output File Format" section.

#########################
# tilemap_importaffy    #
#########################
[Usage]
> tilemap_importaffy [importaffy_parameter_file]

One also needs to have
(a) raw *.CEL (version 3, an ASCII file) files, and
(b) *.BPMAP file (can be downloaded from affymetrix website)

#########################
# tilemap_norm          #
#########################
[Usage]
> tilemap_norm [norm_parameter_file]

One also needs to have
(a) a raw data file in standard tilemap data format.

#########################
# tilemap               #
#########################
[Usage]
> tilemap [tilemap_parameter_file]

One also needs to have
(a) a normalized data file in standard tilemap data format, and
(b) a *.cmpinfo file to specify which pattern one wishes to select.


#########################
# tilemap_extract       #
#########################
[Usage]
> tilemap_extract [extract_parameter_file]

One needs to have
(a) summary statistics computed by tilemap.

 

#########################
# visualization         #
#########################

[Usage]

 

(1) In Matlab:

> tilemap_plot('[file name]')

e.g.

> tilemap_plot('cMycA_tile_chr21_14676034_14678449.txt')

 

(2) In R:

First, in tilemap_plot.R find the line starting with "datapath <- ", and edit it to give the name of the file that contains the data to plot, e.g.

datapath <- "cMycA_tile_chr21_14677034_14677449.txt"

 

Then run tilemap_plot.R

 

(3) Users can modify Matlab and R codes to meet their own needs.


/* ------------------------ */
/* Examples                 */
/* ------------------------ */

Below are several examples of using TileMap. The argument files are mainly in "[tag] = [value]" format. Details about the file formats can be found in "Input File Format" and "Output File Format" section. Users need to edit the parameter files before using TileMap. Only the [value] part needs to be edited. Please do not make any changes to the tags, otherwise the program may interpret parameters in a wrong way.

To help beginning users, detailed instructions on how to set parameters are given in the sample argument files in the examples. For experienced users, succinct versions of these files can be found here:

 

Succinct files:

sample tilemap_importaffy parameter file

sample tilemap_norm parameter file

sample tilemap parameter file

sample *.cmpinfo file

sample tilemap_extract parameter file

 

Both succinct files and files with detailed instructions can be used as input of TileMap. TileMap will automatically ignore lines starting with '#' in parameter files.
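
The "[tag] = [value]" convention and the '#' comment rule can be illustrated with a short Python sketch. This is not part of TileMap; the function name read_simple_parameters is hypothetical, and the sketch assumes every parameter sits on a single line. Multi-line sections such as [Arrays] or [Group ID] are not handled.

```python
def read_simple_parameters(lines):
    """Collect 'tag = value' pairs, skipping blank lines and '#' comments."""
    params = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment or empty line
        if "=" not in line:
            continue  # multi-line sections are not handled in this sketch
        # Split on the last '=' so that text inside the tag is left intact.
        tag, value = line.rsplit("=", 1)
        params[tag.strip()] = value.strip()
    return params


example = [
    "# lines starting with '#' are ignored",
    "[Array number] = 18",
    "[Take log2 before calculation?] (1:yes; 0:no) = 1",
]
params = read_simple_parameters(example)
print(params)
```

Only the value to the right of the last "=" is read, which mirrors the advice above to edit values and leave tags untouched.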


#########################
# Example 1             #
#########################

The goal is to analyze affymetrix tiling arrays. One needs to load raw data from *.CEL and *.BPMAP files, to do normalization, to apply TileMap (HMM), and to visualize a specific region.

/* step 1: load *.CEL and *.BPMAP file, normalization, prepare local repeat filter */
> tilemap_importaffy sample1_importaffy_arg.txt

/* step 2: specify transcriptional or protein binding patterns of interest in *.cmpinfo file */
refer to sample1.cmpinfo

/* step 3: tilemap main procedure */
> tilemap sample1_tilemap_arg.txt

/* step 4: export data for visualization */
> tilemap_extract sample1_extract_arg.txt

/* step 5: visualization */
refer to sample R/Matlab code, run tilemap_plot in Matlab or R.


#########################
# Example 2             #
#########################

Users have their own non-affymetrix data. The data have been organized in standard tilemap data format. Users want to do normalization, apply TileMap (MA) to call regions of interest, and use UMS (unbalanced mixture subtraction) to estimate local false discovery rate.

/* step 1: normalization */
> tilemap_norm sample2_norm_arg.txt

/* step 2: prepare *.cmpinfo file to specify patterns of interest and *.refmask file to define local repeat filters */
refer to sample2.cmpinfo and sample2.refmask

/* step 3: tilemap */
> tilemap sample2_tilemap_arg.txt


#########################
# Example 3             #
#########################

The probe level test-statistics have been calculated in Example 2. Users want to apply HMM-based TileMap to the existing probe level test-statistics. Users wish to provide their own selection statistics for UMS. There is no need to do local repeat filtering.

> tilemap sample3_tilemap_arg.txt


#########################
# Example 4             #
#########################

The normalized data are available. Users want to apply MA-based tilemap and estimate local false discovery rate by permutation test. There is no need to do local repeat filtering.

> tilemap sample4_tilemap_arg.txt

Users also need to prepare sample4.cmpinfo and specify the way to do permutation there.

/* ------------------------ */
/* Input File Format        */
/* ------------------------ */

##################################
# Raw Data File                  #
# (Standard TileMap Data Format) #
##################################
This is a tab-delimited file. Raw tiling array data are organized in the following format.

1st row: 'chromosome', 'position', array ids
2nd row and after: each row corresponds to a probe; probes should be arranged in the same order as they appear in the genome.

1st col(column): chromosome name
2nd col: genomic coordinate of the probe
3rd col and after: probe intensity data. Each column corresponds to an array.
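
As an illustration of the layout above, the following Python sketch reads a file in the standard tilemap data format. It is not part of TileMap. The function name read_tilemap_data, the array ids IP_1 and CT_1, and the intensity values are made up for the example.

```python
import csv
import io


def read_tilemap_data(handle):
    """Read a tab-delimited file: header row, then one probe per row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)      # 'chromosome', 'position', array ids
    array_ids = header[2:]
    probes = []
    for row in reader:
        if not row:
            continue           # skip empty lines
        chrom = row[0]
        position = int(row[1])
        intensities = [float(x) for x in row[2:]]
        probes.append((chrom, position, intensities))
    return array_ids, probes


# A tiny made-up file with two arrays and two probes.
sample = (
    "chromosome\tposition\tIP_1\tCT_1\n"
    "chr21\t14676034\t120.5\t98.0\n"
    "chr21\t14676069\t310.0\t101.2\n"
)
array_ids, probes = read_tilemap_data(io.StringIO(sample))
print(array_ids, probes)
```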



##################################
# .cmpinfo File                  #
##################################
Fill out the parameters in the file. Please do not change the format of the file; leave tags such as "[Array number] = " in their original form.

(A) In "Basic Info" section, one needs to provide general information about the tiling array experiment.

[Array number]: the number of arrays to be analyzed.
[Group number]: the number of experimental conditions.
[Group ID]: numerical group ID for individual arrays. The IDs are arranged in the same order as the arrays (columns) appear in the raw data file. IDs range from 1 to [Group number]. Negative integers can be used if one wishes to ignore a specific column in the raw data file.

For example, suppose three mouse strains "wt", "mt1" and "mt2" were profiled, each with 6 replicates, and the 18 arrays are arranged in the raw data file as:
{chromosome position wt wt wt mt1 mt1 mt1 mt2 mt2 mt2 wt wt wt mt1 mt1 mt1 mt2 mt2 mt2}
then

[Array number] = 18
[Group number] = 3
[Group ID]
1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3

If one wishes to exclude the last nine arrays from the analysis, one can set
[Array number] = 9
[Group number] = 3
[Group ID]
1 1 1 2 2 2 3 3 3 -1 -1 -1 -2 -2 -2 -3 -3 -3
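
The effect of negative group IDs can be checked with a small Python sketch. It is only an illustration of the rule described above, not TileMap code; the function name select_columns is hypothetical.

```python
def select_columns(group_ids):
    """Return (column index, group) for arrays with a positive group ID."""
    return [(i, g) for i, g in enumerate(group_ids) if g > 0]


group_ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, -1, -1, -1, -2, -2, -2, -3, -3, -3]
kept = select_columns(group_ids)

# Number of arrays that remain, and number of distinct conditions.
array_number = len(kept)
group_number = len({g for _, g in kept})
print(array_number, group_number)
```

With the IDs from the second example, nine arrays in three groups remain, which agrees with [Array number] = 9 and [Group number] = 3.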


(B) In "Patterns of Interest" section, one sets the transcriptional or protein binding patterns of interest.

For example, if one wants to select regions that show "mt1<wt<mt2", the criteria can be set as
[Comparisons]
(2<1) & (1<3)

If one wants to select regions that show "mt1<wt OR wt<mt2", the criteria can be set as
[Comparisons]
(2<1) | (1<3)


Currently, we only support the following operations:
< -- (less than)
> -- (greater than)
& -- (and)
| -- (or)
() -- (to specify operational priorities)


(C) In "Preprocessing" section, one specifies how to truncate low expression values and whether log-transformation should be taken before analysis.

For example, if one wishes to truncate all intensities that are less than 2 and set them to be 2, and wishes to take log2 transformation after the truncation, one can set
[Truncation lower bound] = 2.0
[Take log2 before calculation?] (1:yes; 0:no) = 1

If one has already done truncations and log-transformations in tilemap_importaffy or tilemap_norm, one doesn't need to do preprocessing again. In this case, one can set
[Truncation lower bound] = -1000000000000.0
[Take log2 before calculation?] (1:yes; 0:no) = 0
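
The two preprocessing settings amount to the following arithmetic, shown here as a Python sketch. The function name preprocess is hypothetical and the code only illustrates the description above.

```python
import math


def preprocess(value, lower_bound, take_log2):
    """Truncate at lower_bound, then optionally take log2."""
    truncated = max(value, lower_bound)
    return math.log2(truncated) if take_log2 else truncated


# [Truncation lower bound] = 2.0, log2 = 1
print(preprocess(0.5, 2.0, True))   # 0.5 is raised to 2, log2(2) = 1.0
print(preprocess(8.0, 2.0, True))   # log2(8) = 3.0

# Data already truncated and log-transformed: very low bound, log2 = 0
print(preprocess(-3.2, -1000000000000.0, False))  # value passes through
```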


(D) In "Simulation Setup" section, one specifies how many Monte Carlo draws should be made to estimate the posterior probability that a probe satisfies the pattern of interest. If the comparison specified in the "Patterns of Interest" section is a two-sample comparison (e.g. "1<2"), there is no need to do Monte Carlo; therefore one can set
[Monte Carlo draws for posterior prob.] = 0

If the "Patterns of Interest" involves a multiple sample comparison (e.g. "(1<2) & (1<3)"), one needs to specify a positive number, for example
[Monte Carlo draws for posterior prob.] = 1000


(E) In "Common Variance Groups", one needs to specify which experimental conditions are assumed to have common variance. The variance shrinking will be based on this setting. For example, if there are six conditions, and one assumes that for each probe, condition 1,2,3 have common within-condition variance, condition 4,5,6 have common within-condition variance, but the within-condition variance for 1,2,3 is different from within-condition variance for 4,5,6, then there are 2 common variance groups, and one can set:

variance group = 2
1 2 3
4 5 6

Each line below "variance group" tag corresponds to a common variance group, which contains all the conditions that are assumed to have common variance. The variance shrinking will be done within each variance group.

If you are not sure how to set variance group, you can assume that all conditions have the same variance. For example, you can set

variance group = 1
1 2 3 4 5 6

Setting variance group appropriately can increase the sensitivity of the analysis, especially when the number of replicate arrays is small.


(F) In "Permutation Setup", one specifies how to do permutations if one chooses to use permutation test to estimate local false discovery rate in MA.

[Number of permutations]: how many times to permute group labels.
[Exchangeable groups]: conditions that can be permuted.

For example, if one sets
[Number of permutations] = 10
[Exchangeable groups] = 2
1 2 3
4 5 6

then the labels in "Group ID" will be permuted 10 times for computing FDR. The labels are permuted according to "Exchangeable groups". Arrays labeled by 1, 2 or 3 will only be permuted with arrays labeled by 1, 2 or 3. Similarly, arrays labeled by 4, 5 or 6 will only be permuted with arrays labeled by 4, 5 or 6. No permutations will be done between "1, 2, 3" and "4, 5, 6". In other words, the FDR computed is an FDR for the null hypothesis H0: "1=2=3, 4=5=6".

If one wishes to compute an FDR for H0: "1=2=3=4=5=6", one can set
[Number of permutations] = 10
[Exchangeable groups] = 1
1 2 3 4 5 6

If one does not want to use permutation test to compute FDR, one can set
[Number of permutations] = 0
[Exchangeable groups] = 1
1 2 3 4 5 6

Depending on the data size, permutation test may require a long time. Moreover, it is hard to estimate FDR for H0: "not {1<2<3}" using permutation test.
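
The restricted permutation described in this section can be sketched in Python as follows. This is an illustration of the idea only, not the code used by tilemap; the function name permute_labels is hypothetical, and the 12-array layout is made up.

```python
import random


def permute_labels(group_ids, exchangeable_groups, rng):
    """Shuffle group labels only among arrays of the same exchangeable group."""
    permuted = list(group_ids)
    for conditions in exchangeable_groups:
        members = set(conditions)
        # Columns whose label belongs to this exchangeable group.
        positions = [i for i, g in enumerate(group_ids) if g in members]
        labels = [group_ids[i] for i in positions]
        rng.shuffle(labels)
        for i, label in zip(positions, labels):
            permuted[i] = label
    return permuted


# Six conditions with two replicates each (made-up layout).
group_ids = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
exchangeable = [[1, 2, 3], [4, 5, 6]]
rng = random.Random(0)
permuted = permute_labels(group_ids, exchangeable, rng)
print(permuted)
```

Labels 1, 2 and 3 stay within the first six columns and labels 4, 5 and 6 stay within the last six, so no exchange happens between the two exchangeable groups.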


##################################
# importaffy_parameter_file      #
##################################
[Working directory]
The directory that contains *.CEL and *.BPMAP files. All the results generated by TileMap will be exported to this directory.

[BPMAP file]
The name of the *.BPMAP file. This file should be placed in the working directory. It will be used to sort the probes according to their genomic location and to generate a local repeat filter.

[Export file]
Please specify a file to save the converted data. The raw *.CEL data will be exported into [working directory]\[export file] in the standard tilemap data format.


[Array number]
Number of arrays.

[Arrays]
Each line below [Arrays] represents an array. Each line contains two columns separated by a tab: the first column gives the name of the *.CEL file, and the second column gives the name of the array (provided by users to specify e.g. experimental conditions). For example:

IP_5_3A.CEL Jurkat_anti-cMyc_A_1_1
IP_5_4A.CEL Jurkat_anti-cMyc_A_1_2
IP_5_5A.CEL Jurkat_anti-cMyc_A_1_3
IP_1_3A.CEL Jurkat_anti-GST_A_1_1
IP_1_4A.CEL Jurkat_anti-GST_A_1_2
IP_1_5A.CEL Jurkat_anti-GST_A_1_3

The number of *.CEL files should match the [Array number].

[Apply normalization before computing intensity]
Whether or not you want to do normalization before computing probe intensities.

[Truncation lower bound before normalization]
If you choose to do normalization, you need to specify how to truncate low expression values. All values < [truncation lower bound] will be set to [truncation lower bound] before normalization.

[Take log2 transformation before normalization]
Whether or not you wish to take log2 transformation before normalization. If you choose yes, the truncated values will be log-transformed, and the normalization will be applied to the transformed values. If you choose no, the normalization will be applied to the un-log-transformed values, and you can choose to do log-transformation later.

[How to compute intensity]
You can choose to use normalized PM values as the probe intensity; or you can choose to use PM-MM as the intensity.

[Truncation lower bound after intensity computation]
How to truncate low intensities after computing PM only or PM-MM intensities. All intensities < [truncation lower bound] will be set to [truncation lower bound]. If you have already taken log-transformation before, you may need to set a small number here such as -10000000000.0.

[Take log2 transformation after intensity computation]
Whether or not you wish to take log2 transformation after you get intensities. If you have already carried out log-transformation before normalization, you should choose "no" here.



##################################
# norm_parameter_file            #
##################################
[Working directory]
The directory that contains the raw data file. All the results generated by TileMap will be exported to this directory.

[Raw Data file]
The name of the raw data file. It should be placed in the working directory and should be in standard tilemap data format.

[Export file]
The name of the file where normalized data will be saved. This file will be generated in the working directory.

[Array number]
Number of arrays.

[Truncation lower bound before normalization]
You need to specify how to truncate low expression values. All values < [truncation lower bound] will be set to [truncation lower bound] before normalization.

[Take log2 transformation before normalization]
Whether or not you wish to take log2 transformation before normalization. If you choose yes, the truncated values will be log-transformed, and the normalization will be applied to the transformed values. If you choose no, the normalization will be applied to the un-log-transformed values, and you can choose to do log-transformation later in tilemap.
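
For readers unfamiliar with quantile normalization (Bolstad et al., 2003), the general idea can be sketched in Python as below. This is a textbook-style illustration, not the implementation in tilemap_norm. The function name quantile_normalize is hypothetical, and ties are resolved simply by sort order, which is an assumption of this sketch.

```python
def quantile_normalize(columns):
    """columns: list of equal-length lists, one list of intensities per array."""
    n_arrays = len(columns)
    n_probes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the r-th smallest value across all arrays.
    rank_means = [
        sum(col[r] for col in sorted_cols) / n_arrays for r in range(n_probes)
    ]
    normalized = []
    for col in columns:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
print(normalized)
```

After normalization every array has the same set of values (here 1.5, 3.5 and 5.5), assigned according to the rank of each probe within its own array.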



##################################
# tilemap_parameter_file         #
##################################

O.1-[Working directory]
The directory that contains raw data files. All the results generated by TileMap will be exported to this directory.

O.2-[Project Title]
A title of the project. This title will be used to generate names of output files.

I.1-[Compute probe level test-statistics?]
Specify whether or not tilemap should compute the probe level test-statistics. If one only has normalized raw data, one should choose "yes". If one has already pre-computed probe level test-statistics and only wants to apply HMM or MA to do region level inference, one can choose "no".

I.2-[Raw data file]:
If you choose "yes" in I.1, you need to prepare two files in the working directory:
(i) A raw data file which contains the normalized probe intensities. This file should be in standard tilemap data format, and should be placed in the working directory. Give its name in I.2.
(ii) You also need to prepare a *.cmpinfo file named {Project Title}.cmpinfo in the working directory, which specifies the hybridization pattern you wish to select. However, you DON'T need to provide its name in I.2.
If you choose "no" in I.1, please prepare a file that contains precomputed probe level test-statistics. This file should be in *_pb.sum format (see "Output File Format") and should be placed in the working directory. Provide its name in I.2. It will be used as the input for HMM and MA. In this case, the probe level computation embedded in TileMap will be skipped.
NOTICE: in tilemap, small values of probe level test-statistics correspond to patterns of interest. When you provide your own probe level test-statistics, you may need to transform them somehow to follow this convention.

I.3-[Range of test-statistics]
Specify the range of probe level test-statistics.
If you choose "yes" in I.1, you can set I.3 to 0 (default). Tilemap will compute probe level test statistics and determine the range automatically. For two sample comparisons, the probe level test-statistic is an improved t-statistic, the range will be (-inf, +inf). For multiple sample comparisons, the probe level test-statistic is a posterior probability, the range will be [0,1].
If you choose "no" in I.1 and provide your own probe level test-statistics, then you should set I.3 either to 1 {[0,1]} or 2 {(-inf, +inf)} depending on whether the statistics you provided in I.2 fall within [0,1] (e.g. posterior probability) or (-inf, +inf) (e.g. t-statistics).
[0,1] statistics will be transformed by log[t/(1-t)] before applying MA, and the MA statistics will be transformed back by exp(u)/[exp(u)+1] before applying UMS to estimate local FDR. (-inf, +inf) statistics will be transformed by exp(t)/[exp(t)+1] before applying HMM.

I.4-[Zero cut]
To avoid logit(0), please specify a zero cut in I.4. [0,1] test-statistics will be set to max(zero_cut/2, min(t, 1-zero_cut/2)) before taking logit transformation. E.g. If [Monte Carlo draws for posterior prob.] = 10000 in *.cmpinfo file, you can set zero_cut = 0.0001.
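
The zero cut and the transformations mentioned in I.3 can be written out as a short Python sketch. The function names clamp, logit and inverse_logit are hypothetical; the formulas are the ones quoted above.

```python
import math


def clamp(t, zero_cut):
    """max(zero_cut/2, min(t, 1 - zero_cut/2))"""
    return max(zero_cut / 2, min(t, 1 - zero_cut / 2))


def logit(t):
    """log[t/(1-t)]"""
    return math.log(t / (1 - t))


def inverse_logit(u):
    """exp(u)/[exp(u)+1]"""
    return math.exp(u) / (math.exp(u) + 1)


zero_cut = 0.0001
for t in (0.0, 0.3, 1.0):
    u = logit(clamp(t, zero_cut))
    print(t, u, inverse_logit(u))
```

Without the clamp, t = 0 or t = 1 would lead to log(0) or a division by zero.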

II.1-[Apply local repeat filter?]
Whether or not you want to mask local repeats. Some probes occur more than once in a region, such local repeats may result in noise due to cross-hybridizations. You may wish to exclude these probes from analysis. If so, you need to apply the filter. If the data you provided have already been repeat-masked, you can choose "no" to skip this step.

II.2-[*.refmask file]
If you choose "yes" in II.1, please prepare a *.refmask file (see "Output File Format") which provides non-redundant probes and counts how many times each probe occurs in a local region. You need to provide its name in II.2. This file will be used as a reference for masking local repeats.
Hint: using tilemap_importaffy to load affymetrix data from *.CEL and *.BPMAP will automatically create a *.refmask file.
If you choose "no" in II.1, set II.2 to NULL. Local repeat filtering will be skipped.

III.1-[Combine neighboring probes?]
Whether or not you want to apply HMM or MA to do region inference. If you choose no, tilemap will skip HMM and MA. If you choose yes, tilemap will combine neighboring probes to infer whether a region is of interest or not.

III.2-[Method to combine neighboring probes]
Choose which method should be used to do region summary.
If you choose "Yes" in III.1 and "HMM" in III.2, please fill out Step IV and leave Step V to its default values.
If you choose "Yes" in III.1 and "MA" in III.2, please fill out Step V and leave Step IV to its default values.
If you choose "No" in III.1, you can set III.2 arbitrarily to 0 or 1 and leave both Step IV and Step V to their default values. Region summary will then be skipped.

IV.1-[Posterior probability >]
Posterior probability cutoff to call regions of interest in HMM.

IV.2-[Maximal gap allowed]
d0 in HMM. If the distance between the neighboring probes i and i+1, d(i,i+1), is no greater than d0, tilemap will use the HMM transition probability matrix to compute likelihood. If d(i,i+1) > d0, tilemap will restart a new HMM from i+1. (refer to Ji&Wong, 2005 for details).
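
The role of d0 can be illustrated by splitting probe coordinates into runs, where each run would be handled by its own HMM. The Python sketch below only shows the splitting rule; it is not TileMap code. The function name split_by_gap is hypothetical, and the coordinates are assumed to be sorted and to lie on one chromosome.

```python
def split_by_gap(positions, max_gap):
    """Start a new segment whenever d(i, i+1) > max_gap."""
    segments = []
    current = []
    for pos in positions:
        if current and pos - current[-1] > max_gap:
            segments.append(current)
            current = []
        current.append(pos)
    if current:
        segments.append(current)
    return segments


positions = [100, 135, 170, 2000, 2035]
segments = split_by_gap(positions, 1000)
print(segments)
```

Here the distance between 170 and 2000 exceeds d0 = 1000, so the probes fall into two segments.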

IV.3-[Method to set HMM parameters]
You can choose to use UMS embedded in tilemap to get HMM parameters or to provide your own HMM parameters.
If you choose "UMS" in IV.3, please set UMS parameters in IV.4 - IV.10. Otherwise leave them to be default values.
If you choose "Set by users" in IV.3, please provide your own transition, emission probability matrices in _transp.txt and _emissp.txt format. You should place these two files in the working directory and provide their names in IV.11 and IV.12. Otherwise set IV.11 and IV.12 to be NULL.

IV.4-[Provide your own selection statistics?]
If you choose to use UMS to get HMM parameters, you have the option to provide your own selection statistics. If you do not provide your own selection statistics, tilemap will use the probe level test-statistics as the default selection statistics.

IV.5-[If Yes to IV.4, selection statistics file]
If you choose to provide your own selection statistics, please prepare the statistics in a *_pb.sum file and provide its name in IV.5. The file should be in working directory.

IV.6-[G0 Selection Criteria, p%]
Set t(p) in UMS. If a probe has a selection statistic > t(p), its downstream probe will be used to construct g0.

IV.7-[G1 Selection Criteria, q%]
Set t(q) in UMS. If a probe has a selection statistic <= t(q), its downstream probe will be used to construct g1.

IV.8-[Selection Offset]
If probe i has a selection statistic > t(p), probe (i+selection_offset) will be used to construct g0. Similar for g1.

IV.9-[Grid Size]
How many intervals should [0,1] be divided into. For example, if grid size = 1000, [0,1] will be divided into 0.001, 0.002, ..., 1.000. g0 and g1 will be estimated by empirical distributions on this grid. The choice of grid size should consider the number of probes available. On average, it would be better to have a few hundred probes in each interval.

IV.10-[Expected hybridization length]
The number of probes contained in a typical hybridization region. For example, in ChIP-Chip experiment, if IP fragment length = 1000bp, probe density= 1 probe / 35 bp. Then one would expect to observe 28 probes on average in a binding region, and one can set expected hybridization length = 28.
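
The arithmetic in this example can be reproduced as follows. Rounding down to a whole number of probes is an assumption of this sketch, chosen because it matches the value 28 given above; the function name is hypothetical.

```python
def expected_hybridization_length(fragment_length_bp, probe_spacing_bp):
    """Whole number of probes that fit in one fragment."""
    return fragment_length_bp // probe_spacing_bp


length = expected_hybridization_length(1000, 35)
print(length)  # 28
```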

IV.11-[Path to transition probability matrix]
If you choose "Set by users" in IV.3, please prepare a transition probability matrix in working directory and in _transp.txt format (see "Output File Format"). Provide its name here.

IV.12-[Path to emission probability matrix]
If you choose "Set by users" in IV.3, please prepare an emission probability matrix in working directory and in _emissp.txt format (see "Output File Format"). Provide its name here.

V.1-[Local FDR <]
Local false discovery rate cutoff to call regions of interest in MA.

V.2-[Maximal gap allowed]
Two significant probes, if their distance <= [maximal gap allowed], will be treated as a single region. For example, in ChIP-Chip experiment, if IP fragment length = 1000bp, one can set maximal gap allowed = 500, half of the IP fragment length.

V.3-[W]
The half window size. The moving average will be taken over a 2*W+1 window, i.e. each window will contain 2*W+1 probes.
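
A moving average over a 2*W+1 window can be sketched in Python as below. This only illustrates the window definition and is not TileMap code. How tilemap treats probes near the ends of a sequence or across large gaps is not described here, so shrinking the window at the ends is an assumption of this sketch, and the function name moving_average is hypothetical.

```python
def moving_average(stats, w):
    """Average each probe with up to w neighbors on each side."""
    n = len(stats)
    averaged = []
    for i in range(n):
        lo = max(0, i - w)          # window is cut short at the left end
        hi = min(n, i + w + 1)      # and at the right end
        window = stats[lo:hi]
        averaged.append(sum(window) / len(window))
    return averaged


ma = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], 1)
print(ma)
```

With W = 1 each interior probe is averaged with its two neighbors, i.e. over 2*1+1 = 3 probes.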

V.4-[Method to compute local FDR]
You can choose to use UMS or permutation test to compute local FDR.
If you choose "UMS" in V.4, please set UMS parameters in V.5 - V.10. If you choose "Permutation Test" in V.4, please set grid size in V.10, and then go back to {Project Title}.cmpinfo file and fill out its "Permutation Setup" section. There you will set the way to do permutations and number of permutations you want to do.
Hint: depending on the size of the data, permutation test could be very slow.

V.5-[Provide your own selection statistics?]
If you choose to use UMS to get local FDR, you have the option to provide your own selection statistics in UMS. If you choose not to provide your own selection statistics, tilemap will use probe level test-statistics as the selection statistics.

V.6-[If Yes to IV.4, selection statistics file]
If you choose to provide your own selection statistics, please prepare the statistics in a *_pb.sum file and provide its name here. The file should be in working directory.

V.7-[G0 Selection Criteria, p%]
Set t(p) in UMS. If a probe has a selection statistic > t(p), its downstream probe will be used to construct g0.

V.8-[G1 Selection Criteria, q%]
Set t(q) in UMS. If a probe has a selection statistic <= t(q), its downstream probe will be used to construct g1.

V.9-[Selection Offset]
If probe i has a selection statistic > t(p), probe (i+selection_offset) will be used to construct g0. Similar for g1. Usually, selection offset = W+1 in MA.

V.10-[Grid Size]
How many intervals should [0,1] be divided into. For example, if grid size = 1000, [0,1] will be divided into 0.001, 0.002, ..., 1.000. g0 and g1 will be estimated by empirical distributions on this grid. The choice of grid size should consider the number of probes available. On average, it would be better to have a few hundred probes in each interval.



##################################
# extract_parameter_file         #
##################################

[Working directory]
The directory that contains raw data file and all the tilemap-generated files. All the new files generated by this command will be exported to this directory.

[Project Title]
The title of the project. The program will automatically extract data from [Project Title]_f_pb.sum, [Project Title]_hmm.sum, [Project Title]_ma.sum and the raw data file if available.

[Probe Level Summary]
You are required to provide a probe level summary file in _pb.sum format. The program will decide which probes to retrieve based on this file.

[Raw Data]
You can provide the name of the raw data file. This file should be in working directory. If provided, raw data will be extracted. If you set NULL here, the program will only get test-statistics for target probes. No raw data will be extracted.

[Regions]
Each line below [Regions] represents a region you wish to extract. Each line contains at least three columns, tab-delimited.

col1: chromosome name
col2: start coordinate in the chromosome
col3: end coordinate in the chromosome
other columns: defined by users themselves.

e.g.
chr21 14676034 14678449 target 651 +
chr21 17421450 17423637 target 651 +
chr21 18111757 18113819 target 651 +
chr21 26027537 26031330 target 651 +

For each region, the program will generate a file named [Project title]_[chromosome]_[start]_[end].txt in the working directory. The file will contain all the probes in the specified region, their coordinates and summary statistics.
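
The naming rule for exported files can be checked with a short Python sketch. The function name extract_filename is hypothetical, and the project title "cMycA_tile" is an assumption inferred from the file name used in the visualization example.

```python
def extract_filename(project_title, region_line):
    """Build [Project title]_[chromosome]_[start]_[end].txt from a region line."""
    chrom, start, end = region_line.split()[:3]  # extra columns are ignored
    return f"{project_title}_{chrom}_{start}_{end}.txt"


region = "chr21\t14676034\t14678449\ttarget\t651\t+"
name = extract_filename("cMycA_tile", region)
print(name)
```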


/* ------------------------ */
/* Output File Format       */
/* ------------------------ */

##################################
# *.refmask                      #
##################################
This is a tab-delimited file used for sorting probes based on their genomic coordinates and for filtering local repeats.

col1: chromosome name
col2: coordinate in the chromosome
col3: how many probes in the array are mapped (without any mismatches) to the position specified by col1 and col2.
col4: within a local window (2000 bp as the tilemap default) centered at the position specified by col1 and col2, how many genomic loci have the same probe sequence as the sequence specified by col1-col2. If the number is >1, the probe in question will be treated as a local repeat and will be filtered out later.
col5: probe sequence
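
The col4 rule can be illustrated with a small Python sketch that flags local repeats in *.refmask lines. It is not TileMap code; the function name is_local_repeat is hypothetical, and the two lines (including the placeholder sequences) are made up.

```python
def is_local_repeat(refmask_line):
    """A probe is a local repeat if col4 (local copy count) is > 1."""
    fields = refmask_line.rstrip("\n").split("\t")
    return int(fields[3]) > 1


lines = [
    "chr21\t14676034\t1\t1\tACGTACGTACGTACGTACGTACGTA",
    "chr21\t14676069\t2\t3\tTTTTACGTTTTTACGTTTTTACGTT",
]
flags = [is_local_repeat(line) for line in lines]
print(flags)
```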

##################################
# *_pb.sum                       #
##################################
This is a tab-delimited file to record probe level test-statistics.

col1: chromosome name
col2: coordinates in the chromosome
col3: probe-level test-statistics. The statistics are transformed such that the smaller the statistics, the more significant.


##################################
# *_hmm.sum                      #
##################################
This is a tab-delimited file to record posterior probability generated by HMM.

col1: chromosome name
col2: coordinates in the chromosome
col3: posterior probability that a probe is in a region of interest. The larger the posterior probability, the more significant a probe is.


##################################
# *_ma.sum                       #
##################################
This is a tab-delimited file to record MA summaries.

col1: chromosome name
col2: coordinates in the chromosome
col3: Moving average (MA) statistics
col4: Local false discovery rate that a probe is in a region of interest. The smaller the local FDR, the better.

##################################
# *.bed                          #
##################################
This is a UCSC *.bed file to report significant regions. Regions are sorted according to their genomic locations.

col1: chromosome name
col2: region start
col3: region end
col4: no meaning
col5: 1000*[hmm posterior probability] or 1000*(1-lfdr of MA)
col6: always +
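
The col5 score can be illustrated as follows. This Python sketch is not TileMap code: the function name bed_line is hypothetical, the text "region" written to col4 is a placeholder (col4 has no meaning), and rounding the score to the nearest integer is an assumption.

```python
def bed_line(chrom, start, end, probability):
    """probability: HMM posterior probability, or 1 - local FDR for MA."""
    score = int(round(1000 * probability))
    return "\t".join([chrom, str(start), str(end), "region", str(score), "+"])


hmm_line = bed_line("chr21", 14676034, 14678449, 0.987)
ma_line = bed_line("chr21", 17421450, 17423637, 1 - 0.05)  # local FDR = 0.05
print(hmm_line)
print(ma_line)
```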


##################################
# *.reg                          #
##################################
This is a tab-delimited file to report significant regions. Regions are ranked according to their significance levels.

col1: chromosome name
col2: start coordinate
col3: end coordinate
col4: the line # of the starting probe (to help locate the probe in *_pb.sum, *_hmm.sum and *_ma.sum files)
col5: the line # of the ending probe
col6: maximum posterior probability or minimum local FDR of all the probes in the region
col7: mean posterior probability or mean local FDR of the regions. If the region is formed by merging two discrete regions that are separated by less than [maximal gap], then the mean is obtained as follows: first, compute two means for the two discrete regions separately; then, take the minimum of the two means and use it as the mean here.


##################################
# _transp.txt                    #
##################################
HMM transition probability matrix, in the format:

1-a0 a0
a1 1-a1



##################################
# _emissp.txt                    #
##################################
HMM emission probability matrix, in the format:

interval(1) interval(2) interval(3) ... interval(n)
f0(1) f0(2) f0(3) ... f0(n)
f1(1) f1(2) f1(3) ... f1(n)

For a probe level test-statistic t, if interval(i-1)<t<=interval(i), then f0(t)=f0(i) (the likelihood for H=0) and f1(t)=f1(i) (the likelihood for H=1).

NOTICE: interval(i) should equally divide [0,1], i.e. interval(i+1)-interval(i) = interval(i)-interval(i-1). interval(n) is always 1.0. Although not explicitly defined in the file, interval(0)=0.
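
The lookup rule interval(i-1)<t<=interval(i) can be sketched in Python with a binary search. This is an illustration, not TileMap code. The function name emission is hypothetical, the four-interval matrix is made up, and assigning t=0 to the first interval is an assumption of this sketch.

```python
from bisect import bisect_left


def emission(t, intervals, f0, f1):
    """Return (f0(t), f1(t)) for the interval whose upper bound first reaches t."""
    # bisect_left finds the first index i with intervals[i] >= t,
    # i.e. intervals[i-1] < t <= intervals[i].
    i = bisect_left(intervals, t)
    i = min(i, len(intervals) - 1)  # guard against t slightly above 1.0
    return f0[i], f1[i]


# Made-up emission matrix with n = 4 equal intervals on [0,1].
intervals = [0.25, 0.5, 0.75, 1.0]
f0 = [0.1, 0.2, 0.3, 0.4]
f1 = [0.4, 0.3, 0.2, 0.1]

print(emission(0.5, intervals, f0, f1))   # t on a boundary belongs to the lower interval
print(emission(0.6, intervals, f0, f1))
```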


##################################
# file exported by               #
# tilemap_extract                #
##################################
This is a tab-delimited file.

col1: probe coordinate in chromosome
col2: probe level test-statistics
col3: HMM posterior probability
col4: MA statistics
col5: local FDR for MA
col6 and after: raw data

 

 

 

REFERENCES:

 

Bolstad, B.M., Irizarry, R.A., Astrand, M. and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics, 19(2), 185-193.

 

Ji, H. and Wong, W.H. (2005) TileMap: create chromosomal map of tiling array hybridizations. (Submitted).