TileMap Readme
/* ------------------------ */
/* README */
/* ------------------------ */
TileMap is a tool designed for tiling array analysis. It can be used to identify
genomic loci that show transcriptional activities and transcription factor
binding patterns of interest.
I. Introduction
II. Usage
II.1 General
II.2 tilemap_importaffy
II.3 tilemap_norm
II.4 tilemap
II.5 tilemap_extract
II.6 tilemap_plot for visualization
III. Examples
III.1 Example 1
III.2 Example 2
III.3 Example 3
III.4 Example 4
IV. Input File Format
IV.1 Raw Data File (Standard TileMap Data Format)
IV.2 *.cmpinfo File
IV.3 import_affy_parameter file
IV.4 norm_parameter file
IV.5 tilemap_parameter file
IV.6 extract_parameter file
V. Output File Format
V.1 *.refmask
V.2 *_pb.sum
V.3 *_hmm.sum
V.4 *_ma.sum
V.5 *.bed
V.6 *.reg
V.7 *_transp.txt
V.8 *_emissp.txt
V.9 file exported by tilemap_extract
/* ------------------------ */
/* Introduction */
/* ------------------------ */
TileMap consists of the
following parts:
(1) tilemap_importaffy
(2) tilemap_norm
(3) tilemap
(4) tilemap_extract
(5) sample R/Matlab code tilemap_plot.R (or .m) for visualizing tilemap results
tilemap_importaffy can be used to import data from Affymetrix arrays and
preprocess the data (normalization and creation of local repeat filters). It converts
*.CEL (version 3, ASCII) and *.BPMAP files into the standard tilemap data format
(see the file format sections).
tilemap_norm performs quantile normalization (Bolstad et al., 2003). Users who want
to analyze non-Affymetrix data can provide their own data in the standard tilemap
data format and use tilemap_norm to normalize it. Users do not need
tilemap_norm if tilemap_importaffy was used to import the raw data, because
tilemap_importaffy already provides normalization options.
tilemap is the central part of TileMap. It (i) computes probe-level
test-statistics according to the transcriptional or protein binding patterns
specified by users; (ii) filters local repeats; and (iii) infers whether a region is
of interest by applying an HMM or a Moving Window Average (MA). The output of tilemap
includes *.sum files, which provide summary statistics for each probe; a *.bed file,
which reports regions of interest; and a *.reg file, which sorts the reported regions
from high to low significance. The *.bed file can be uploaded directly to the UCSC
Genome Browser to visualize the reported regions.
tilemap_extract can be used to retrieve probes and their summary statistics in
user-specified regions. The retrieved data are saved in tab-delimited ASCII
files, which can be loaded into R, Matlab, etc. for visualization.
Since TileMap does not provide an integrated GUI at the current stage, we provide
tilemap_extract and R/Matlab sample code to facilitate visual checking of
the data and tilemap results. We plan to develop an integrated visualization
system in the future that will integrate TileMap with downstream analysis programs.
/* ------------------------ */
/* Usage */
/* ------------------------ */
#########################
# General #
#########################
(1) Windows Users
From the Windows "Start" menu, click "Run", type "cmd", and press Enter. A window will
open for you to input commands. Change to the directory where TileMap is installed
using the "cd [path]" command, and run TileMap as in the examples below.
(2) Unix Users
Change to the directory where TileMap is installed, and run TileMap as in the examples
below.
(3) File formats are defined in the "Input File Format" and "Output File Format"
sections.
#########################
# tilemap_importaffy #
#########################
[Usage]
> tilemap_importaffy [importaffy_parameter_file]
One also needs to have
(a) raw *.CEL files (version 3, ASCII), and
(b) a *.BPMAP file (can be downloaded from the Affymetrix website)
#########################
# tilemap_norm #
#########################
[Usage]
> tilemap_norm [norm_parameter_file]
One also needs to have
(a) a raw data file in standard tilemap data format.
#########################
# tilemap #
#########################
[Usage]
> tilemap [tilemap_parameter_file]
One also needs to have
(a) a normalized data file in standard tilemap data format, and
(b) a *.cmpinfo file to specify which pattern one wishes to select.
#########################
# tilemap_extract #
#########################
[Usage]
> tilemap_extract [extract_parameter_file]
One needs to have
(a) summary statistics computed by tilemap.
########################
# visualization #
########################
[Usage]
(1) In Matlab:
> tilemap_plot('[file name]')
e.g.
> tilemap_plot('cMycA_tile_chr21_14676034_14678449.txt')
(2) In R:
First, in tilemap_plot.R, find the line that starts with "datapath <- " and edit it to
provide the name of the file that contains the data for plotting, e.g.
datapath <- "cMycA_tile_chr21_14677034_14677449.txt"
Then run tilemap_plot.R.
(3) Users can modify the Matlab and R code to meet their own needs.
/* ------------------------ */
/* Examples */
/* ------------------------ */
Below are several examples of using TileMap. The argument files are mainly
in "[tag] = [value]" format. Details about the file formats can be found in the
"Input File Format" and "Output File Format" sections. Users need to edit the
parameter files before using TileMap. Only the [value] part needs to be edited.
Please do not change the tags; otherwise the program may misinterpret the
parameters.
To help beginning users, detailed instructions on how to set parameters are
given in the sample argument files in the examples. For experienced
users, succinct versions of these files are also provided:
Succinct files:
sample tilemap_importaffy parameter file
sample tilemap_norm parameter file
sample tilemap parameter file
sample *.cmpinfo file
sample tilemap_extract parameter file
Both the succinct files and the files with detailed instructions can be used as
input to TileMap. TileMap automatically ignores lines starting with '#' in
parameter files.
#########################
# Example 1 #
#########################
The goal is to analyze Affymetrix tiling arrays. One needs to load raw data from
*.CEL and *.BPMAP files, normalize the data, apply TileMap (HMM), and
visualize a specific region.
/* step 1: load *.CEL and *.BPMAP files, normalize, prepare local repeat filter */
> tilemap_importaffy sample1_importaffy_arg.txt
/* step 2: specify transcriptional or protein binding patterns of interest in the *.cmpinfo file */
refer to sample1.cmpinfo
/* step 3: tilemap main procedure */
> tilemap sample1_tilemap_arg.txt
/* step 4: export data for visualization */
> tilemap_extract sample1_extract_arg.txt
/* step 5: visualization */
refer to the sample R/Matlab code, and run tilemap_plot in Matlab or R.
#########################
# Example 2 #
#########################
Users have their own non-Affymetrix data, organized in the
standard tilemap data format. They want to normalize the data, apply TileMap (MA)
to call regions of interest, and use UMS to estimate the local false discovery rate.
/* step 1: normalization */
> tilemap_norm sample2_norm_arg.txt
/* step 2: prepare a *.cmpinfo file to specify patterns of interest and a *.refmask file to define local repeat filters */
refer to sample2.cmpinfo and sample2.refmask
/* step 3: tilemap */
> tilemap sample2_tilemap_arg.txt
#########################
# Example 3 #
#########################
The probe-level test-statistics have been calculated in Example 2. Users want to apply
HMM-based TileMap to the existing probe-level test-statistics and wish to
provide their own selection statistics for UMS. There is no need to do local
repeat filtering.
> tilemap sample3_tilemap_arg.txt
#########################
# Example 4 #
#########################
The normalized data are available. Users want to apply MA-based TileMap and
estimate the local false discovery rate by permutation test. There is no need to do
local repeat filtering.
> tilemap sample4_tilemap_arg.txt
Users also need to prepare sample4.cmpinfo and specify there how to do the
permutations.
/* ------------------------ */
/* Input File Format */
/* ------------------------ */
##################################
# Raw Data File #
# (Standard TileMap Data Format) #
##################################
This is a tab-delimited file. Raw tiling array data are organized in the
following format.
1st row: 'chromosome', 'position', array ids
2nd row and after: each row corresponds to a probe; probes should be arranged in
the same order as they appear in the genome.
1st col(column): chromosome name
2nd col: genomic coordinate of the probe
3rd col and after: probe intensity data. Each column corresponds to an array.
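A minimal Python sketch of reading a file in this layout is given below. It only assumes the tab-delimited structure described above; the function name read_tilemap_data and the sample values are illustrative and are not part of TileMap.

```python
import io

def read_tilemap_data(handle):
    """Parse a tab-delimited file in the standard TileMap data format.

    Returns (array_ids, probes), where each probe is a tuple
    (chromosome, position, [one intensity per array]).
    """
    header = handle.readline().rstrip("\n").split("\t")
    array_ids = header[2:]  # the first two header fields are 'chromosome' and 'position'
    probes = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])
        values = [float(v) for v in fields[2:]]
        if len(values) != len(array_ids):
            raise ValueError(f"expected {len(array_ids)} intensities at {chrom}:{pos}")
        probes.append((chrom, pos, values))
    return array_ids, probes

# Hypothetical two-array example; the intensities are made up for illustration.
sample = (
    "chromosome\tposition\tIP_1\tCT_1\n"
    "chr21\t14676034\t512.0\t130.5\n"
    "chr21\t14676069\t498.0\t127.0\n"
)
array_ids, probes = read_tilemap_data(io.StringIO(sample))
```

A check like this can catch rows with a missing column before the file is passed to tilemap_norm or tilemap.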
##################################
# *.cmpinfo File #
##################################
Fill out the parameters in the file. Please do not change the format of the
file; leave tags such as "[Array number] = " in their original form.
(A) In "Basic Info" section, one needs to provide general information about the
tiling array experiment.
[Array number]: the number of arrays to be analyzed.
[Group number]: the number of experimental conditions.
[Group ID]: numerical group ID for individual arrays. The IDs are arranged in
the same order as the arrays (columns) appear in the raw data file. IDs
range from 1 to [Group number]. Negative integers can be used
if one wishes to ignore a specific column in the raw data file.
For example, suppose three mouse strains "wt", "mt1" and "mt2" were profiled, each with
6 replicates, and the 18 arrays are arranged in the raw data file as:
{chromosome position wt wt wt mt1 mt1 mt1 mt2 mt2 mt2 wt wt wt mt1 mt1 mt1 mt2
mt2 mt2}
Then
[Array number] = 18
[Group number] = 3
[Group ID]
1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
If one wishes to exclude the last nine arrays from the analysis, one can set
[Array number] = 9
[Group number] = 3
[Group ID]
1 1 1 2 2 2 3 3 3 -1 -1 -1 -2 -2 -2 -3 -3 -3
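The way the [Group ID] line selects columns can be sketched in Python as follows. This is only an illustration of the rule stated above (positive IDs assign a column to a group, negative IDs mark a column as ignored); the helper name columns_by_group is hypothetical.

```python
def columns_by_group(group_ids):
    """Map each positive group ID to the 0-based data-column indices it labels.

    Columns with a negative ID are skipped, mirroring the rule that
    negative integers mark columns to be ignored.
    """
    groups = {}
    for col, gid in enumerate(group_ids):
        if gid > 0:
            groups.setdefault(gid, []).append(col)
    return groups

# The second example above: the last nine arrays are excluded.
ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, -1, -1, -1, -2, -2, -2, -3, -3, -3]
groups = columns_by_group(ids)
used_arrays = sum(len(cols) for cols in groups.values())
```

Here used_arrays equals the [Array number] given in that example.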
(B) In "Patterns of Interest" section, one sets the transcriptional or protein
binding patterns of interest.
For example, if one wants to select regions that show "mt1<wt<mt2", the criteria
can be set as
[Comparisons]
(2<1) & (1<3)
If one wants to select regions that show "mt1<wt OR wt<mt2", the criteria can be
set as
[Comparisons]
(2<1) | (1<3)
Currently, we only support the following operations:
< -- (less than)
> -- (greater than)
& -- (and)
| -- (or)
() -- (to specify operational priorities)
(C) In "Preprocessing" section, one specifies how to truncate low expression
values and whether log-transformation should be taken before analysis.
For example, if one wishes to truncate all intensities that are less than 2 by
setting them to 2, and wishes to take the log2 transformation after the truncation,
one can set
[Truncation lower bound] = 2.0
[Take log2 before calculation?] (1:yes; 0:no) = 1
If one has already done truncations and log-transformations in tilemap_importaffy or
tilemap_norm, one doesn't need to do preprocessing again. In this case, one can
set
[Truncation lower bound] = -1000000000000.0
[Take log2 before calculation?] (1:yes; 0:no) = 0
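The effect of these two settings on a list of intensities can be sketched in Python as below. The function name preprocess is illustrative; the sketch follows only the description above (truncate first, then optionally take log2) and is not TileMap's own code.

```python
import math

def preprocess(values, lower_bound, take_log2):
    """Truncate values below lower_bound, then optionally take log2."""
    truncated = [max(v, lower_bound) for v in values]
    if take_log2:
        return [math.log2(v) for v in truncated]
    return truncated

# [Truncation lower bound] = 2.0, log2 = 1
out = preprocess([0.5, 2.0, 8.0], lower_bound=2.0, take_log2=True)

# A very low bound with log2 = 0 leaves already-preprocessed data unchanged.
unchanged = preprocess([-3.0, 5.0], lower_bound=-1000000000000.0, take_log2=False)
```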
(D) In "Simulation Setup" section, one specifies how many Monte Carlo draws
should be made to estimate the posterior probability that a probe satisfies the
pattern of interest. If the comparison specified in the "Patterns of Interest"
section is a two-sample comparison (e.g. "1<2"), there is no need for Monte
Carlo sampling, so one can set
[Monte Carlo draws for posterior prob.] = 0
If the "Patterns of Interest" section involves a multiple-sample comparison (e.g. "(1<2) &
(1<3)"), one needs to specify a positive number, for example
[Monte Carlo draws for posterior prob.] = 1000
(E) In "Common Variance Groups", one needs to specify which experimental
conditions are assumed to have common variance. The variance shrinking will be
based on this setting. For example, if there are six conditions, and one
assumes that for each probe, condition 1,2,3 have common within-condition
variance, condition 4,5,6 have common within-condition variance, but the
within-condition variance for 1,2,3 is different from within-condition variance
for 4,5,6, then there are 2 common variance groups, and one can set:
variance group = 2
1 2 3
4 5 6
Each line below "variance group" tag corresponds to a common variance group,
which contains all the conditions that are assumed to have common variance. The
variance shrinking will be done within each variance group.
If you are not sure how to set variance group, you can assume that all
conditions have the same variance. For example, you can set
variance group = 1
1 2 3 4 5 6
Setting the variance groups appropriately can increase the sensitivity of the
analysis, especially when the number of replicate arrays is small.
(F) In "Permutation Setup", one specifies how to do permutations if one
chooses to use a permutation test to estimate the local false discovery rate in MA.
[Number of permutations]: how many times to permute group labels.
[Exchangeable groups]: conditions that can be permuted.
For example, if one sets
[Number of permutations] = 10
[Exchangeable groups] = 2
1 2 3
4 5 6
then the labels in "Group ID" will be permuted 10 times for computing the FDR. The
labels are permuted according to "Exchangeable groups". Arrays labeled 1, 2
or 3 will only be permuted with arrays labeled 1, 2 or 3. Similarly, arrays
labeled 4, 5 or 6 will only be permuted with arrays labeled 4, 5 or 6. No
permutations will be done between "1, 2, 3" and "4, 5, 6". In
other words, the FDR computed is an FDR for the null hypothesis H0: "1=2=3, 4=5=6".
If one wishes to compute an FDR for H0: "1=2=3=4=5=6", one can set
[Number of permutations] = 10
[Exchangeable groups] = 1
1 2 3 4 5 6
If one does not want to use permutation test to compute FDR, one can set
[Number of permutations] = 10
[Exchangeable groups] = 1
1 2 3 4 5 6
Depending on the data size, permutation test may require a long time. Moreover,
it is hard to estimate FDR for H0: "not {1<2<3}" using permutation test.
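The restricted label permutation described in this section can be sketched in Python as follows. This is an illustration under the assumption that labels are shuffled only among arrays whose group IDs belong to the same exchangeable group; the function name permute_labels is hypothetical and the sketch does not reproduce how tilemap computes the FDR itself.

```python
import random

def permute_labels(group_ids, exchangeable_groups, rng):
    """Shuffle group labels only within each exchangeable group.

    Arrays whose label is not listed in any exchangeable group
    (for example ignored arrays with negative IDs) keep their label.
    """
    permuted = list(group_ids)
    for block in exchangeable_groups:
        members = set(block)
        idx = [i for i, g in enumerate(group_ids) if g in members]
        labels = [group_ids[i] for i in idx]
        rng.shuffle(labels)
        for i, lab in zip(idx, labels):
            permuted[i] = lab
    return permuted

rng = random.Random(0)  # fixed seed so the sketch is repeatable
ids = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
perm = permute_labels(ids, [[1, 2, 3], [4, 5, 6]], rng)
```

With the two exchangeable groups above, labels 1, 2 and 3 never move onto arrays originally labeled 4, 5 or 6, which corresponds to H0: "1=2=3, 4=5=6".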
##################################
# importaffy_parameter_file #
##################################
[Working directory]
The directory that contains *.CEL and *.BPMAP files. All the results generated
by TileMap will be exported to this directory.
[BPMAP file]
The name of the *.BPMAP file. This file should be placed in the working
directory. It will be used to sort the probes according to their genomic location
and to generate a local repeat filter.
[Export file]
Please specify a file to save the converted data. The raw *.CEL data will be
exported into [working directory]\[export file] in the standard tilemap data
format.
[Array number]
Number of arrays.
[Arrays]
Each line below [Arrays] represents an array. Each line contains two
tab-separated columns: the first column gives the name of the *.CEL file, and the
second column gives the name of the array (provided by users to indicate, e.g.,
experimental conditions). For example:
IP_5_3A.CEL Jurkat_anti-cMyc_A_1_1
IP_5_4A.CEL Jurkat_anti-cMyc_A_1_2
IP_5_5A.CEL Jurkat_anti-cMyc_A_1_3
IP_1_3A.CEL Jurkat_anti-GST_A_1_1
IP_1_4A.CEL Jurkat_anti-GST_A_1_2
IP_1_5A.CEL Jurkat_anti-GST_A_1_3
The number of *.CEL files should match the [Array number].
[Apply normalization before computing intensity]
Whether or not you want to do normalization before computing probe intensities.
[Truncation lower bound before normalization]
If you choose to do normalization, you need to specify how to truncate low
expression values. All values < [truncation lower bound] will be set to
[truncation lower bound] before normalization.
[Take log2 transformation before normalization]
Whether or not you wish to take log2 transformation before normalization. If you
choose yes, the truncated values will be log-transformed, and the normalization
will be applied to the transformed values. If you choose no, the normalization
will be applied to the un-log-transformed values, and you can choose to do
log-transformation later.
[How to compute intensity]
You can choose to use normalized PM values as the probe intensity; or you can
choose to use PM-MM as the intensity.
[Truncation lower bound after intensity computation]
How to truncate low intensities after computing PM-only or PM-MM intensities.
All intensities < [truncation lower bound] will be set to
[truncation lower bound]. If you have already taken the log-transformation,
you may need to set a very small number here, such as -10000000000.0.
[Take log2 transformation after intensity computation]
Whether or not you wish to take log2 transformation after you get intensities.
If you have already carried out log-transformation before normalization, you
should choose "no" here.
##################################
# norm_parameter_file #
##################################
[Working directory]
The directory that contains the raw data file. All the results generated by
TileMap will be exported to this directory.
[Raw Data file]
The name of the raw data file. It should be placed in the working directory and
should be in standard tilemap data format.
[Export file]
The name of the file where normalized data will be saved. This file will be
generated in the working directory.
[Array number]
Number of arrays.
[Truncation lower bound before normalization]
You need to specify how to truncate low expression values. All values <
[truncation lower bound] will be set to [truncation lower bound] before
normalization.
[Take log2 transformation before normalization]
Whether or not you wish to take log2 transformation before normalization. If you
choose yes, the truncated values will be log-transformed, and the normalization
will be applied to the transformed values. If you choose no, the normalization
will be applied to the un-log-transformed values, and you can choose to do
log-transformation later in tilemap.
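For readers unfamiliar with quantile normalization (Bolstad et al., 2003), the idea can be sketched in Python as below. This is a textbook version written for illustration, not the tilemap_norm implementation: it assumes every array has the same number of probes, and ties are broken by probe order, which may differ from what tilemap_norm does.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (one per array).

    Each value is replaced by the mean, across arrays, of the values
    that share its rank.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(sc[r] for sc in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # probe indices by increasing value
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized

# Two hypothetical arrays with three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
norm = quantile_normalize(arrays)
```

After normalization every array has the same set of values, only assigned to different probes.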
##################################
# tilemap_parameter_file #
##################################
O.1-[Working directory]
The directory that contains raw data files. All the results generated by TileMap
will be exported to this directory.
O.2-[Project Title]
A title of the project. This title will be used to generate names of output
files.
I.1-[Compute probe level test-statistics?]
Specify whether or not tilemap should compute the probe level test-statistics.
If one only has normalized raw data, one should choose "yes". If one
has already pre-computed probe level test-statistics and only wants to apply HMM or MA to
do region level inference, one can choose "no".
I.2-[Raw data file]:
If you choose "yes" in I.1, you need to prepare two files in the working
directory:
(i) A raw data file which contains the normalized probe intensities. This file
should be in standard tilemap data format, and should be placed in the working
directory. Give its name in I.2.
(ii) You also need to prepare a *.cmpinfo file named {Project Title}.cmpinfo
in the working directory, which specifies the hybridization pattern you wish to
select. However, you DON'T need to provide its name in I.2.
If you choose "no" in I.1, please prepare a file that contains precomputed probe-level
test-statistics. This file should be in *_pb.sum format (see "Output File
Format") and should be placed in the working directory. Provide its name in I.2.
It will be used as the input for HMM and MA. In this case, the probe-level computation
embedded in TileMap will be skipped.
NOTICE: in tilemap, small values of the probe-level test-statistics correspond to patterns of
interest. When you provide your own probe-level test-statistics, you may need to
transform them to follow this convention.
I.3-[Range of test-statistics]
Specify the range of probe level test-statistics.
If you choose "yes" in I.1, you can set I.3 to 0 (default). Tilemap will compute
probe level test statistics and determine the range automatically. For two
sample comparisons, the probe level test-statistic is an improved t-statistic,
the range will be (-inf, +inf). For multiple sample comparisons, the probe level
test-statistic is a posterior probability, the range will be [0,1].
If you choose "no" in I.1 and provide your own probe level test-statistics,
then you should set I.3 either to 1 {[0,1]} or 2 {(-inf, +inf)} depending on
whether the statistics you provided in I.2 fall within [0,1] (e.g. posterior
probability) or (-inf, +inf) (e.g. t-statistics).
[0,1] statistics will be transformed by log[t/(1-t)] before applying MA, and the
MA statistics will be transformed back by exp(u)/[exp(u)+1] before applying UMS
to estimate local FDR. (-inf, +inf) statistics will be transformed by exp(t)/[exp(t)+1]
before applying HMM.
I.4-[Zero cut]
To avoid logit(0), please specify a zero cut in I.4. [0,1] test-statistics will
be set to max(zero_cut/2, min(t, 1-zero_cut/2)) before the logit
transformation is taken. For example, if [Monte Carlo draws for posterior prob.] = 10000
in the *.cmpinfo file, you can set zero_cut = 0.0001.
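The transformations named in I.3 and I.4 can be written out in Python as follows. The formulas are taken from the text above; the function names are illustrative and the code is a sketch rather than tilemap's own implementation.

```python
import math

def clamp_and_logit(t, zero_cut):
    """Clamp t to [zero_cut/2, 1 - zero_cut/2], then return log[t/(1-t)]."""
    clamped = max(zero_cut / 2, min(t, 1 - zero_cut / 2))
    return math.log(clamped / (1 - clamped))

def inverse_logit(u):
    """Map a value in (-inf, +inf) back to (0, 1) by exp(u)/[exp(u)+1]."""
    return math.exp(u) / (math.exp(u) + 1)

zero_cut = 0.0001
low = clamp_and_logit(0.0, zero_cut)    # finite, thanks to the zero cut
high = clamp_and_logit(1.0, zero_cut)   # finite as well
round_trip = inverse_logit(clamp_and_logit(0.3, zero_cut))
```

Without the zero cut, statistics equal to exactly 0 or 1 would produce an infinite logit.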
II.1-[Apply local repeat filter?]
Whether or not you want to mask local repeats. Some probes occur more than once
in a region, and such local repeats may introduce noise due to
cross-hybridization. You may wish to exclude these probes from the analysis. If so,
you need to apply the filter. If the data you provided have already
been repeat-masked, you can choose "no" to skip this step.
II.2-[*.refmask file]
If you choose "yes" in II.1, please prepare a *.refmask file (see "Output File
Format") which lists non-redundant probes and counts how many times each
probe occurs in a local region. You need to provide its name in II.2. This file
will be used as a reference for masking local repeats.
Hint: using tilemap_importaffy to load Affymetrix data from *.CEL and *.BPMAP
files will automatically create a *.refmask file.
If you choose "no" in II.1, set II.2 to NULL. Local repeat filtering will be
skipped.
III.1-[Combine neighboring probes?]
Whether or not you want to apply HMM or MA to do region inference. If you choose
no, tilemap will skip HMM and MA. If you choose yes, tilemap will combine
neighboring probes to infer whether a region is of interest or not.
III.2-[Method to combine neighboring probes]
Choose which method should be used to do region summary.
If you choose "Yes" in III.1 and "HMM" in III.2, please fill out Step IV and
leave Step V to its default values.
If you choose "Yes" in III.1 and "MA" in III.2, please fill out Step V and leave
Step IV to its default values.
If you choose "No" in III.1, you can set III.2 arbitrarily to 0 or 1 and leave
both Step IV and Step V to their default values. Region summary will then be skipped.
IV.1-[Posterior probability >]
Posterior probability cutoff to call regions of interest in HMM.
IV.2-[Maximal gap allowed]
d0 in HMM. If the distance between the neighboring probes i and i+1, d(i,i+1),
is no greater than d0, tilemap will use the HMM transition probability matrix to
compute likelihood. If d(i,i+1) > d0, tilemap will restart a new HMM from
i+1. (refer to Ji&Wong, 2005 for details).
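The role of d0 can be illustrated with a short Python sketch that splits a sorted list of probe coordinates into separate chains wherever the gap exceeds d0. It assumes all probes are on one chromosome and sorted by position; the function name split_by_gap is hypothetical, and the HMM computation itself is not shown.

```python
def split_by_gap(positions, max_gap):
    """Split sorted probe positions into chains; start a new chain when
    the distance to the previous probe is greater than max_gap (d0)."""
    chains = []
    current = []
    for pos in positions:
        if current and pos - current[-1] > max_gap:
            chains.append(current)
            current = []
        current.append(pos)
    if current:
        chains.append(current)
    return chains

# Hypothetical coordinates: the jump from 170 to 1500 exceeds d0 = 1000.
chains = split_by_gap([100, 135, 170, 1500, 1535], max_gap=1000)
```

Each resulting chain would be handled as its own HMM under the rule described above.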
IV.3-[Method to set HMM parameters]
You can choose to use UMS embedded in tilemap to get HMM parameters or to
provide your own HMM parameters.
If you choose "UMS" in IV.3, please set UMS parameters in IV.4 - IV.10.
Otherwise leave them at their default values.
If you choose "Set by users" in IV.3, please provide your own transition and
emission probability matrices in _transp.txt and _emissp.txt format. You should
place these two files in the working directory and provide their names in IV.11
and IV.12. Otherwise set IV.11 and IV.12 to NULL.
IV.4-[Provide your own selection statistics?]
If you choose to use UMS to get HMM parameters, you have the option to provide
your own selection statistics. If you do not provide your own selection
statistics, tilemap will use the probe level test-statistics as the default selection
statistics.
IV.5-[If Yes to IV.4, selection statistics file]
If you choose to provide your own selection statistics, please prepare the
statistics in a *_pb.sum file and provide its name in IV.5. The file should be
in working directory.
IV.6-[G0 Selection Criteria, p%]
Set t(p) in UMS. If a probe has a selection statistic > t(p), its downstream
probe will be used to construct g0.
IV.7-[G1 Selection Criteria, q%]
Set t(q) in UMS. If a probe has a selection statistic <= t(q), its downstream
probe will be used to construct g1.
IV.8-[Selection Offset]
If probe i has a selection statistic > t(p), probe (i+selection_offset) will be
used to construct g0. Similar for g1.
IV.9-[Grid Size]
How many intervals should [0,1] be divided into. For example, if grid size =
1000, [0,1] will be divided into 0.001, 0.002, ..., 1.000. g0 and g1 will be
estimated by empirical distributions on this grid. The choice of grid size
should consider the number of probes available. On average, it would be better
to have a few hundred probes in each interval.
IV.10-[Expected hybridization length]
The number of probes contained in a typical hybridization region. For example,
in ChIP-Chip experiment, if IP fragment length = 1000bp, probe density= 1 probe
/ 35 bp. Then one would expect to observe 28 probes on average in a binding
region, and one can set expected hybridization length = 28.
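The arithmetic in this example can be reproduced with a one-line Python helper. Rounding down is an assumption made here so that the result matches the value 28 quoted above; the function name is illustrative.

```python
def expected_hybridization_length(fragment_length_bp, probe_spacing_bp):
    """Number of probes expected in a fragment, rounded down."""
    return fragment_length_bp // probe_spacing_bp

n_probes = expected_hybridization_length(1000, 35)  # 1000 / 35 is about 28.6
```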
IV.11-[Path to transition probability matrix]
If you choose "Set by users" in IV.3, please prepare a transition probability
matrix in working directory and in _transp.txt format (see "Output File
Format"). Provide its name here.
IV.12-[Path to emission probability matrix]
If you choose "Set by users" in IV.3, please prepare a emission probability
matrix in working directory and in _emissp.txt format (see "Output File
Format"). Provide its name here.
V.1-[Local FDR <]
Local false discovery rate cutoff to call regions of interest in MA.
V.2-[Maximal gap allowed]
Two significant probes whose distance is <= [maximal gap allowed] will be treated
as a single region. For example, in a ChIP-Chip experiment, if the IP fragment
length = 1000bp, one can set maximal gap allowed = 500, half of the IP fragment
length.
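The merging rule can be sketched in Python as follows. The sketch assumes the input is a sorted list of coordinates of significant probes on a single chromosome; the function name merge_into_regions is hypothetical.

```python
def merge_into_regions(positions, max_gap):
    """Merge sorted significant-probe positions into (start, end) regions.

    A probe joins the current region when its distance to the region's
    last probe is <= max_gap; otherwise it starts a new region.
    """
    regions = []
    for pos in positions:
        if regions and pos - regions[-1][1] <= max_gap:
            regions[-1][1] = pos
        else:
            regions.append([pos, pos])
    return [tuple(r) for r in regions]

# Hypothetical coordinates with maximal gap allowed = 500.
regions = merge_into_regions([1000, 1200, 1650, 3000], max_gap=500)
```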
V.3-[W]
The half window size. The moving average will be taken over a 2*W+1 window, i.e.
each window will contain 2*W+1 probes.
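A simple Python sketch of a moving average with half window size W is shown below. It is an illustration only: it averages the statistics as given, shrinks the window at the two ends of the list (an assumption, since edge handling is not described here), and ignores gaps between probes.

```python
def moving_average(stats, w):
    """Average each value with up to w neighbors on each side (2*w+1 window)."""
    n = len(stats)
    out = []
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        window = stats[lo:hi]
        out.append(sum(window) / len(window))
    return out

ma = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], w=1)
```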
V.4-[Method to compute local FDR]
You can choose to use UMS or permutation test to compute local FDR.
If you choose "UMS" in V.4, please set UMS parameters in V.5 - V.10. If you
choose "Permutation Test" in V.4, please set the grid size in V.10, and
then go back to the {Project Title}.cmpinfo file
and fill out its "Permutation Setup" section. There you will set how to do the
permutations and how many permutations to do.
Hint: depending on the size of the data, permutation test could be very slow.
V.5-[Provide your own selection statistics?]
If you choose to use UMS to get local FDR, you have the option to provide your
own selection statistics in UMS. If you choose not to provide your own selection
statistics, tilemap will use probe level test-statistics as the selection
statistics.
V.6-[If Yes to IV.4, selection statistics file]
If you choose to provide your own selection statistics, please prepare the
statistics in a *_pb.sum file and provide its name here. The file should be in
working directory.
V.7-[G0 Selection Criteria, p%]
Set t(p) in UMS. If a probe has a selection statistic > t(p), its downstream
probe will be used to construct g0.
V.8-[G1 Selection Criteria, q%]
Set t(q) in UMS. If a probe has a selection statistic <= t(q), its downstream
probe will be used to construct g1.
V.9-[Selection Offset]
If probe i has a selection statistic > t(p), probe (i+selection_offset) will be
used to construct g0. Similar for g1. Usually, selection offset = W+1 in MA.
V.10-[Grid Size]
How many intervals should [0,1] be divided into. For example, if grid size =
1000, [0,1] will be divided into 0.001, 0.002, ..., 1.000. g0 and g1 will be
estimated by empirical distributions on this grid. The choice of grid size
should consider the number of probes available. On average, it would be better
to have a few hundred probes in each interval.
##################################
# extract_parameter_file #
##################################
[Working directory]
The directory that contains raw data file and all the tilemap-generated files.
All the new files generated by this command will be exported to this directory.
[Project Title]
The title of the project. The program will automatically extract data from
[Project Title]_f_pb.sum, [Project Title]_hmm.sum, [Project Title]_ma.sum and
the raw data file if available.
[Probe Level Summary]
You are required to provide a probe level summary file in _pb.sum format. The
program will decide which probes to retrieve based on this file.
[Raw Data]
You can provide the name of the raw data file. This file should be in working
directory. If provided, raw data will be extracted. If you set NULL here, the
program will only get test-statistics for target probes. No raw data will be
extracted.
[Regions]
Each line below [Regions] represents a region you wish to extract. Each line
contains at least three tab-delimited columns.
col1: chromosome name
col2: start coordinate in the chromosome
col3: end coordinate in the chromosome
other columns: defined by users themselves.
e.g.
chr21 14676034 14678449 target 651 +
chr21 17421450 17423637 target 651 +
chr21 18111757 18113819 target 651 +
chr21 26027537 26031330 target 651 +
For each region, the program will generate a file named
[Project Title]_[chromosome]_[start]_[end].txt in the working directory. The file will
contain all the probes in the specified region, their coordinates and summary
statistics.
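The naming rule for the exported files can be sketched in Python as follows. The helper names are hypothetical, and the project title "cMycA_tile" is an assumption inferred from the file name used in the visualization example.

```python
def parse_region_line(line):
    """Split a [Regions] line into chromosome, start, end and extra columns."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2]), fields[3:]

def extract_filename(project_title, chrom, start, end):
    """Build [Project Title]_[chromosome]_[start]_[end].txt."""
    return f"{project_title}_{chrom}_{start}_{end}.txt"

chrom, start, end, extra = parse_region_line("chr21\t14676034\t14678449\ttarget\t651\t+")
name = extract_filename("cMycA_tile", chrom, start, end)
```

The resulting name is the file that would then be passed to tilemap_plot.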
/* ------------------------ */
/* Output File Format */
/* ------------------------ */
##################################
# *.refmask #
##################################
This is a tab-delimited file used for sorting probes based on their genomic
coordinates and for filtering local repeats.
col1: chromosome name
col2: coordinate in the chromosome
col3: how many probes in the array are mapped (without any mismatches) to the
position specified by col1 and col2.
col4: within a local window (2000 bp by default in tilemap) centered at the
position given by col1 and col2, how many genomic loci have the same probe sequence
as the probe at that position. If the number is >1, the probe in question will be
treated as a local repeat and will be filtered out later.
col5: probe sequence
##################################
# *_pb.sum #
##################################
This is a tab-delimited file to record probe level test-statistics.
col1: chromosome name
col2: coordinates in the chromosome
col3: probe-level test-statistics. The statistics are transformed such that the
smaller the statistics, the more significant.
##################################
# *_hmm.sum #
##################################
This is a tab-delimited file to record posterior probability generated by HMM.
col1: chromosome name
col2: coordinates in the chromosome
col3: posterior probability that a probe is in a region of interest. The larger
the posterior probability, the more significant a probe is.
##################################
# *_ma.sum #
##################################
This is a tab-delimited file to record MA summaries.
col1: chromosome name
col2: coordinates in the chromosome
col3: Moving average (MA) statistics
col4: Local false discovery rate that a probe is in a region of interest. The
smaller the local FDR, the better.
##################################
# *.bed #
##################################
This is a UCSC *.bed file to report significant regions. Regions are sorted
according to their genomic locations.
col1: chromosome name
col2: region start
col3: region end
col4: no meaning
col5: 1000*[hmm posterior probability] or 1000*(1-lfdr of MA)
col6: always +
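The column layout above can be illustrated with a Python sketch that formats one *.bed line. Rounding the score to an integer and using "." as the col4 placeholder are assumptions made for this illustration; the function name bed_line is hypothetical.

```python
def bed_line(chrom, start, end, prob, name="."):
    """Format one region as a tab-delimited BED line.

    prob is the HMM posterior probability, or (1 - local FDR) for MA.
    """
    score = int(round(1000 * prob))
    return "\t".join([chrom, str(start), str(end), name, str(score), "+"])

hmm_line = bed_line("chr21", 14676034, 14678449, 0.987)
ma_line = bed_line("chr21", 17421450, 17423637, 1 - 0.25)  # local FDR of 0.25
```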
##################################
# *.reg #
##################################
This is a tab-delimited file to report significant regions. Regions are ranked
according to their significance levels.
col1: chromosome name
col2: start coordinate
col3: end coordinate
col4: the line # of the starting probe (to help locate the probe in the *_pb.sum,
*_hmm.sum and *_ma.sum files)
col5: the line # of the ending probe
col6: maximum posterior probability or minimum local FDR of all the probes in the
region
col7: mean posterior probability or mean local FDR of the region. If the region is
formed by merging two discrete regions that are separated by less than [maximal gap],
then the mean is obtained as follows: first, compute the means of the two discrete
regions separately; then, take the minimum of the two means and use it as the mean here.
##################################
# _transp.txt #
##################################
HMM transition probability matrix, in the format:
1-a0 a0
a1 1-a1
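A user-supplied transition matrix can be sanity-checked before use with a small Python sketch like the one below. It assumes the two rows are whitespace-separated numbers and checks that each row sums to 1, as the layout above implies; the function name parse_transp and the sample values are illustrative.

```python
def parse_transp(text):
    """Parse a 2x2 transition matrix and check that each row sums to 1."""
    rows = [[float(x) for x in line.split()] for line in text.splitlines() if line.strip()]
    if len(rows) != 2 or any(len(r) != 2 for r in rows):
        raise ValueError("expected a 2x2 matrix")
    for r in rows:
        if abs(sum(r) - 1.0) > 1e-9:
            raise ValueError("each row must sum to 1")
    return rows

# Hypothetical values: a0 = 0.01, a1 = 0.05.
matrix = parse_transp("0.99 0.01\n0.05 0.95\n")
```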
##################################
# _emissp.txt #
##################################
HMM emission probability matrix, in the format:
interval(1) interval(2) interval(3) ... interval(n)
f0(1) f0(2) f0(3) ... f0(n)
f1(1) f1(2) f1(3) ... f1(n)
For a probe-level test-statistic t, if interval(i-1) < t <= interval(i), then
f0(t) = f0(i) (the likelihood for H=0) and f1(t) = f1(i) (the likelihood for H=1).
NOTICE: the interval(i) should divide [0,1] equally, i.e. interval(i+1)-interval(i)
= interval(i)-interval(i-1). interval(n) is always 1.0. Although not explicitly defined
in the file, interval(0)=0.
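The lookup rule can be sketched in Python with a binary search over the interval boundaries. The matrix values below are made up for illustration, the function name is hypothetical, and treating t = 0 as belonging to the first interval is an assumption.

```python
from bisect import bisect_left

def emission_likelihoods(t, intervals, f0, f1):
    """Return (f0(t), f1(t)) for the interval with interval(i-1) < t <= interval(i)."""
    i = bisect_left(intervals, t)      # first boundary that is >= t
    i = min(i, len(intervals) - 1)     # guard against t slightly above 1.0
    return f0[i], f1[i]

# Hypothetical 4-interval emission matrix.
intervals = [0.25, 0.5, 0.75, 1.0]
f0 = [0.1, 0.2, 0.3, 0.4]
f1 = [0.4, 0.3, 0.2, 0.1]

inside = emission_likelihoods(0.3, intervals, f0, f1)      # falls in (0.25, 0.5]
boundary = emission_likelihoods(0.25, intervals, f0, f1)   # upper bounds are inclusive
```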
##################################
# file exported by #
# tilemap_extract #
##################################
This is a tab-delimited file.
col1: probe coordinate in chromosome
col2: probe level test-statistics
col3: HMM posterior probability
col4: MA statistics
col5: local FDR for MA
col6 and after: raw data
REFERENCES:
Bolstad, B.M., Irizarry, R.A., Astrand, M. and Speed, T.P. (2003) A comparison of
normalization methods for high density oligonucleotide array data based on variance
and bias, Bioinformatics, 19(2), 185-193.
Ji, H. and Wong, W.H. (2005) TileMap: create chromosomal map of tiling array
hybridizations. (Submitted).