Introduction¶

The genome stores genetic information, whereas the epigenome controls how that information is read; it plays a central role in regulating how and when the genome is organized, duplicated and transcribed. The epigenome is therefore highly dynamic and multidimensional. It comprises:

Histone modifications
DNA methylation
Chromatin accessibility
Gene expression
Small RNA expression

Disruptions of the epigenome are frequently linked to diseases, including cancer. For example, hypermethylation of CpG islands at tumour suppressor genes switches off these genes, whereas global hypomethylation leads to genome instability and inappropriate activation of oncogenes and transposable elements.

Because the epigenome is dynamic and multidimensional, integrative analysis of different data types using a systems biology approach is usually more effective than looking at a single data type. And because the epigenome is tissue specific, such integrative analysis should be performed on data generated from the same tissue or cell type. Here we focus on commonly used cell lines, including LNCaP, VCaP, LNCaP-abl, MCF7, GM12878, K562, HeLa-S3, A549 and HePG2, and collect published epigenetic datasets generated from these cells. Taking advantage of these published resources, Epidaurus facilitates integrative analysis of the cancer epigenome and enables researchers to quickly visualize the epigenetic landscape around genomic regions of interest (e.g. transcription factor binding sites, TSSs, DNA motifs), which is useful for validating hypotheses and generating new biological insights.

The Epidaurus online web server has 233 prebuilt datasets for users to analyze. The source code is also freely available for users who want to analyze their own private data (see below for detailed instructions).

Prebuilt Data¶

Cell line / data type	Number of datasets
LNCaP	59
LNCaP-Abl	7
VCaP	10
MCF7	59
GM12878	16
K562	24
HeLa-S3	18
A549	5
HePG2	14
Genome features	19

Web server¶

Epidaurus online web server

This web server provides a graphical user interface for analyzing the prebuilt datasets. Users who want to analyze their own private data need to download and install Epidaurus following the instructions below.

Install Epidaurus¶

Prerequisites¶

python3
pip3

Install/upgrade epiprofile using pip3¶

$ pip3 install epiprofile
$ pip3 install epiprofile --upgrade

Run epidaurus¶

Epidaurus needs 2 input files:

The Config file specifies the options used to run epidaurus.py and which datasets are included in your analysis (see below for details).

The BED file contains the genomic regions you want to analyze, for example transcription factor binding sites (TFBS) identified from ChIP-seq, or genome features such as transcription start sites (TSS) or DNA motifs.

NOTE:

Input BED file must contain at least 3 columns (chrom, start, end). Other columns (if any) will be ignored.

When HALF_WINDOW_SIZE > 0, Epidaurus ignores the span defined in each row of the BED file: it always takes the midpoint of the region and extends HALF_WINDOW_SIZE bp (default 1000 bp) both upstream and downstream.

When HALF_WINDOW_SIZE = 0, Epidaurus uses the original regions in the BED file without extension; in this case all genomic regions in the input BED file must be the same size.

By default, Epidaurus only retrieves signals for the first 2000 rows of the BED file (HEAD_ROWS = 2000, adjustable in the Config file). It is therefore best to sort the input BED file by peak intensity, p-value, etc., so that the most relevant regions come first.

All epigenetic datasets must be prepared in BigWig format.
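
The windowing rules in the notes above can be illustrated with a short Python sketch. This is not Epidaurus's own code: the function name load_windows, the skipping of header lines, the round-down midpoint and the clamping of negative start coordinates to 0 are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int]


def load_windows(bed_lines: Iterable[str],
                 half_window_size: int = 1000,
                 head_rows: int = 2000) -> List[Region]:
    """Return the windows that would be profiled (illustrative sketch only)."""
    regions: List[Region] = []
    for line in bed_lines:
        line = line.strip()
        # Assumption: blank lines and BED header/comment lines are skipped.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 columns: {line!r}")
        # Only chrom, start and end are used; other columns are ignored.
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if half_window_size > 0:
            # Assumption: integer midpoint rounded down, start clamped at 0.
            mid = (start + end) // 2
            start, end = max(0, mid - half_window_size), mid + half_window_size
        regions.append((chrom, start, end))
        if len(regions) >= head_rows:
            break  # only the first HEAD_ROWS rows are used
    sizes = {end - start for _, start, end in regions}
    if half_window_size == 0 and len(sizes) > 1:
        raise ValueError("HALF_WINDOW_SIZE = 0 requires regions of equal size")
    return regions


example = ["chr1\t1000\t1200\tpeak1\t50", "chr2\t500\t700\tpeak2\t40"]
print(load_windows(example, half_window_size=1000, head_rows=2000))
```

Because only the first HEAD_ROWS rows are kept, the order of the BED file decides which regions are profiled, which is why sorting by peak strength beforehand matters.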

# Step 1: Install Epidaurus (see above for instructions)

# Step 2: Download test dataset (see above for instructions)

# Step 3:
$ unzip Epidaurus_test.zip
Archive:  Epidaurus_test.zip
 inflating: LNCaP_AR_ChIPseq_DHT_GSM686917.100M.bw
 inflating: LNCaP_FoxA1_ChIPseq_DHT_GSM686926.bw
 inflating: LNCaP_GROseq_DHT_GSM686949.bw
 inflating: LNCaP_H2A.Z_ChIPseq_DHT_GSM686941.bw
 inflating: LNCaP_H3K27ac_ChIPseq_DHT_GSM686937.bw
 inflating: LNCaP_H3K27me3_DHT_GSM969571.bw
 inflating: LNCaP_H3K4me1_ChIPseq_DHT_GSM686928.bw
 inflating: LNCaP_H3K4me2_ChIPseq_DHT_GSM686932.bw
 inflating: test_config.txt
 inflating: test_half_ARE.bed

# Step 4: run epidaurus.py from the directory where the bigWig files are located. If you run it from another directory, you MUST edit the file paths in test_config.txt
$ python3 epidaurus.py test_config.txt test_half_ARE.bed output

Parameters:
HM_FORMAT           pdf
DIST_METRIC         kendall
HALF_WINDOW_SIZE    1000
HEAD_ROWS           2000

Extend 1000 bp to the middle of test_half_ARE.bed file
total 2000 lines loaded!

extracting signals from SEED: AR -> ./LNCaP_AR_ChIPseq_DHT_GSM686917.100M.bw
[1/7] extracting signals from file: GROseq -> ./LNCaP_GROseq_DHT_GSM686949.bw ...
[2/7] extracting signals from file: H3K4me1 -> ./LNCaP_H3K4me1_ChIPseq_DHT_GSM686928.bw ...
[3/7] extracting signals from file: H2AZ -> ./LNCaP_H2A.Z_ChIPseq_DHT_GSM686941.bw ...
[4/7] extracting signals from file: H3K4me2 -> ./LNCaP_H3K4me2_ChIPseq_DHT_GSM686932.bw ...
[5/7] extracting signals from file: H3K27ac -> ./LNCaP_H3K27ac_ChIPseq_DHT_GSM686937.bw ...
[6/7] extracting signals from file: FoxA1 -> ./LNCaP_FoxA1_ChIPseq_DHT_GSM686926.bw ...
[7/7] extracting signals from file: H3K27me3 -> ./LNCaP_H3K27me3_DHT_GSM969571.bw ...

Output files¶

prefix.r

R script to generate the heatmap and line graph. Note that for each epigenetic dataset (i.e. each row in the heatmap), all values are normalized into [0,1] for better visualization. This means colors are NOT comparable between rows.

prefix.data.xls

Raw data extracted from bigwig files.

prefix.heatmap.pdf

Heatmap showing the epigenome profile centered on the input genome regions.

prefix.curve.pdf

Overlaid curves showing the epigenome profile centered on the input genome regions.
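
To see why colors are not comparable between heatmap rows, consider the following Python sketch of per-dataset scaling. It assumes simple min-max scaling; the documentation only states that values are normalized into [0,1], so the exact method used in the generated R script may differ, and the function name is hypothetical.

```python
from typing import List, Sequence


def scale_to_unit_interval(values: Sequence[float]) -> List[float]:
    """Min-max scale one dataset's signal values into [0, 1] (assumed method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat profile has no dynamic range; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Two hypothetical datasets whose raw signals differ 100-fold.
strong = [0.0, 50.0, 100.0]
weak = [0.0, 0.5, 1.0]
print(scale_to_unit_interval(strong))
print(scale_to_unit_interval(weak))
```

Both datasets end up with identical scaled values, so the same color in two rows can correspond to very different raw signal levels. Use prefix.data.xls when absolute values matter.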

Graph Example 1¶

Epigenomic and Genomic landscape around Androgen Receptor (AR) binding motif (ARE)

Graph Example 2¶

Epigenomic and Genomic landscape around CTCF binding motifs

Prepare Config file¶

The Config file defines the program parameters and the epigenetic datasets needed to run Epidaurus.

Config file specifications

Lines starting with “!” define parameter-argument pairs. The parameter and argument must be separated by a space or Tab, for example “!HALF_WINDOW_SIZE 1000”.

Lines starting with “%” define bigWig files. The keyword and data path must be separated by a space or Tab. Keywords must be unique across the Config file and should be as concise as possible.

The line starting with “&” defines the seed bigWig file. The distance (measured by correlation) between the seed bigWig file and each of the other datasets is calculated, and the order of the datasets on the heatmap is determined by this distance. Only 1 seed bigWig file is allowed.

All other lines will be skipped.
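
The seed-based ordering described above can be sketched in Python as follows. This is a simplified illustration, not the Epidaurus implementation. It uses Pearson correlation rather than the default kendall metric to keep the code short, it assumes distance = 1 - correlation, and it assumes each dataset has already been summarized as one list of signal values. The function names and example profiles are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / sqrt(var_x * var_y)


def order_by_distance_to_seed(seed: Sequence[float],
                              datasets: Dict[str, Sequence[float]]) -> List[str]:
    """Return dataset keywords sorted from closest to farthest from the seed."""
    # Assumption: distance = 1 - correlation, so a perfectly correlated
    # dataset has distance 0 and an anti-correlated one has distance 2.
    distances = {name: 1.0 - pearson(seed, profile)
                 for name, profile in datasets.items()}
    return sorted(distances, key=distances.get)


seed_profile = [1.0, 2.0, 5.0, 2.0, 1.0]        # hypothetical seed signal
profiles = {
    "opposite": [5.0, 4.0, 1.0, 4.0, 5.0],      # dips where the seed peaks
    "similar": [2.0, 4.0, 9.0, 4.0, 2.0],       # peaks where the seed peaks
    "flat_ish": [1.0, 1.1, 1.0, 1.2, 1.0],      # almost no structure
}
print(order_by_distance_to_seed(seed_profile, profiles))
```

Datasets that resemble the seed end up next to it on the heatmap, which makes co-occurring marks easier to spot.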

Example of Config file¶

=========================================================================================
                               Parameters
=========================================================================================

!HALF_WINDOW_SIZE 1000

  Description:
  1) Specify half window size (bp). First we determine the middle point of each genomic interval,
     then extend half window size both upstream and downstream.
  2) Default = 1000 (bp) (recommended for better visualization)
  3) If HALF_WINDOW_SIZE is set to 0, all genomic intervals will be kept AS IS. With this option,
     all genomic intervals need to be the same size.
  4) 'HALF_WINDOW_SIZE' is a reserved word, do NOT change.

!HEAD_ROWS 2000

  Description:
  1) Specify how many rows of the BED file are used. Default = 2000 (only the top 2000 rows will be used).
  2) 'HEAD_ROWS' is a reserved word, do NOT change.

!HM_FORMAT pdf

 Description:
 Heatmap format. This must be one of ('pdf', 'png', 'tiff'). default: pdf

!DIST_METRIC kendall

 Description:
 Method to measure distance between seed data and each of the bigwig datasets. Must be one of
 ("pearson", "kendall", "spearman", "euclidean"). Default: kendall


=========================================================================================
                              bigWig files
=========================================================================================

 below is the seed dataset, indicated with '&'

&AR            /data2/bsi/staff_analysis/LNCaP/data/AR.bw

 below are the other datasets, indicated with '%'

%FoxA1         /data2/bsi/staff_analysis/LNCaP/data/FoxA1_ChIPseq_DHT_GSM686926.bw
%DnaseI        /data2/bsi/staff_analysis/LNCaP/data/DnaseI_Methyltrienolone_GSM816634.bw
%H3K27ac       /data2/bsi/staff_analysis/LNCaP/data/H3K27ac_ChIPseq_DHT_GSM686937.bw
%H3K4me1       /data2/bsi/staff_analysis/LNCaP/data/H3K4me1_ChIPseq_DHT_GSM686928.bw
%H3K27me3      /data2/bsi/staff_analysis/LNCaP/data/H3K27me3_DHT_GSM969571.bw
%H3K4me2       /data2/bsi/staff_analysis/LNCaP/data/H3K4me2_ChIPseq_DHT_GSM686932.bw
%H3K4me3       /data2/bsi/staff_analysis/LNCaP/data/H3K4me3_ChIPseq_DHT_GSM686935.bw
%H3K9me3       /data2/bsi/staff_analysis/LNCaP/data/H3K9me3_R1881_GSM353610.bw
%P300          /data2/bsi/staff_analysis/LNCaP/data/P300_ChIPseq_DHT_GSM686943.bw
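
A Config file like the example above could be read with a few lines of Python. The sketch below follows the documented rules ("!" for parameters, "&" for the single seed, "%" for other bigWig files, everything else skipped). The function name parse_config is hypothetical. The sketch also assumes that the marker must be the first character of the line and that a seed is mandatory; neither is confirmed by the documentation.

```python
from typing import Dict, Optional, Tuple

# Defaults as documented in the parameter descriptions.
DEFAULTS = {
    "HALF_WINDOW_SIZE": "1000",
    "HEAD_ROWS": "2000",
    "HM_FORMAT": "pdf",
    "DIST_METRIC": "kendall",
}


def parse_config(text: str) -> Tuple[Dict[str, str], Tuple[str, str], Dict[str, str]]:
    """Split config text into (parameters, seed, other bigWig files)."""
    params = dict(DEFAULTS)
    seed: Optional[Tuple[str, str]] = None
    tracks: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.rstrip()
        # Assumption: the marker must be the very first character.
        if not line or line[0] not in "!&%":
            continue  # all other lines are skipped
        marker, fields = line[0], line[1:].split(None, 1)
        if len(fields) != 2:
            raise ValueError(f"expected '<name> <value>' after {marker!r}: {line!r}")
        name, value = fields
        if marker == "!":
            params[name] = value
        elif name in tracks or (seed is not None and name == seed[0]):
            raise ValueError(f"duplicate keyword: {name}")
        elif marker == "&":
            if seed is not None:
                raise ValueError("only 1 seed bigWig file is allowed")
            seed = (name, value)
        else:
            tracks[name] = value
    if seed is None:
        raise ValueError("no seed bigWig file ('&' line) found")
    return params, seed, tracks


example = """\
!HALF_WINDOW_SIZE 1000
!DIST_METRIC kendall
this line is skipped
&AR            /data2/bsi/staff_analysis/LNCaP/data/AR.bw
%FoxA1         /data2/bsi/staff_analysis/LNCaP/data/FoxA1_ChIPseq_DHT_GSM686926.bw
"""
params, seed, tracks = parse_config(example)
print(params["HALF_WINDOW_SIZE"], seed[0], sorted(tracks))
```

Parameter values are kept as strings here; converting HALF_WINDOW_SIZE and HEAD_ROWS to integers is left to the caller.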

How to make your own database¶

Since Epidaurus only takes bigWig files (a standard data format used by the UCSC Genome Browser) as input, making your own “database” is fairly simple:

Save the bigWig files to your computer
Make a Config file pointing to these bigWig files (see above for an example)
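
The second step can be scripted. The following Python sketch writes Config text from a keyword-to-path mapping, using the "!", "&" and "%" line types described earlier. The helper build_config and the paths under /path/to/bigwig are hypothetical and only illustrate the file layout.

```python
from typing import Dict, Tuple


def build_config(seed: Tuple[str, str], tracks: Dict[str, str],
                 half_window_size: int = 1000, head_rows: int = 2000,
                 hm_format: str = "pdf", dist_metric: str = "kendall") -> str:
    """Return Config text for one seed and any number of other bigWig files."""
    seed_key, seed_path = seed
    if seed_key in tracks:
        raise ValueError("keywords must be unique across the config file")
    for key in [seed_key, *tracks]:
        # Keyword and path are whitespace separated, so keywords cannot contain spaces.
        if not key or any(ch.isspace() for ch in key):
            raise ValueError(f"invalid keyword: {key!r}")
    lines = [
        f"!HALF_WINDOW_SIZE {half_window_size}",
        f"!HEAD_ROWS {head_rows}",
        f"!HM_FORMAT {hm_format}",
        f"!DIST_METRIC {dist_metric}",
        f"&{seed_key}\t{seed_path}",           # exactly one seed line
    ]
    lines += [f"%{key}\t{path}" for key, path in tracks.items()]
    return "\n".join(lines) + "\n"


config_text = build_config(
    seed=("AR", "/path/to/bigwig/AR.bw"),      # hypothetical paths
    tracks={"FoxA1": "/path/to/bigwig/FoxA1.bw",
            "H3K27ac": "/path/to/bigwig/H3K27ac.bw"},
)
print(config_text)
```

Writing config_text to a file gives a Config file that can be passed to epidaurus.py together with a BED file.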