Introduction¶
The genome stores genetic information, whereas the epigenome controls how that information is read; it plays a central role in regulating how and when the genome is organized, duplicated and transcribed. The epigenome is therefore highly dynamic and multidimensional. Its components include:
Chromatin accessibility
small RNA expression
Disruptions of the epigenome are frequently linked to diseases, including cancer. For example, hypermethylation of CpG islands at tumour suppressor genes switches off these genes, whereas global hypomethylation leads to genome instability and inappropriate activation of oncogenes and transposable elements.
Because the epigenome is dynamic and multidimensional, integrative analysis of different data types using a systems biology approach is usually more effective than examining a single data type. And because the epigenome is tissue specific, such integrative analysis should be performed on data generated from the same tissue or cell type. Here we focus on commonly used cell lines, including LNCaP, VCaP, LNCaP-abl, MCF7, GM12878, K562, HeLa-S3, A549 and HepG2, and collect published epigenetic datasets generated from these cells. Taking advantage of these published resources, Epidaurus facilitates integrative analysis of the cancer epigenome and enables researchers to quickly visualize the epigenetic landscape around genomic regions of interest (e.g. transcription factor binding sites, TSSs, DNA motifs), which is useful for validating hypotheses and generating new biological insights.
The Epidaurus online web server provides 233 prebuilt datasets for users to analyze. The source code is also freely available for users who want to analyze their own private data (see below for detailed instructions).
Prebuilt Data¶
Type | Number of datasets |
---|---|
LNCaP | 59 |
LNCaP-Abl | 7 |
VCaP | 10 |
MCF7 | 59 |
GM12878 | 16 |
K562 | 24 |
HeLa-S3 | 18 |
A549 | 5 |
HepG2 | 14 |
Genome features | 19 |
Web server¶
The web server provides a graphical user interface for analyzing the prebuilt datasets. Users who want to analyze their own private data need to download and install Epidaurus following the instructions below.
Install Epidaurus¶
Install/upgrade epiprofile using pip3¶
$ pip3 install epiprofile
$ pip3 install epiprofile --upgrade
Run epidaurus¶
Epidaurus needs two input files:
Config file: specifies the options used to run epidaurus.py and which datasets will be included in your analysis (see below for details).
BED file: contains the genomic regions you want to analyze. For example, these can be transcription factor binding sites (TFBS) identified from ChIP-seq, or genome features such as transcription start sites (TSS) or DNA motifs.
NOTE:
Input BED file must contain at least 3 columns (chrom, start, end). Other columns (if any) will be ignored.
When HALF_WINDOW_SIZE > 0, Epidaurus ignores the actual span of each row in the BED file: it takes the middle point of the region and extends HALF_WINDOW_SIZE bp (default is 1000 bp) both upstream and downstream.
When HALF_WINDOW_SIZE = 0, Epidaurus uses the original regions provided in the BED file without extension. In this case, all genomic regions in the input BED file must be the same size.
By default, Epidaurus only retrieves signals for the first 2000 rows of the BED file (HEAD_ROWS = 2000, adjustable in the Config file). It is therefore best to sort the input BED file by peak intensity, p-value, etc., so that the most important regions come first.
All epigenetic datasets must be prepared in BigWig format.
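The rules above for HEAD_ROWS and HALF_WINDOW_SIZE can be illustrated with a minimal Python sketch. This is not Epidaurus source code: the function names `read_bed_regions` and `to_window` are hypothetical, the integer midpoint `(start + end) // 2` is an assumption, skipping `#`/`track`/`browser` header lines is an assumption, and regions close to a chromosome start are not handled specially.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end)


def read_bed_regions(lines: Iterable[str], head_rows: int = 2000) -> List[Region]:
    """Keep the first `head_rows` BED rows; columns after the third are ignored."""
    regions: List[Region] = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and common BED header lines are skipped.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"BED row needs at least 3 columns: {line!r}")
        regions.append((fields[0], int(fields[1]), int(fields[2])))
        if len(regions) >= head_rows:
            break
    return regions


def to_window(region: Region, half_window_size: int = 1000) -> Region:
    """Return the region to profile for one BED row."""
    chrom, start, end = region
    if half_window_size == 0:
        # Regions are used as-is; the caller must ensure they all have the same size.
        return region
    mid = (start + end) // 2  # assumption: integer midpoint of the interval
    return (chrom, mid - half_window_size, mid + half_window_size)


if __name__ == "__main__":
    bed = ["chr1\t5000\t5200\tpeak1\t87.5", "chr2\t10000\t10050\tpeak2\t60.1"]
    for region in read_bed_regions(bed, head_rows=2000):
        print(to_window(region, half_window_size=1000))
```

With a half window of 1000 bp, every row yields a 2000 bp window regardless of the original peak width, which is what makes the rows of the heatmap directly comparable in width.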
# Step 1: Install Epidaurus (see above for instructions)
# Step 2: Download test dataset (see above for instructions)
# Step 3:
$ unzip Epidaurus_test.zip
Archive: Epidaurus_test.zip
inflating: LNCaP_AR_ChIPseq_DHT_GSM686917.100M.bw
inflating: LNCaP_FoxA1_ChIPseq_DHT_GSM686926.bw
inflating: LNCaP_GROseq_DHT_GSM686949.bw
inflating: LNCaP_H2A.Z_ChIPseq_DHT_GSM686941.bw
inflating: LNCaP_H3K27ac_ChIPseq_DHT_GSM686937.bw
inflating: LNCaP_H3K27me3_DHT_GSM969571.bw
inflating: LNCaP_H3K4me1_ChIPseq_DHT_GSM686928.bw
inflating: LNCaP_H3K4me2_ChIPseq_DHT_GSM686932.bw
inflating: test_config.txt
inflating: test_half_ARE.bed
# Step 4: run epidaurus.py from the directory where the bigwig files are located. If you run it from another directory, you MUST edit the bigwig paths in the *test_config.txt* file
$ python2.7 epidaurus.py test_config.txt test_half_ARE.bed output
Parameters:
HM_FORMAT pdf
DIST_METRIC kendall
HALF_WINDOW_SIZE 1000
HEAD_ROWS 2000
Extend 1000 bp to the middle of test_half_ARE.bed file
total 2000 lines loaded!
extracting signals from SEED: AR -> ./LNCaP_AR_ChIPseq_DHT_GSM686917.100M.bw
[1/7] extracting signals from file: GROseq -> ./LNCaP_GROseq_DHT_GSM686949.bw ...
[2/7] extracting signals from file: H3K4me1 -> ./LNCaP_H3K4me1_ChIPseq_DHT_GSM686928.bw ...
[3/7] extracting signals from file: H2AZ -> ./LNCaP_H2A.Z_ChIPseq_DHT_GSM686941.bw ...
[4/7] extracting signals from file: H3K4me2 -> ./LNCaP_H3K4me2_ChIPseq_DHT_GSM686932.bw ...
[5/7] extracting signals from file: H3K27ac -> ./LNCaP_H3K27ac_ChIPseq_DHT_GSM686937.bw ...
[6/7] extracting signals from file: FoxA1 -> ./LNCaP_FoxA1_ChIPseq_DHT_GSM686926.bw ...
[7/7] extracting signals from file: H3K27me3 -> ./LNCaP_H3K27me3_DHT_GSM969571.bw ...
Output files¶
prefix.r
R script to generate the heatmap and line graph. Note that for each epigenetic dataset (i.e. each row in the heatmap), all values are normalized to [0,1] for better visualization. This means colors are NOT comparable between rows.
prefix.data.xls
Raw data extracted from bigwig files.
prefix.heatmap.pdf
Heatmap showing epigenome profile centered on input genome regions
prefix.curve.pdf
Overlaid curves showing epigenome profile centered on input genome regions
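To see why colors are not comparable between heatmap rows, consider the following sketch of per-dataset normalization. It assumes plain min-max scaling; the documentation only states that values are normalized into [0,1], so the exact formula used by Epidaurus may differ, and `normalize_row` is a hypothetical name.

```python
from typing import List, Sequence


def normalize_row(values: Sequence[float]) -> List[float]:
    """Rescale one dataset's signal to [0, 1] (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat profile has no dynamic range; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    weak_signal = [0.0, 5.0, 10.0]
    strong_signal = [0.0, 500.0, 1000.0]
    # Both rows end up identical after scaling, although the raw
    # signals differ by a factor of 100.
    print(normalize_row(weak_signal))
    print(normalize_row(strong_signal))
```

Each row is scaled by its own minimum and maximum, so the same color in two rows can correspond to very different raw signal values. Use prefix.data.xls when absolute values matter.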
Prepare Config file¶
The Config file defines the program parameters and the epigenetic datasets needed to run Epidaurus.
Config file specifications
Lines starting with “!” define parameter-argument pairs. The parameter and argument must be separated by a space or Tab. For example: “!HALF_WINDOW_SIZE 1000”.
Lines starting with “%” define bigWig files. The keyword and data path must be separated by a space or Tab. Keywords must be unique across the config file and should be as concise as possible.
The line starting with “&” defines the seed bigWig file. The distance (measured by correlation) between the seed bigWig file and each of the other datasets is calculated, and this distance determines the order of the datasets on the heatmap. Only 1 seed bigWig file is allowed.
All other lines will be skipped.
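The specification above can be read as a small line-oriented grammar. The sketch below is one possible parser for it, written for illustration only; it is not the parser used by epidaurus.py. `parse_config` is a hypothetical name, and requiring the marker to be the very first character of the line is an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple

Seed = Optional[Tuple[str, str]]


def parse_config(lines: Iterable[str]) -> Tuple[Dict[str, str], Seed, Dict[str, str]]:
    """Return (parameters, seed, datasets) from the lines of a Config file."""
    params: Dict[str, str] = {}
    seed: Seed = None
    datasets: Dict[str, str] = {}
    keywords = set()
    for raw in lines:
        line = raw.rstrip()
        # Assumption: the marker must be the first character; anything else is skipped.
        if not line or line[0] not in "!&%":
            continue
        marker = line[0]
        fields = line[1:].split(None, 1)  # split on the first run of spaces/Tabs
        if len(fields) != 2:
            raise ValueError(f"expected two fields: {line!r}")
        key, value = fields[0], fields[1].strip()
        if marker == "!":
            params[key] = value
            continue
        if key in keywords:
            raise ValueError(f"keyword is not unique: {key}")
        keywords.add(key)
        if marker == "&":
            if seed is not None:
                raise ValueError("only 1 seed bigWig file is allowed")
            seed = (key, value)
        else:
            datasets[key] = value
    return params, seed, datasets


if __name__ == "__main__":
    example = [
        "Parameters",
        "!HALF_WINDOW_SIZE 1000",
        "!DIST_METRIC\tkendall",
        "&AR ./LNCaP_AR_ChIPseq_DHT_GSM686917.100M.bw",
        "%FoxA1 ./LNCaP_FoxA1_ChIPseq_DHT_GSM686926.bw",
    ]
    print(parse_config(example))
```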
Example of Config file¶
=========================================================================================
Parameters
=========================================================================================
!HALF_WINDOW_SIZE 1000
Description:
1) Specify half window size (bp). First we determine the middle point of each genomic interval,
then extend half window size both upstream and downstream.
2) Default = 1000 (bp) (recommended for better visualization)
3) If HALF_WINDOW_SIZE is set to 0, all genomic intervals will be kept AS IS. With this option,
all genomic intervals need to be the same size.
4) 'HALF_WINDOW_SIZE' is reserved word, do NOT change.
!HEAD_ROWS 2000
Description:
1) Specify how many rows of the BED file are used. Default = 2000 (only the top 2000 rows will be used).
2) 'HEAD_ROWS' is reserved word, do NOT change.
!HM_FORMAT pdf
Description:
Heatmap format. This must be one of ('pdf', 'png', 'tiff'). default: pdf
!DIST_METRIC kendall
Description:
Method to measure distance between seed data and each of the bigwig datasets. Must be one of
("pearson", "kendall", "spearman", "euclidean"). Default: kendall
=========================================================================================
bigWig files
=========================================================================================
below is the seed dataset. Indicated with '&'
&AR /data2/bsi/staff_analysis/LNCaP/data/AR.bw
below are other datasets. Indicated with '%'
%FoxA1 /data2/bsi/staff_analysis/LNCaP/data/FoxA1_ChIPseq_DHT_GSM686926.bw
%DnaseI /data2/bsi/staff_analysis/LNCaP/data/DnaseI_Methyltrienolone_GSM816634.bw
%H3K27ac /data2/bsi/staff_analysis/LNCaP/data/H3K27ac_ChIPseq_DHT_GSM686937.bw
%H3K4me1 /data2/bsi/staff_analysis/LNCaP/data/H3K4me1_ChIPseq_DHT_GSM686928.bw
%H3K27me3 /data2/bsi/staff_analysis/LNCaP/data/H3K27me3_DHT_GSM969571.bw
%H3K4me2 /data2/bsi/staff_analysis/LNCaP/data/H3K4me2_ChIPseq_DHT_GSM686932.bw
%H3K4me3 /data2/bsi/staff_analysis/LNCaP/data/H3K4me3_ChIPseq_DHT_GSM686935.bw
%H3K9me3 /data2/bsi/staff_analysis/LNCaP/data/H3K9me3_R1881_GSM353610.bw
%P300 /data2/bsi/staff_analysis/LNCaP/data/P300_ChIPseq_DHT_GSM686943.bw
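The role of the seed dataset and DIST_METRIC can be sketched as follows. This is an illustration under stated assumptions, not the Epidaurus implementation: it uses the "pearson" option rather than the default "kendall" to stay dependency-free, it defines distance as 1 minus the correlation coefficient (the documentation does not give the formula), and the names `pearson` and `order_by_seed` are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def order_by_seed(seed: Sequence[float], profiles: Dict[str, Sequence[float]]) -> List[str]:
    """Sort dataset keywords from most to least similar to the seed profile."""
    # Assumption: distance = 1 - correlation, so 0 means identical shape.
    distance = {name: 1.0 - pearson(seed, prof) for name, prof in profiles.items()}
    return sorted(distance, key=distance.get)


if __name__ == "__main__":
    seed_profile = [1.0, 2.0, 3.0, 4.0]
    others = {
        "H3K27me3": [4.0, 3.0, 2.0, 1.0],  # anti-correlated with the seed
        "FoxA1": [2.0, 4.0, 6.0, 8.0],     # same shape as the seed
        "H3K9me3": [1.0, 1.0, 1.0, 1.0],   # flat
    }
    print(order_by_seed(seed_profile, others))
```

The values in this usage example are made up for illustration. Datasets whose profiles resemble the seed end up next to it in the ordering, so co-occurring marks are easy to spot.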
How to make your own database¶
Since Epidaurus only takes bigwig (a standard data format used by UCSC genome browser) files as input, making your own “database” is fairly simple:
Save bigwig files to your computer
Make Config file pointing to these bigwig files (see above for example)
- NOTE:
You can make your own bigwig files. Wiggle and bedGraph files can be converted into bigwig using the UCSC utility wigToBigWig (or bedGraphToBigWig for bedGraph files).
You can download bigwig files directly from http://genome.ucsc.edu/ENCODE/ or GEO. Most (if not all) ENCODE datasets have bigwig files available. Most ChIP-seq datasets on GEO have wiggle/bedGraph/bigwig files available.
All bigwig files in your database must be based on the same genome assembly (e.g. hg19). If not, use CrossMap (another tool developed by our team) to convert the bigwig files to the same genome assembly.
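Once the bigwig files are saved in one directory, the Config file can be written by hand or generated. The sketch below shows one way to generate it, assuming every file ends in `.bw`, file names contain no whitespace, and the file name without its extension is acceptable as a keyword. `build_config` is a hypothetical helper and is not part of Epidaurus.

```python
from pathlib import Path


def build_config(bigwig_dir: str, seed_file: str,
                 half_window_size: int = 1000, head_rows: int = 2000) -> str:
    """Build Config file text that points at every *.bw file in `bigwig_dir`."""
    paths = sorted(Path(bigwig_dir).glob("*.bw"))
    if seed_file not in {p.name for p in paths}:
        raise FileNotFoundError(f"seed file not found in {bigwig_dir}: {seed_file}")
    lines = [
        f"!HALF_WINDOW_SIZE {half_window_size}",
        f"!HEAD_ROWS {head_rows}",
    ]
    for path in paths:
        # Assumption: the file name without ".bw" serves as the keyword.
        # Keywords are unique because all files share one directory and extension.
        marker = "&" if path.name == seed_file else "%"
        lines.append(f"{marker}{path.stem} {path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    import sys
    # Usage: python make_config.py <bigwig_dir> <seed_file_name> > my_config.txt
    sys.stdout.write(build_config(sys.argv[1], sys.argv[2]))
```

File names such as LNCaP_H3K27ac_ChIPseq_DHT_GSM686937.bw produce long keywords; shortening them by hand afterwards keeps the heatmap labels readable.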
Questions and Feedback¶
Please post your questions and feedback to the Epidaurus Google group: