The genome stores genetic information, whereas the epigenome controls how that information is read; it plays a central role in regulating how and when the genome is organized, duplicated and transcribed. The epigenome is therefore highly dynamic and multidimensional. It comprises:
Disruptions of the epigenome are frequently linked to diseases, including cancer. For example, hypermethylation of CpG islands at tumour suppressor genes switches off these genes, whereas global hypomethylation leads to genome instability and inappropriate activation of oncogenes and transposable elements.
Because the epigenome is dynamic and multidimensional, integrative analysis of different types of data using a systems biology approach is usually more effective than looking at a single data type. And because the epigenome is tissue specific, such integrative analysis should be performed on data generated from the same tissue or cell type. Here we focus on commonly used cell lines, including LNCaP, VCaP, LNCaP-abl, MCF7, GM12878, K562, HeLa-S3, A549 and HepG2, and collect published epigenetic datasets generated from these cells. By taking advantage of these published resources, Epidaurus facilitates integrative analysis of the cancer epigenome and enables researchers to quickly visualize the epigenetic landscape around genomic regions of interest (e.g. transcription factor binding sites, TSSs, DNA motifs), which is useful for validating hypotheses and generating new biological insights.
The Epidaurus web server has 233 prebuilt datasets for users to analyze. The source code is also freely available for users who want to analyze their own private data (see below for detailed instructions).
|Type||Number of datasets|
This web server provides a graphical user interface for analyzing the prebuilt datasets. Users who want to analyze their own private data need to download and install Epidaurus following the instructions below.
Install Epidaurus to the default location:

    $ tar zxf epidaurus-VERSION.tar.gz
    $ cd epidaurus-VERSION
    $ python setup.py install

To install Epidaurus to another location, type:

    $ python setup.py install --help
Epidaurus needs 2 input files:
- Config file: specifies the options for running epidaurus.py and which datasets will be included in your analysis (see below for details).
- BED file: contains the genomic regions you want to analyze. For example, these can be transcription factor binding sites (TFBS) identified from ChIP-seq, or genome features such as transcription start sites (TSS) or DNA motifs.
- Input BED file must contain at least 3 columns (chrom, start, end). Other columns (if any) will be ignored.
- When HALF_WINDOW_SIZE > 0: regardless of the actual span defined in each row of the BED file, Epidaurus always takes the midpoint and extends it by HALF_WINDOW_SIZE (default is 1000 bp) both upstream and downstream.
- When HALF_WINDOW_SIZE = 0: Epidaurus uses the original regions provided in the BED file without extension. In this case, all genomic regions in the input BED file must be the same size.
- By default, Epidaurus only retrieves signals for the first 2000 rows of the BED file (HEAD_ROWS = 2000, adjustable in the Config file). It is therefore better to rank the input BED file by peak intensity, p-value, etc.
- All epigenetic datasets must be prepared in BigWig format.
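The BED handling rules above can be sketched in a few lines of Python. This is only an illustration, not the code in epidaurus.py: the function name `bed_to_windows`, the integer-division midpoint, the clamping of negative starts to 0 and the skipping of header lines are all assumptions made for this example.

```python
def bed_to_windows(lines, half_window_size=1000, head_rows=2000):
    """Turn BED lines into (chrom, start, end) windows.

    Hypothetical helper, not part of Epidaurus: it mirrors the documented
    HALF_WINDOW_SIZE and HEAD_ROWS behaviour under the assumptions above.
    """
    windows = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header lines (assumption).
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError("BED line needs at least 3 columns: %r" % line)
        # Only chrom, start, end are used; extra columns are ignored.
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if half_window_size > 0:
            mid = (start + end) // 2  # integer midpoint (assumption)
            start = max(0, mid - half_window_size)  # clamp at 0 (assumption)
            end = mid + half_window_size
        windows.append((chrom, start, end))
        # Keep only the first `head_rows` regions.
        if len(windows) >= head_rows:
            break
    if half_window_size == 0:
        # Without extension, every region must have the same size.
        sizes = set(e - s for _, s, e in windows)
        if len(sizes) > 1:
            raise ValueError("regions must all be the same size when "
                             "HALF_WINDOW_SIZE is 0")
    return windows
```

For instance, `bed_to_windows(["chr1\t100\t300\tpeak1"], half_window_size=50)` yields a single 100 bp window centred on position 200.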
# Step 1: Install Epidaurus (see above for instructions)
# Step 2: Download test dataset (see above for instructions)
# Step 3:
$ unzip Epidaurus_test.zip
Archive: Epidaurus_test.zip
  inflating: LNCaP_AR_ChIPseq_DHT_GSM686917.100M.bw
  inflating: LNCaP_FoxA1_ChIPseq_DHT_GSM686926.bw
  inflating: LNCaP_GROseq_DHT_GSM686949.bw
  inflating: LNCaP_H2A.Z_ChIPseq_DHT_GSM686941.bw
  inflating: LNCaP_H3K27ac_ChIPseq_DHT_GSM686937.bw
  inflating: LNCaP_H3K27me3_DHT_GSM969571.bw
  inflating: LNCaP_H3K4me1_ChIPseq_DHT_GSM686928.bw
  inflating: LNCaP_H3K4me2_ChIPseq_DHT_GSM686932.bw
  inflating: test_config.txt
  inflating: test_half_ARE.bed
# Step 4: run epidaurus.py from the directory where bigwig files are located. If you run it from other directory, you MUST edit *test_config.txt* file
$ python2.7 epidaurus.py test_config.txt test_half_ARE.bed output
Parameters:
  HM_FORMAT pdf
  DIST_METRIC kendall
  HALF_WINDOW_SIZE 1000
  HEAD_ROWS 2000
Extend 1000 bp to the middle of test_half_ARE.bed file
total 2000 lines loaded!
extracting signals from SEED: AR -> ./LNCaP_AR_ChIPseq_DHT_GSM686917.100M.bw
[1/7] extracting signals from file: GROseq -> ./LNCaP_GROseq_DHT_GSM686949.bw ...
[2/7] extracting signals from file: H3K4me1 -> ./LNCaP_H3K4me1_ChIPseq_DHT_GSM686928.bw ...
[3/7] extracting signals from file: H2AZ -> ./LNCaP_H2A.Z_ChIPseq_DHT_GSM686941.bw ...
[4/7] extracting signals from file: H3K4me2 -> ./LNCaP_H3K4me2_ChIPseq_DHT_GSM686932.bw ...
[5/7] extracting signals from file: H3K27ac -> ./LNCaP_H3K27ac_ChIPseq_DHT_GSM686937.bw ...
[6/7] extracting signals from file: FoxA1 -> ./LNCaP_FoxA1_ChIPseq_DHT_GSM686926.bw ...
[7/7] extracting signals from file: H3K27me3 -> ./LNCaP_H3K27me3_DHT_GSM969571.bw ...
R script that generates the heatmap and line graph. Note that for each epigenetic dataset (i.e. each row in the heatmap), all values are normalized to [0,1] for better visualization. This means that colors are NOT comparable between rows.
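To see why colors cannot be compared between rows, consider a per-row rescaling to [0,1]. The sketch below uses min-max scaling, which is an assumption: the R script may rescale differently, and `normalize_rows` is a hypothetical name used only here.

```python
def normalize_rows(matrix):
    """Rescale each row of a list-of-lists matrix to [0, 1] independently.

    Illustrative min-max scaling; the scaling used by the R script that
    Epidaurus generates is not specified here and may differ.
    """
    normalized = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        if span == 0:
            # A flat row carries no contrast; map it to all zeros.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(v - lo) / float(span) for v in row])
    return normalized


# Two datasets with very different signal ranges...
weak = [0, 5, 10]
strong = [100, 200, 300]
# ...end up with identical normalized rows, so the same color can stand
# for very different raw signal values in different rows.
print(normalize_rows([weak, strong]))
```

Each row uses its own minimum and maximum, so the brightest cell in one row may correspond to a much lower raw signal than the brightest cell in another.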
Raw data extracted from bigwig files.
Heatmap showing epigenome profile centered on input genome regions
Overlaid curves showing epigenome profile centered on input genome regions
Epigenomic and Genomic landscape around Androgen Receptor (AR) binding motif (ARE)
Epigenomic and Genomic landscape around CTCF binding motifs
Config file defines the program parameters and epigenetic datasets needed to run Epidaurus
Config file specifications
- Lines starting with “!” define parameter-argument pairs. The parameter and argument must be separated by a space or Tab. For example: !HALF_WINDOW_SIZE 1000
- Lines starting with “%” define bigWig files. The keyword and data path must be separated by a space or Tab. Keywords must be unique across the config file and should be as concise as possible.
- The line starting with “&” defines the seed bigWig file. The distance (measured by correlation) between the seed bigWig file and each of the other datasets is calculated, and the order of the datasets on the heatmap is determined by this distance. Only 1 seed bigWig file is allowed.
- All other lines will be skipped.
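The rules above can be sketched as a small parser. This is a minimal illustration of the file format as described, not the parser used by epidaurus.py; `parse_config` and its return shape are hypothetical, and the sketch assumes that the uniqueness rule also covers the seed keyword.

```python
def parse_config(lines):
    """Parse Epidaurus-style config lines.

    Returns (params, seed, datasets):
      params   -- dict built from '!' lines, values kept as strings
      seed     -- (keyword, path) from the single '&' line, or None
      datasets -- list of (keyword, path) from '%' lines, in file order
    Hypothetical helper written for illustration only.
    """
    params, datasets, seed = {}, [], None
    keywords = set()
    for raw in lines:
        line = raw.strip()
        # Anything not starting with '!', '%' or '&' is skipped.
        if not line or line[0] not in "!%&":
            continue
        marker = line[0]
        # Split on the first run of spaces or Tabs.
        parts = line[1:].split(None, 1)
        if len(parts) != 2:
            raise ValueError("expected 'NAME VALUE' in line: %r" % raw)
        name, value = parts[0], parts[1].strip()
        if marker == "!":
            params[name] = value
            continue
        if name in keywords:
            raise ValueError("duplicate keyword: %s" % name)
        keywords.add(name)
        if marker == "&":
            if seed is not None:
                raise ValueError("only 1 seed bigWig file is allowed")
            seed = (name, value)
        else:
            datasets.append((name, value))
    return params, seed, datasets
```

Keeping the datasets as an ordered list rather than a dict preserves the order in which they appear in the file, which is convenient when reporting progress per dataset.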
=========================================================================================
Parameters
=========================================================================================

!HALF_WINDOW_SIZE 1000
Description:
  1) Specify half window size (bp). First we determine the middle point of a genomic interval, then extend half window size both upstream and downstream.
  2) Default = 1000 (bp) (recommended for better visualization)
  3) If HALF_WINDOW_SIZE is set to 0, all genomic intervals will be kept AS IS. With this option, all genomic intervals need to be the same size.
  4) 'HALF_WINDOW_SIZE' is a reserved word, do NOT change.

!HEAD_ROWS 2000
Description:
  1) Specify how many rows of the bed file are used. Default = 2000 (only the top 2000 rows will be used).
  2) 'HEAD_ROWS' is a reserved word, do NOT change.

!HM_FORMAT pdf
Description: Heatmap format. This must be one of ('pdf', 'png', 'tiff'). Default: pdf

!DIST_METRIC kendall
Description: Method to measure distance between seed data and each of the bigwig datasets. Must be one of ("pearson", "kendall", "spearman", "euclidean"). Default: kendall

=========================================================================================
bigWig files
=========================================================================================

below is the seed dataset. Indicated with '&'
&AR /data2/bsi/staff_analysis/LNCaP/data/AR.bw

below are other datasets. Indicated with '%'
%FoxA1 /data2/bsi/staff_analysis/LNCaP/data/FoxA1_ChIPseq_DHT_GSM686926.bw
%DnaseI /data2/bsi/staff_analysis/LNCaP/data/DnaseI_Methyltrienolone_GSM816634.bw
%H3K27ac /data2/bsi/staff_analysis/LNCaP/data/H3K27ac_ChIPseq_DHT_GSM686937.bw
%H3K4me1 /data2/bsi/staff_analysis/LNCaP/data/H3K4me1_ChIPseq_DHT_GSM686928.bw
%H3K27me3 /data2/bsi/staff_analysis/LNCaP/data/H3K27me3_DHT_GSM969571.bw
%H3K4me2 /data2/bsi/staff_analysis/LNCaP/data/H3K4me2_ChIPseq_DHT_GSM686932.bw
%H3K4me3 /data2/bsi/staff_analysis/LNCaP/data/H3K4me3_ChIPseq_DHT_GSM686935.bw
%H3K9me3 /data2/bsi/staff_analysis/LNCaP/data/H3K9me3_R1881_GSM353610.bw
%P300 /data2/bsi/staff_analysis/LNCaP/data/P300_ChIPseq_DHT_GSM686943.bw
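The seed-based ordering controlled by DIST_METRIC can be illustrated as follows. This sketch uses Pearson correlation (one of the listed options, not the default kendall) and takes the distance to be 1 minus the correlation; both the distance formula and the idea of a "profile" as a flat list of signal values per dataset are assumptions, and `pearson` and `order_by_distance` are hypothetical names.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant profile as uncorrelated (assumption)
    return cov / math.sqrt(var_x * var_y)


def order_by_distance(seed_profile, profiles):
    """Return dataset keywords sorted from closest to farthest from the seed.

    Distance is taken as 1 - correlation, which is an assumption of this
    sketch rather than the documented Epidaurus definition.
    """
    distances = dict(
        (name, 1.0 - pearson(seed_profile, profile))
        for name, profile in profiles.items()
    )
    return sorted(distances, key=distances.get)


# Toy signal profiles keyed by config keyword (made-up numbers).
seed = [1, 2, 3, 4]
profiles = {
    "FoxA1": [2, 4, 6, 8],     # perfectly correlated with the seed
    "H3K27me3": [4, 3, 2, 1],  # anti-correlated
    "H3K4me1": [1, 3, 2, 4],   # partially correlated
}
print(order_by_distance(seed, profiles))
```

With these toy profiles the correlated dataset is listed first and the anti-correlated one last, which is the kind of ordering the heatmap rows would follow under this distance definition.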
Since Epidaurus only takes bigWig files (a standard data format used by the UCSC Genome Browser) as input, making your own “database” is fairly simple: