Case study
This tutorial shows you how to perform a complex bicycle analysis using a real case study dataset. For simplicity, this tutorial uses the bicycle Docker image, so make sure that Docker is already installed on your computer.
These steps are also available in this bash script. Alternatively, if you can't use Docker, you can use this bash script, which you must edit to provide the right paths to the bicycle command and to the samtools and bowtie2 binary directories.
The case study uses a public dataset from this paper by Bernstein et al. (2015), which describes a new targeted bisulfite sequencing approach. The dataset consists of five non-diabetic and five type 2 diabetic samples from human pancreatic islets, sequenced with an Illumina MiSeq sequencer (14.15 million 150 base pair paired-end reads). The aim of that study was to determine the methylation levels in particular regions of five genes (MEG3, INS, IRS1, CDKN1A and PDE7B) with implications in this metabolic disorder. This case study shows how bicycle can reproduce those results.
Timing: ≈ 3 h 35 min (excluding the data download step, whose duration depends on the internet connection)
Equipment: Ubuntu 14.04.3 LTS, 4 cores (Intel(R) Core(TM) i5 @ 2.20GHz), 16GB of RAM and SSD disk.
1. Declare a variable with the bicycle command
Run the following instruction to define the bicycle command under Docker. The data-case-study directory inside your current working directory (pwd) will be seen inside Docker as /data. The parameter -Xmx16G sets the maximum memory to use; by default, the JVM takes around 25% of the total physical memory:
alias bicycle="docker run -e JVM_ARGS=-Xmx16G -v `pwd`/data-case-study:/data -u `id -u \`whoami\`` -it singgroup/bicycle bicycle"
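Note that bash does not expand aliases in non-interactive scripts unless expand_aliases is enabled, and that the alias resolves pwd once, at the moment it is defined. If you want to run the steps from a script, a shell function used instead of the alias is one possible alternative. The following is only a sketch, not something shipped with bicycle; it passes the same docker options as the alias and evaluates the working directory on every call.

```shell
# Sketch of a function equivalent to the alias above (same docker options).
# Unlike the alias, $(pwd) is evaluated each time the function is called.
# Define this INSTEAD of the alias, not in addition to it.
bicycle() {
  docker run -e JVM_ARGS=-Xmx16G \
    -v "$(pwd)/data-case-study:/data" \
    -u "$(id -u)" \
    -it singgroup/bicycle bicycle "$@"
}
```

Because the arguments are forwarded with "$@", the commands in the following steps can be typed exactly as shown.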
2. Download the case study data
mkdir -p data-case-study
wget http://static.sing-group.org/software/bicycle/data/case_study/raw_data.zip
unzip raw_data.zip -d data-case-study
wget http://static.sing-group.org/software/bicycle/data/case_study/reference_genome.zip
unzip reference_genome.zip -d data-case-study
wget http://static.sing-group.org/software/bicycle/data/case_study/target_regions.bed -O data-case-study/target_regions.bed
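Before continuing, it may be worth confirming that the downloads unpacked as expected. The sketch below uses a hypothetical helper, check_paths, and assumes that the archives unpack to the raw_data and referenceGenome directories that step 3 refers to.

```shell
# Report which of the given paths are missing; return non-zero if any is.
check_paths() {
  missing=0
  for item in "$@"; do
    if [ ! -e "$item" ]; then
      echo "missing: $item" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Expected layout, inferred from the paths declared in step 3 (an assumption).
if check_paths data-case-study/raw_data \
               data-case-study/referenceGenome \
               data-case-study/target_regions.bed; then
  echo "all case study inputs are in place"
else
  echo "some inputs are missing; re-run the download commands" >&2
fi
```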
3. Declare the Docker data directories
REFERENCE_GENOME="data/referenceGenome"
TARGET_REGIONS="data/target_regions.bed"
SAMPLES_DIR="data/raw_data"
PROJECT_DIR="data/bicycle-case-study-project"
4. Create a project
bicycle create-project -p $PROJECT_DIR -r $REFERENCE_GENOME -f $SAMPLES_DIR --paired-mate1-regexp _1.fastq
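Judging by its name, the --paired-mate1-regexp option identifies the files that hold the first mate of each read pair. Before creating a project with your own data, it can help to confirm that every such file has a partner. The sketch below assumes that the second mate uses the same name ending in _2.fastq (a common naming scheme, but an assumption here) and that all files sit directly in one directory; list_unpaired is a hypothetical helper, not a bicycle command.

```shell
# Print mate-1 files in a directory that lack a matching mate-2 file.
# Assumes mates are named <sample>_1.fastq and <sample>_2.fastq.
list_unpaired() {
  for mate1 in "$1"/*_1.fastq; do
    [ -e "$mate1" ] || continue          # the glob matched nothing
    mate2="${mate1%_1.fastq}_2.fastq"
    [ -e "$mate2" ] || echo "$mate1"
  done
}

# Run on the host copy of the samples; no output means every mate 1 has a mate 2.
list_unpaired data-case-study/raw_data
```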
5. Align reads to both references
bicycle align -p $PROJECT_DIR -t 4 --bowtie2-quals phred33 --bowtie2-I 0 --bowtie2-X 500
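The --bowtie2-quals phred33 option declares the quality encoding of the reads. If you are unsure about the encoding of your own data, a rough heuristic is that quality characters with ASCII codes 33 to 58 (! to :) only occur in Phred+33 files. The following sketch applies that heuristic to the first reads of an uncompressed FASTQ file. The helper guess_quals and the example file name are hypothetical, and a four-lines-per-record layout is assumed.

```shell
# Guess the FASTQ quality encoding from the first 1000 reads (4000 lines).
# Prints "phred33" if any quality character is in the ASCII range 33-58,
# otherwise "undetermined" (could be Phred+64 or high-quality Phred+33).
guess_quals() {
  head -n 4000 "$1" | LC_ALL=C awk '
    NR % 4 == 0 && /[!-:]/ { found = 1 }
    END { print (found ? "phred33" : "undetermined") }'
}

# Example call (hypothetical file name):
# guess_quals data-case-study/raw_data/SRR2052487_1.fastq
```

An "undetermined" answer does not rule out Phred+33; it only means that no low-valued quality character was seen in the sampled reads.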
6. Perform methylation analysis and methylcytosine calling
bicycle analyze-methylation -p $PROJECT_DIR -t 4 -n 1 --remove-ambiguous --only-with-one-alignment -b $TARGET_REGIONS
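The -b option points to the target regions in BED format, in which each record starts with a chromosome name, a start coordinate and an end coordinate. When you supply your own regions, a quick sanity check might look like the sketch below. The helper check_bed is hypothetical; it only inspects the first three columns and skips track, browser and comment lines.

```shell
# Report BED lines whose first three fields are not "name start end"
# with non-negative integer coordinates and start <= end.
check_bed() {
  awk '
    /^(track|browser|#)/ || NF == 0 { next }
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
      printf "line %d is not a valid BED record: %s\n", NR, $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }' "$1"
}

# Example call on the host copy of the target regions:
# check_bed data-case-study/target_regions.bed
```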
7. Perform the differential methylation analysis
bicycle analyze-differential-methylation -p $PROJECT_DIR -c SRR2052487,SRR2052488,SRR2052489,SRR2052490,SRR2052491 -t SRR2052492,SRR2052493,SRR2052494,SRR2052495,SRR2052496 -b $TARGET_REGIONS
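The -c and -t options take comma-separated lists of sample names. With many samples, building those lists from one name per argument can be less error-prone than typing them by hand. The sketch below uses a hypothetical helper, join_by_comma, and its variable names assume that -c and -t stand for the control and treatment groups.

```shell
# Join all arguments with commas. The body runs in a subshell, so changing
# IFS here does not affect the calling shell.
join_by_comma() ( IFS=,; printf '%s\n' "$*" )

CONTROL_SAMPLES=$(join_by_comma SRR2052487 SRR2052488 SRR2052489 SRR2052490 SRR2052491)
TREATMENT_SAMPLES=$(join_by_comma SRR2052492 SRR2052493 SRR2052494 SRR2052495 SRR2052496)

echo "-c $CONTROL_SAMPLES"
echo "-t $TREATMENT_SAMPLES"

# The lists can then be passed to the command shown above:
# bicycle analyze-differential-methylation -p $PROJECT_DIR -c "$CONTROL_SAMPLES" -t "$TREATMENT_SAMPLES" -b $TARGET_REGIONS
```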
8. Check the results
Results are generated in the directory data-case-study/bicycle-case-study-project/output. Navigate into this directory to explore them. The Manual section explains how to interpret the results.
Several output files are generated. The most important ones from the user's point of view are the following:
- *.summary (one per sample): provides a methylation summary for each sample, highlighting error rates, called methylcytosines and methylation values for each called methylcytosine in each methylation context (CG, CHG and CHH), globally and separately by Watson and Crick strands, and finally the adjusted p-values.
- *.methylcytosines (one per sample): contains all methylation details within a sample for each cytosine in the genome: chromosome, position, strand, methylation context, read depth, CT depth, cytosine count, beta score, pileup, p-value, context correction information, the genomic region(s) to which it belongs (if genomic annotations were provided), and the methylation status according to the p-value (methylated/unmethylated).
- *.vcf (one per sample): a file with similar information as the *.methylcytosines file, but in VCF format (useful for VCF viewers, for instance).
- *.METHYLATEDregions: summary of methylation levels per annotated region, for the Watson and Crick strands separately and globally. The methylation level of a region is calculated as the weighted mean of cytosine methylation (WMCM). This information is provided for each methylation context (CG, CHG and CHH).
- *.DMC.tsv: tabulated text file that provides differential methylation for each cytosine (DMC) if two different conditions were compared or tested for differential methylation. For each cytosine, the following information is written: chromosome and cytosine position, methylation context, cytosine methylation for each sample (replicate) of the treatment condition and for each sample (replicate) of the control condition, methylation average for each condition, fold-change of treatment/control methylation in log2, p-value and q-value (corrected p-value).
- *.DMR.tsv: tabulated text file that provides differential methylation for each annotated region that was provided. The information written here is similar to that in *.DMC.tsv, but at the genomic region level.
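To focus on significant differences, the *.DMC.tsv and *.DMR.tsv tables can be filtered with standard tools. The sketch below keeps the header and the rows whose q-value is at or below a threshold. It assumes a tab-separated file with one header line and the q-value in the last column, following the column order described above; check the header of your own files before relying on it. The helper filter_by_qvalue and the example file name are hypothetical.

```shell
# Print the header plus the rows whose last column (assumed to be the
# q-value) is <= the threshold. Rows with a non-numeric last column
# (for example NA) are skipped.
filter_by_qvalue() {
  awk -F '\t' -v max="$2" '
    NR == 1 { print; next }
    $NF ~ /^[0-9.eE+-]+$/ && $NF + 0 <= max + 0' "$1"
}

# Example call (hypothetical file name):
# filter_by_qvalue data-case-study/bicycle-case-study-project/output/example.DMC.tsv 0.05
```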