DiBiG – ICBR Bioinformatics, University of Florida
with support from the McKnight Brain Research
Foundation and NIH/NCI grant CA155390.
DMAP2 is a complete computational pipeline for the analysis of genome-wide methylation data from NGS sodium-bisulfite sequencing experiments in a high-performace computing environment.
DMAP2 is based on the MOABS pipeline, but contains many improvements and additional features. For each step of the pipeline we provide a detailed description, links to the tools used, inputs, outputs, parameters, and notes about parallelization. Note: the original pipeline is designed to run on UF’s HiPerGator cluster computer and cannot be easily distributed in its original form. On the other hand all components of the pipeline are free software, and this document can therefore be used as a roadmap to implement a pipeline equivalent to DMAP2 that is customized for your specific computational environment.
Conditions, Samples, Replicates, Contrasts
The pipeline is designed to handle multiple conditions, each containing one or more samples. For example, an experiment may compare six mutant (MUT) mice with five wildtype (WT) mice. In this case we have two conditions (MUT, WT) and 11 samples of which the first six belong to the MUT condition and the rest to the WT condition. If the same biological sample has been sequenced more than once, the resulting data files are treated as replicates. Finally, the pipeline handles both single-end and paired-end reads.
The overall goal of DMAP2 is to detect differential methylation between two conditions. Each comparison between two conditions is called a contrast, and the pipeline is able to handle any number of contrasts.
Inputs: the DMAP2 pipeline is controlled by a configuration file containing information about conditions, samples, and contrasts. The configuration file lists all the samples belonging to each condition, and all the fastq files belonging to each sample (or to each replicate of each sample).
Methylation calling, the core of the DMAP2 pipeline, is performed by the CSCALL program. CSCALL needs an index of the genome for the type of methylation site being analyzed, and provides a command to build indices for a large number of different methylation sites (e.g. CG, GC, GCG, etc). The index can be built before starting the pipeline and reused for any number of runs.
Inputs: genome reference sequence in FASTA format.
Outputs: binary file containing index of genome for specified site(s).
Tools: This step is performed by the build command of CSCALL.
Parameters: the names of the sites that indices should be built for. The list of known sites can be obtained with the CSCALL sites command.
Parallelization: all CSCALL build jobs for each reference sequence can be run in parallel.
Read trimming and quality control
As a first step, input short reads should be trimmed based on quality and QC analysis should be performed. Ideally, QC should be performed both before and after trimming, to get a direct evaluation of the effect of trimming. Optionally, the number of reads before and after trimming should be counted and reported to the user.
Inputs: all fastq files.
Outputs: trimmed fastq files; FastQC reports (HTML files)
Tools: we use trimmomatic for sequence trimming and FastQC for quality control.
Parallelization: all trimmomatic jobs for each fastq file (or pair of fastq files) can be run in parallel. All FastQC jobs on the original reads can be run in parallel with trimming, while FastQC jobs on the trimmed reads should be performed after the respective trimming job is done.
Incomplete conversion filtering
In bisulfite sequencing, C nucleotides that are not part of a potentially methylated site should always be converted to T. Therefore, an excessive number of Cs in a read (excluding those in potentially methylated sites) may be an indication of incomplete bisulfite conversion. This step analyzes all reads in each fastq file, and removes those with a number of “lone” Cs over a specified threshold.
Inputs: all fastq files.
Outputs: filtered fastq files, report showing percentage of reads removed.
Tools: This step is performed by the filter command of CSCALL.
Parameters: the maximum allowed number of unconverted Cs can be specified either as an absolute value, or as a fraction of the read length. The user should also specify the type of methylation site being analyzed (e.g., CG) so that C in these sites are ignored during filtering.
Parallelization: all CSCALL jobs for each fastq file (or pair of fastq files) can be run in parallel.
Bisulfite-aware mapping of short reads to the genome is performed using BSMAP. BSMAP is a very fast and sensitive
Inputs: all fastq files.
Outputs: one BAM file for each sample; report showing aligned reads percentage.
Tools: BSMAP, bamtools.
Parallelization: all BSMAP jobs for each fastq file (or pair of fastq files) can be run in parallel. All BAM files for each sample should then be merged into a single BAM file using bamtools.
Methylation calling is performed by the CSCALL call command. This command takes as inputs all the BAM files for a sample and builds a multiple pileup using samtools. CSCALL then examines the pileup at all locations contained in the genome index being used, corresponding to all occurrences of the methylation site of interest. At each site, CSCALL computes coverage (number of reads aligning at that position) and conversion rate (number of reads containing converted Cs over total number of reads) for each sample. A site is reported in output if its coverage is at least d in at least m samples, where d and m are parameters set by the user. If a site meets this condition, its overall methylation rate (total number of unconverted Cs over coverage) is reported to the output file.
Inputs: all BAM files for a sample.
Outputs: a BED file containing coordinates and methylation rate of each methylation site; additional summary files and reports.
Tools: This step is performed by the call command of CSCALL.
Parameters: the user should specify the genome index corresponding to the methylation site of interest, and the d and m parameters to control site filtering.
Parallelization: all CSCALL jobs for each sample can be run in parallel.
Differential methylation analysis is performed by MCOMP, part of the MOABS pipeline. MOABS takes as inputs the BED files produced by CSCALL for the conditions in each contrast (test and control), identifies sites present in both files, and compares them producing a differential methylation value with an associated P-value.
Since the output of MOABS is unnecessarily verbose, it is processed by a simple script to extract only the following columns: chromosome and position of each site, methylation level in control sample, methylation level in test sample, difference of methylation levels (test-control), P-value of the difference, and call. The call is determined on the basis of a P-value threshold pval and a difference threshold diff, and can be one of the following:
|.||methylation difference is not significant (P-value > pval);|
|++||methylation difference is positive and larger than diff|
|+||methylation difference is positive and smaller than diff|
|–||methylation difference is negative and smaller than diff|
|– –||methylation difference is negative and larger than diff.|
The processed differential methylation files are then filtered to extract only the significant sites. Full and filtered files are finally converted to Excel format using the csvtoxls.py tool available in the csvutils package.
Inputs: all BED files for the test and control samples in each contrast.
Outputs: for each contrast, a tab-delimited file containing all differentially methylated sites, and one containing only sites with a significant difference; Excel versions of the same files.
Tools: MCOMP, csvtoxls.py.
Parameters: the user can specify the P-value threshold pval and significance threshold diff.
Parallelization: all MCOMP jobs for each contrast can be run in parallel.
Gene-based methylation analysis
Differential methylation in genes is assessed by reading gene definitions from an annotation database (e.g., a GTF file) and identifying differentially methylated sites that lie within the desired gene regions (e.g. promoter, exons). This step is performed by the genediffmeth program. Genediffmeth reads annotations from GTF, GFF, Genbank or refFlat files, and has options to specify the gene region(s) of interest (any combination promoter, exons, introns, downstream), the minimum number of sites desired in each region, and how to combine the sites into a single score (e.g. average, absolute value average, highest value).
The output of genediffmeth is a tab-delimited file with the following columns: gene name, transcript accession, chromosome, start and end position, number of sites, combined methylation score.
Inputs: differential methylation files; gene annotation files.
Outputs: for each contrast, a tab-delimited file summarizing differentially methylated sites in genes; Excel versions of the same files.
Parameters: the user can specify the gene region(s) to be examined, the minimum number of sites required in each region, and how the differential methylation values for the sites in each region should be combined; multiple sets of options may be specified to perform analysis of different contexts (e.g. average methylation in promoters, highest methylation in exons, etc).
Parallelization: all genediffmeth jobs for each contrast and each set of parameters can be run in parallel.