Epigenomics is the study of heritable changes in gene expression that do not involve alterations to the DNA sequence itself. Unlike genomics — which focuses on the raw nucleotide sequence — epigenomics examines the chemical modifications and chromatin architecture that control which genes are active in which cell types. For bioinformaticians and computational biologists, self-hosting epigenomics analysis tools means reproducible workflows, data sovereignty for sensitive patient-derived datasets, and the ability to scale analysis pipelines across institutional compute clusters.

In this guide, we compare three foundational open-source tools for epigenomics data analysis: MACS3 for peak calling, deepTools for quality control and visualization, and methylKit for DNA methylation analysis. These tools form the backbone of most ChIP-seq, ATAC-seq, and bisulfite sequencing workflows.

Overview of Epigenomics Data Types

Before diving into the tools, it’s worth understanding the three primary epigenomics assays:

  • ChIP-seq (Chromatin Immunoprecipitation Sequencing): Maps histone modifications and transcription factor binding sites across the genome. The core analysis step is “peak calling” — identifying genomic regions enriched with sequenced fragments.
  • ATAC-seq (Assay for Transposase-Accessible Chromatin): Identifies open chromatin regions, revealing active regulatory elements like promoters and enhancers. Peak calling on ATAC-seq data requires handling the Tn5 transposase’s characteristic cut-site offset.
  • Bisulfite Sequencing (BS-seq / RRBS): Determines DNA methylation patterns at single-base resolution by converting unmethylated cytosines to uracil. Analysis involves calculating methylation ratios at individual CpG sites.

| Feature | MACS3 | deepTools | methylKit |
|---|---|---|---|
| Primary Function | Peak calling for ChIP-seq/ATAC-seq | Quality control, normalization, visualization | Differential methylation analysis |
| GitHub Stars | 777 | 761 | 254 |
| Language | Python | Python | R |
| Input Formats | BAM, BED, SAM, BEDPE | BAM, bigWig, BED | Bismark/Samtools methylation calls |
| Installation | pip, conda, Docker | pip, conda, Docker | Bioconductor, conda (bioconda) |
| Key Strength | Model-based peak detection | Comprehensive QC dashboards | Statistical rigor for methylation |
| Latest Release | v3.0 (2024) | v3.5 (2023) | v1.30 (2024) |

MACS3: Model-Based Peak Calling

MACS3 (Model-based Analysis of ChIP-Seq) is the third major iteration of one of the most widely cited peak callers in the field. MACS3 models the shift size of ChIP-seq fragments to improve peak resolution and uses a dynamic Poisson distribution to assess enrichment significance.
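
To make the "dynamic Poisson" idea concrete, here is a minimal sketch, not MACS3's actual implementation. It assumes the local background rate is taken as the largest of a genome-wide rate and several window-based rates from the control, and that significance is the Poisson upper-tail probability of the observed count. The function names and the example rates are hypothetical.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    lower = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        lower += term
    return max(0.0, 1.0 - lower)


def local_lambda(genome_bg: float, window_rates: list[float]) -> float:
    """Dynamic lambda: the largest of the genome-wide and local control rates."""
    return max([genome_bg, *window_rates])


# Hypothetical expected read counts for one candidate region (illustration only)
lam = local_lambda(genome_bg=2.0, window_rates=[3.5, 2.8, 2.1])
p_value = poisson_upper_tail(k=15, lam=lam)
print(f"lambda_local={lam}, p={p_value:.3g}")
```

Taking the maximum makes the test conservative in regions where the control itself is locally elevated. A production implementation would work in log space, since subtracting from 1 loses precision for very small p-values.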

Installation via Docker

MACS3 is available as a Biocontainers image (built from its Bioconda package), making deployment straightforward. Check the registry for the current tag before pinning one:

# Pull the Biocontainers image
docker pull quay.io/biocontainers/macs3:3.0.2--py310hdfd78af_0

# Run MACS3 callpeak; --outdir writes results to the mounted host directory
docker run --rm -v "$(pwd)":/data quay.io/biocontainers/macs3:3.0.2--py310hdfd78af_0 \
  macs3 callpeak -t /data/ChIP.bam -c /data/Input.bam \
  -f BAM -g hs -n ChIP_sample -B -q 0.05 --outdir /data

Key Features

MACS3’s strength lies in its model-based approach: it empirically estimates fragment length from the data rather than assuming a fixed value. This matters because different ChIP protocols (native vs. crosslinked) produce different fragment distributions. With the -B flag, MACS3 also writes treatment pileup and control lambda tracks in bedGraph format. A fold-enrichment track can be derived from these with macs3 bdgcmp -m FE and converted to bigWig (for example with UCSC's bedGraphToBigWig) for visual inspection in the UCSC Genome Browser or IGV.

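For ATAC-seq, a common MACS3 setting is --nomodel --shift -75 --extsize 150, which skips fragment-size modeling and instead centers a 150 bp window on each read's 5' end, the approximate Tn5 cut site. These options apply to read-based input (-f BAM); with -f BAMPE, MACS3 uses the actual fragment spans and ignores them. --nolambda is sometimes added when no control sample is available, since it replaces the local background estimate with a fixed genome-wide one.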
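
The coordinate arithmetic behind those two parameters can be sketched as follows. This is an illustration of the shift-then-extend convention (a negative shift moves the 5' end in the 3'-to-5' direction, then the read is extended toward 3'), not code taken from MACS3; the function name is hypothetical and chromosome-boundary clamping is omitted.

```python
def extended_interval(five_prime: int, strand: str, shift: int, extsize: int) -> tuple[int, int]:
    """Return (start, end) after shifting a read's 5' end and extending toward 3'."""
    if strand == "+":
        # On the plus strand, 5'->3' runs toward larger coordinates
        start = five_prime + shift
        return (start, start + extsize)
    if strand == "-":
        # On the minus strand, 5'->3' runs toward smaller coordinates
        end = five_prime - shift
        return (end - extsize, end)
    raise ValueError(f"unknown strand: {strand!r}")


# A cut site at position 1000 gives the same centered window on either strand
plus_window = extended_interval(1000, "+", shift=-75, extsize=150)
minus_window = extended_interval(1000, "-", shift=-75, extsize=150)
print(plus_window, minus_window)  # (925, 1075) (925, 1075)
```

Setting the shift to minus half the extension size is what centers the window on the cut site, which is why the two values are usually changed together.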

deepTools: Quality Control and Visualization

deepTools addresses the most painful part of epigenomics analysis: determining whether your experiment actually worked. It provides a suite of command-line tools that generate publication-ready plots and normalized coverage tracks.

Installation

# Via conda (recommended for all dependencies; conda-forge supplies shared libraries)
conda install -c conda-forge -c bioconda deeptools

# Or via Docker
docker pull quay.io/biocontainers/deeptools:3.5.5--pyhdfd78af_0

Core Workflow

The typical deepTools workflow starts with bamCoverage to generate bigWig files, followed by computeMatrix and plotHeatmap:

# Generate normalized coverage tracks
bamCoverage -b aligned.bam -o coverage.bw \
  --normalizeUsing RPKM --binSize 10

# Compute signal matrix around transcription start sites
computeMatrix reference-point \
  -S ChIP.bw Input.bw \
  -R genes.bed \
  --referencePoint TSS \
  -b 2000 -a 2000 \
  -o matrix.gz

# Generate heatmap
plotHeatmap -m matrix.gz -o heatmap.png \
  --colorMap Blues --zMax 5 --plotTitle "ChIP-seq Signal at TSS"
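
For readers unfamiliar with the RPKM option used above, the per-bin arithmetic can be sketched in a few lines. This follows the usual definition (reads in the bin, divided by mapped reads in millions times bin length in kilobases); the function name and the example numbers are hypothetical.

```python
def rpkm_per_bin(reads_in_bin: int, total_mapped_reads: int, bin_size_bp: int) -> float:
    """Reads per kilobase per million mapped reads for a single coverage bin."""
    if total_mapped_reads <= 0 or bin_size_bp <= 0:
        raise ValueError("total_mapped_reads and bin_size_bp must be positive")
    millions_mapped = total_mapped_reads / 1e6
    bin_kb = bin_size_bp / 1e3
    return reads_in_bin / (millions_mapped * bin_kb)


# Hypothetical library: 30 reads in a 10 bp bin, 30 million mapped reads in total
value = rpkm_per_bin(reads_in_bin=30, total_mapped_reads=30_000_000, bin_size_bp=10)
print(round(value, 3))
```

Because the result scales with library size, two samples sequenced to different depths become comparable bin by bin, which is the point of normalizing before computeMatrix.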

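deepTools’ plotFingerprint command is particularly valuable: it plots the cumulative fraction of reads against genomic bins ranked by coverage, so the separation between the ChIP and input curves shows how strongly the signal is concentrated. An input sample with uniform coverage runs close to the diagonal, and a ChIP curve that tracks the input curve instead of bending toward the lower right suggests weak or failed enrichment. For replicate concordance, multiBigwigSummary computes average signal per genomic bin across samples, and plotCorrelation turns that summary into genome-wide correlations, helping identify outlier samples before downstream analysis.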
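
The idea behind a fingerprint curve can be shown with a small sketch. It is a simplified illustration rather than deepTools' implementation (which samples bins from BAM files); the function name and bin counts are hypothetical.

```python
from itertools import accumulate


def fingerprint(bin_counts: list[int]) -> list[float]:
    """Cumulative fraction of reads over bins sorted from lowest to highest coverage."""
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("no reads in any bin")
    return [running / total for running in accumulate(sorted(bin_counts))]


# Hypothetical read counts in five genomic bins
input_like = fingerprint([10, 10, 10, 10, 10])  # uniform coverage
chip_like = fingerprint([1, 1, 1, 1, 96])       # most reads in one enriched bin

print(input_like)  # [0.2, 0.4, 0.6, 0.8, 1.0]
print(chip_like)   # [0.01, 0.02, 0.03, 0.04, 1.0]
```

The uniform sample climbs along the diagonal, while the enriched sample stays near zero until the last bin. The gap between the two curves is what the plot lets you judge by eye.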

methylKit: Differential Methylation Analysis

methylKit is an R package from Bioconductor that performs statistical analysis of DNA methylation data from bisulfite sequencing experiments. It handles the unique statistical challenges of methylation data: bounded proportions (0-100%), varying coverage depths across CpG sites, and biological variability between samples.

Installation

# In R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("methylKit")

Analysis Pipeline

library(methylKit)

# Read per-CpG methylation call files in methylKit's default text format
# (e.g. from processBismarkAln); for Bismark coverage files, add
# pipeline = "bismarkCoverage"
file.list <- list("sample1_CpG.txt", "sample2_CpG.txt",
                  "control1_CpG.txt", "control2_CpG.txt")
myobj <- methRead(file.list,
                  sample.id = list("s1", "s2", "c1", "c2"),
                  assembly = "hg38",
                  treatment = c(1, 1, 0, 0),
                  context = "CpG")

# Filter low-coverage sites and the top 0.1% of coverage (likely PCR artifacts)
filtered <- filterByCoverage(myobj, lo.count=10, hi.perc=99.9)

# Normalize read coverage
normalized <- normalizeCoverage(filtered)

# Unite samples at common CpG sites
meth <- unite(normalized)

# Differential methylation test (named dm to avoid masking base::diff)
dm <- calculateDiffMeth(meth, overdispersion="MN", test="Chisq")

# Get significantly differentially methylated CpGs (>= 25% difference, q < 0.01)
dm_25pct <- getMethylDiff(dm, difference=25, qvalue=0.01)

methylKit supports multiple statistical tests including logistic regression and Fisher’s exact test. The overdispersion="MN" option corrects for biological variability that simple binomial models miss — critical for experiments with few biological replicates.
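
To show what a 25% methylation difference means at a single CpG, here is a minimal sketch. It assumes group-level methylation is computed from coverage-pooled counts, which may differ in detail from methylKit's internals. The function names and the (C count, T count) pairs are hypothetical.

```python
def percent_methylation(num_c: int, num_t: int) -> float:
    """Percent methylation at one CpG from methylated (C) and unmethylated (T) read counts."""
    coverage = num_c + num_t
    if coverage == 0:
        raise ValueError("site has no coverage")
    return 100.0 * num_c / coverage


def pooled_percent(samples: list[tuple[int, int]]) -> float:
    """Coverage-weighted percent methylation across samples at one CpG."""
    total_c = sum(c for c, _ in samples)
    total_t = sum(t for _, t in samples)
    return percent_methylation(total_c, total_t)


# Hypothetical (C count, T count) pairs for one CpG in two replicates per group
treatment = [(18, 2), (45, 15)]
control = [(6, 14), (20, 40)]

meth_diff = pooled_percent(treatment) - pooled_percent(control)
passes_cutoff = abs(meth_diff) >= 25
print(meth_diff, passes_cutoff)  # 46.25 True
```

Pooling weights deeper samples more heavily: the two treatment replicates are 90% and 75% methylated, yet the pooled value is 78.75% rather than their unweighted mean of 82.5%. That is one reason coverage filtering and normalization come before the test.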

Complementary Tools: pyBigWig and pybedtools

Two additional Python libraries deserve mention for building complete epigenomics pipelines. pyBigWig (244 stars) provides random-access reading of bigWig and bigBed files without loading entire tracks into memory. pybedtools (330 stars) wraps BEDTools in Python, enabling genomic interval operations (intersections, merges, window-based calculations) within scripted workflows.
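
As an illustration of the interval operations mentioned above, the following pure-Python sketch computes overlaps between two sets of BED-style intervals (0-based, half-open). It does not use pybedtools or BEDTools, which handle this far more efficiently on sorted files. The function name and coordinates are hypothetical.

```python
Interval = tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def intersect(a: list[Interval], b: list[Interval]) -> list[Interval]:
    """Return the overlapping portion of every pair of intervals on the same chromosome."""
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # half-open intervals that only touch do not overlap
                overlaps.append((chrom_a, start, end))
    return overlaps


# Hypothetical peaks and promoter windows
peaks = [("chr1", 100, 200), ("chr1", 300, 400)]
promoters = [("chr1", 150, 350), ("chr2", 100, 200)]

print(intersect(peaks, promoters))  # [('chr1', 150, 200), ('chr1', 300, 350)]
```

The nested loop is quadratic and only suitable for small examples. Genome-scale peak sets call for a sorted sweep or an interval tree, which is the work the BEDTools wrapper takes off your hands.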

Building a Self-Hosted Epigenomics Pipeline

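To tie these tools together, you can deploy a self-hosted analysis environment using Docker Compose with a JupyterLab frontend. Here’s a basic Compose configuration that provides base Python and R environments. MACS3 and deepTools still need to be installed into the Jupyter container (for example with pip or conda), and methylKit into the Bioconductor container (with BiocManager::install):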

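# docker-compose.yml
version: "3.8"
services:
  epigenomics-python:
    # Pin a dated tag instead of "latest" once the setup works, for reproducibility
    image: jupyter/scipy-notebook:latest
    container_name: epi-python
    ports:
      - "127.0.0.1:8888:8888"   # localhost only; use an SSH tunnel or reverse proxy for remote access
    volumes:
      - ./data:/home/jovyan/data
      - ./results:/home/jovyan/results
    # Token authentication stays enabled; read the token with: docker logs epi-python
    environment:
      - JUPYTER_ENABLE_LAB=yes

  epigenomics-r:
    image: bioconductor/bioconductor_docker:RELEASE_3_19
    container_name: epi-r
    ports:
      - "127.0.0.1:8787:8787"
    volumes:
      - ./data:/home/rstudio/data
      - ./results:/home/rstudio/results
    environment:
      - PASSWORD=changeme   # replace with a strong password before starting the stack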

Data Management for Epigenomics

Epigenomics datasets are substantial. A typical ChIP-seq experiment with 30 million reads produces ~3 GB of raw FASTQ and ~500 MB after alignment. Methylation data from whole-genome bisulfite sequencing can exceed 50 GB per sample. Plan storage accordingly — for a lab generating 20 experiments per year, budget at least 5 TB of redundant storage with automated backup to a self-hosted object store.

Why Self-Host Your Epigenomics Analysis?

Deploying epigenomics tools on your own infrastructure provides three critical advantages. First, data sovereignty: many epigenomics datasets contain identifiable genetic information subject to GDPR, HIPAA, or institutional IRB requirements. Cloud-based analysis platforms may not satisfy compliance requirements for human subjects data. Second, reproducibility: containerized self-hosted workflows with pinned software versions ensure that analyses can be exactly reproduced years later — essential for publications, regulatory submissions, and longitudinal studies. Third, cost predictability: while cloud computing offers elasticity, epigenomics analysis is computationally intensive and predictable (peak calling is CPU-bound, not bursty), making dedicated bare-metal or institutional cluster nodes more economical over multi-year research programs.

For genome assembly, see our guide to SPAdes, Canu, Flye and HiFiasm. For variant calling pipelines, see our comparison of GATK, FreeBayes and BCFtools. If you need a workflow orchestration platform for your epigenomics pipelines, check our bioinformatics workflow guide covering Galaxy, nf-core and CWL.

FAQ

What’s the difference between MACS2 and MACS3?

MACS3 is the successor to MACS2, with performance improvements and new subcommands such as hmmratac, a hidden Markov model-based peak caller for ATAC-seq data. Early MACS2 releases targeted Python 2 (the 2.2 series moved to Python 3), and active development now happens in MACS3.

Can I use MACS3 for CUT&RUN or CUT&Tag data?

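Yes. MACS3 has no dedicated CUT&RUN or CUT&Tag mode, but its general options are commonly adapted: for example --keep-dup all (with --broad for broad histone marks) for CUT&RUN, and --nomodel --extsize 200 for CUT&Tag. These newer techniques typically produce sharper peaks with lower background than traditional ChIP-seq, so thresholds tuned for ChIP-seq may need adjusting.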

How many biological replicates do I need for methylKit?

methylKit can work with as few as 2 replicates per condition, but statistical power increases substantially with 3 or more. With only 2 replicates, use the overdispersion correction (overdispersion="MN") to control false positives.

Why does deepTools’ plotFingerprint show no separation between ChIP and input?

This usually indicates a failed ChIP — either insufficient antibody, too few cells, or degraded chromatin. Verify your ChIP with qPCR at known positive and negative loci before sequencing.

Can these tools handle single-cell epigenomics data?

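MACS3 has no dedicated single-cell mode; its hmmratac subcommand targets bulk ATAC-seq. A common approach for scATAC-seq is to aggregate cells into pseudobulk profiles (for example per cluster) and call peaks on those. For single-cell methylation data, specialized tools like MethSCAn or scBS-map are recommended over methylKit.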

How do I cite these tools in publications?

Cite MACS3 as: Zhang et al. (2008) “Model-based Analysis of ChIP-Seq (MACS).” Genome Biology. Cite deepTools as: Ramírez et al. (2016) “deepTools2: a next generation web server for deep-sequencing data analysis.” Nucleic Acids Research. Cite methylKit as: Akalin et al. (2012) “methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles.” Genome Biology.

