Testing

tooluniverse-epigenomics - Claude MCP Skill

Production-ready genomics and epigenomics data processing for BixBench questions. Handles methylation array analysis (CpG filtering, differential methylation, age-related CpG detection, chromosome-level density), ChIP-seq peak analysis (peak calling, motif enrichment, coverage stats), ATAC-seq chromatin accessibility, multi-omics integration (expression + methylation correlation), and genome-wide statistics. Pure Python computation (pandas, scipy, numpy, pysam, statsmodels) plus ToolUniverse annotation tools (Ensembl, ENCODE, SCREEN, JASPAR, ReMap, RegulomeDB, ChIPAtlas). Supports BED, BigWig, methylation beta-value matrices, Illumina manifest files, and multi-sample clinical data. Use when processing methylation data, ChIP-seq peaks, ATAC-seq signals, or answering questions about CpG sites, differential methylation, chromatin accessibility, histone marks, or epigenomic statistics.

SEO Guide: Enhance your AI agent with the tooluniverse-epigenomics tool. This Model Context Protocol (MCP) server allows Claude Desktop and other LLMs to production-ready genomics and epigenomics data processing for bixbench questions. handles methylatio... Download and configure this skill to unlock new capabilities for your AI workflow.

🌟16 stars • 173 forks
📥0 downloads

Documentation

SKILL.md
# Genomics and Epigenomics Data Processing

Production-ready computational skill for processing and analyzing epigenomics data. Combines local Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for regulatory context. Designed to solve BixBench-style questions about methylation, ChIP-seq, ATAC-seq, and multi-omics integration.

## When to Use This Skill

**Triggers**:
- User provides methylation data (beta-value matrices, Illumina arrays) and asks about CpG sites
- Questions about differential methylation analysis
- Age-related CpG detection or epigenetic clock questions
- Chromosome-level methylation density or statistics
- ChIP-seq peak files (BED format) with analysis questions
- ATAC-seq chromatin accessibility questions
- Multi-omics integration (expression + methylation, expression + ChIP-seq)
- Genome-wide epigenomic statistics
- Questions mentioning "methylation", "CpG", "ChIP-seq", "ATAC-seq", "histone", "chromatin", "epigenetic"
- Questions about missing data across clinical/genomic/epigenomic modalities
- Regulatory element annotation for processed epigenomic data

**Example Questions**:
1. "How many patients have no missing data for vital status, gene expression, and methylation data?"
2. "What is the ratio of filtered age-related CpG density between chromosomes?"
3. "What is the genome-wide average chromosomal density of unique age-related CpGs per base pair?"
4. "How many CpG sites show significant differential methylation (padj < 0.05)?"
5. "What is the Pearson correlation between methylation and expression for gene X?"
6. "How many ChIP-seq peaks overlap with promoter regions?"
7. "What fraction of ATAC-seq peaks are in enhancer regions?"
8. "Which chromosome has the highest density of hypermethylated CpGs?"
9. "Filter CpG sites by variance > threshold and map to nearest genes"
10. "What is the average beta value difference between tumor and normal for chromosome 17?"

**NOT for** (use other skills instead):
- Gene regulation lookup without data files -> Use existing epigenomics annotation pattern
- RNA-seq differential expression -> Use `tooluniverse-rnaseq-deseq2`
- Variant calling/annotation from VCF -> Use `tooluniverse-variant-analysis`
- Gene enrichment analysis -> Use `tooluniverse-gene-enrichment`
- Protein structure analysis -> Use `tooluniverse-protein-structure-retrieval`

---

## Required Python Packages

```python
# Core (MUST be available)
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.multitest as mt

# Optional but useful
import pysam      # BAM/CRAM file access
import gseapy     # Enrichment of genes from methylation analysis

# ToolUniverse (for annotation)
from tooluniverse import ToolUniverse
```

---

## Key Principles

1. **Data-first approach** - Load and inspect data files BEFORE any analysis
2. **Question-driven** - Parse what the user is actually asking and extract the specific numeric answer
3. **File format detection** - Automatically detect methylation arrays, BED files, BigWig, clinical data
4. **Coordinate system awareness** - Track genome build (hg19, hg38, mm10), handle chr prefix differences
5. **Statistical rigor** - Proper multiple testing correction, effect size filtering, sample size awareness
6. **Missing data handling** - Explicitly report and handle NaN/missing values
7. **Chromosome normalization** - Always normalize chromosome names (chr1 vs 1, chrX vs X)
8. **CpG site identification** - Parse Illumina probe IDs (cg/ch probes), genomic coordinates
9. **Report-first** - Create output file first, populate progressively
10. **English-first queries** - Use English in all tool calls

---

## Workflow Overview

### Phase 0: Question Parsing and Data Discovery

Before writing any code, parse the question to identify:
- What data files are available (methylation, ChIP-seq, ATAC-seq, clinical, expression, manifest)
- What specific statistic or answer is being asked for
- What thresholds apply (significance, effect size, variance, chromosome filters)
- What genome build to use

Categorize files by scanning for keywords: methyl/beta/cpg/illumina, chip/peak/narrowpeak, atac/accessibility, clinical/patient/sample, express/rnaseq/fpkm, manifest/annotation/probe.

See `ANALYSIS_PROCEDURES.md` for the full decision tree and parameter extraction table.

### Phase 1: Methylation Data Processing

Core functions for methylation analysis:
- **Load methylation data** - Supports CSV, TSV, parquet, HDF5; auto-detects beta vs M-values
- **Load probe manifest** - Illumina 450K/EPIC manifest with chromosome, position, gene annotation
- **CpG filtering** - Filter by variance, missing rate, probe type (cg/ch), chromosome, CpG island relation, gene group
- **Differential methylation** - T-test/Wilcoxon/KS between groups with FDR correction; identify DMPs (hyper/hypo)
- **Age-related CpG analysis** - Pearson/Spearman correlation of probes with age, FDR correction
- **Chromosome-level density** - CpG count per chromosome divided by chromosome length; density ratios; genome-wide average

See `CODE_REFERENCE.md` Phase 1 for full function implementations.

### Phase 2: ChIP-seq Peak Analysis

- **Load BED/narrowPeak/broadPeak** - Auto-detect format, normalize chromosomes
- **Peak statistics** - Count, length distribution, signal values, q-values
- **Peak annotation** - Map peaks to nearest gene, classify as promoter/gene_body/proximal/distal
- **Peak overlap** - Pure Python interval intersection between two BED files; Jaccard similarity

See `CODE_REFERENCE.md` Phase 2 for full function implementations.

### Phase 3: ATAC-seq Analysis

- **Load ATAC peaks** - Wrapper around BED loader for narrowPeak format
- **ATAC-specific stats** - Nucleosome-free region (NFR) detection (<150bp peaks), region classification
- **Chromatin accessibility by region** - Distribution of open chromatin across promoter/enhancer/intergenic

See `CODE_REFERENCE.md` Phase 3 for full function implementations.

### Phase 4: Multi-Omics Integration

- **Methylation-expression correlation** - Align samples, compute per-probe-gene Pearson/Spearman with FDR
- **ChIP-seq + expression** - Find genes with promoter peaks and compare expression levels

See `CODE_REFERENCE.md` Phase 4 for full function implementations.

### Phase 5: Clinical Data Integration

- **Missing data analysis** - Count samples present across clinical, expression, and methylation modalities
- **Complete case identification** - Find intersection of samples with non-missing values for specified variables

See `CODE_REFERENCE.md` Phase 5 for full function implementations.

### Phase 6: ToolUniverse Annotation

Use ToolUniverse tools to add biological context after computational analysis:
- **Gene annotation** - Ensembl lookup for coordinates, biotype, cross-references
- **Regulatory elements** - SCREEN cCREs (enhancers, promoters, insulators)
- **ChIPAtlas** - Query available ChIP-seq experiments by antigen/cell type
- **Ensembl regulatory features** - Annotate genomic regions with regulatory overlaps

See `CODE_REFERENCE.md` Phase 6 and `TOOLS_REFERENCE.md` for parameters.

### Phase 7: Genome-Wide Statistics

- **Comprehensive methylation stats** - Global mean/median beta, probe variance, chromosome density
- **Differential methylation summary** - Count significant, hyper/hypo split, effect sizes

See `CODE_REFERENCE.md` Phase 7 for full function implementations.

---

## Common Analysis Patterns

| Pattern | Input | Key Steps | Output |
|---------|-------|-----------|--------|
| Differential methylation | Beta matrix + clinical | Filter probes -> define groups -> t-test -> FDR -> threshold | Count of significant DMPs |
| Age-related CpG density | Beta matrix + manifest + ages | Correlate with age -> FDR -> map to chr -> density per chr | Density ratio between chromosomes |
| Multi-omics missing data | Clinical + expression + methylation | Extract sample IDs -> intersect -> check NaN | Complete case count |
| ChIP-seq peak annotation | BED/narrowPeak + gene annotation | Load peaks -> annotate to genes -> classify regions | Fraction in promoters |
| Methylation-expression | Beta matrix + expression + probe-gene map | Align samples -> correlate -> FDR | Significant anti-correlations |

See `ANALYSIS_PROCEDURES.md` for detailed step-by-step flows and edge case handling.

---

## Key Functions Reference

| Function | Purpose | Input | Output |
|----------|---------|-------|--------|
| `load_methylation_data()` | Load beta/M-value matrix | file path | DataFrame |
| `detect_methylation_type()` | Detect beta vs M-values | DataFrame | 'beta' or 'mvalue' |
| `filter_cpg_probes()` | Filter probes by criteria | DataFrame + filters | filtered DataFrame |
| `differential_methylation()` | DM analysis between groups | beta + samples | DataFrame with padj |
| `identify_age_related_cpgs()` | Age-correlated CpGs | beta + ages | DataFrame with padj |
| `chromosome_cpg_density()` | CpG density per chromosome | probes + manifest | density DataFrame |
| `genome_wide_average_density()` | Overall genome density | density DataFrame | float |
| `chromosome_density_ratio()` | Ratio between chromosomes | density + chr names | float |
| `load_bed_file()` | Load BED/narrowPeak | file path | DataFrame |
| `peak_statistics()` | Basic peak stats | BED DataFrame | dict |
| `annotate_peaks_to_genes()` | Annotate peaks to genes | peaks + genes | annotated DataFrame |
| `find_overlaps()` | Peak overlap analysis | two BED DataFrames | overlap DataFrame |
| `missing_data_analysis()` | Cross-modality completeness | multiple DataFrames | dict |
| `correlate_methylation_expression()` | Meth-expression correlation | beta + expression | correlation DataFrame |

---

## ToolUniverse Tools Used

### Regulatory Annotation Tools
- `ensembl_lookup_gene` - Gene coordinates, biotype (REQUIRES `species='homo_sapiens'`)
- `ensembl_get_regulatory_features` - Regulatory features by region (NO "chr" prefix in region)
- `ensembl_get_overlap_features` - Gene/transcript overlap data
- `SCREEN_get_regulatory_elements` - cCREs: enhancers, promoters, insulators
- `ReMap_get_transcription_factor_binding` - TF binding sites
- `RegulomeDB_query_variant` - Variant regulatory score
- `jaspar_search_matrices` - TF binding matrices
- `ENCODE_search_experiments` - Experiment metadata (assay_title must be "TF ChIP-seq" not "ChIP-seq")
- `ChIPAtlas_get_experiments` - ChIP-seq experiments (REQUIRES `operation` param)
- `ChIPAtlas_search_datasets` - Dataset search (REQUIRES `operation` param)
- `ChIPAtlas_enrichment_analysis` - Enrichment from BED/motifs/genes
- `ChIPAtlas_get_peak_data` - Peak data download (REQUIRES `operation` param)
- `FourDN_search_data` - Chromatin conformation data (REQUIRES `operation` param)

### Gene Annotation Tools
- `MyGene_query_genes` - Gene query
- `MyGene_batch_query` - Batch gene query
- `HGNC_get_gene_info` - Gene symbol, aliases, IDs
- `GO_get_annotations_for_gene` - GO annotations

See `TOOLS_REFERENCE.md` for full parameter details and return schemas.

---

## Data Format Notes

- **Methylation data**: Probes (rows) x samples (columns), beta values 0-1
- **BED files**: Tab-separated, 0-based half-open coordinates
- **narrowPeak**: 10-column BED extension with signalValue, pValue, qValue, peak
- **Illumina manifests**: Probe ID, chromosome, position, gene annotation
- **Clinical data**: Patient/sample-centric with clinical variables as columns

## Genome Builds Supported

| Build | Species | Autosomes | Sex Chromosomes |
|-------|---------|-----------|-----------------|
| hg38 (GRCh38) | Human | chr1-chr22 | chrX, chrY |
| hg19 (GRCh37) | Human | chr1-chr22 | chrX, chrY |
| mm10 (GRCm38) | Mouse | chr1-chr19 | chrX, chrY |

## Limitations

- No native pybedtools: uses pure Python interval operations
- No native pyBigWig: cannot read BigWig files directly without package
- No R bridge: does not use methylKit, ChIPseeker, or DiffBind
- Illumina-centric: methylation functions designed for 450K/EPIC arrays
- Uses t-test/Wilcoxon for differential methylation (not limma/bumphunter)
- No peak calling: assumes peaks are pre-called
- API rate limits: ToolUniverse annotation limited to ~20 genes per batch

---

## Reference Files

- `CODE_REFERENCE.md` - Full Python function implementations for all phases
- `TOOLS_REFERENCE.md` - ToolUniverse tool parameter details and return schemas
- `ANALYSIS_PROCEDURES.md` - Decision trees, step-by-step analysis patterns, edge cases, fallback strategies
- `QUICK_START.md` - Quick start examples for common analysis types

Signals

Avg rating0.0
Reviews0
Favorites0

Information

Repository
mims-harvard/ToolUniverse
Author
mims-harvard
Last Sync
3/12/2026
Repo Updated
3/12/2026
Created
2/16/2026

Reviews (0)

No reviews yet. Be the first to review this skill!