Development
tooluniverse-rnaseq-deseq2 - Claude MCP Skill
Production-ready RNA-seq differential expression analysis using PyDESeq2. Performs DESeq2 normalization, dispersion estimation, Wald testing, LFC shrinkage, and result filtering. Handles multi-factor designs, multiple contrasts, batch effects, and integrates with gene enrichment (gseapy) and ToolUniverse annotation tools (UniProt, Ensembl, OpenTargets). Supports CSV/TSV/H5AD input formats and any organism. Use when analyzing RNA-seq count matrices, identifying DEGs, performing differential expression with statistical rigor, or answering questions about gene expression changes.
SEO Guide: Enhance your AI agent with the tooluniverse-rnaseq-deseq2 tool. This Model Context Protocol (MCP) server allows Claude Desktop and other LLMs to production-ready rna-seq differential expression analysis using pydeseq2. performs deseq2 normalizat... Download and configure this skill to unlock new capabilities for your AI workflow.
Documentation
SKILL.md# RNA-seq Differential Expression Analysis (DESeq2) Differential expression analysis of RNA-seq count data using PyDESeq2, with enrichment analysis (gseapy) and gene annotation via ToolUniverse. **BixBench Coverage**: Validated on 53 BixBench questions across 15 computational biology projects. ## Domain Reasoning DESeq2 assumes that most genes are NOT differentially expressed — this is its normalization assumption. If this assumption is violated (e.g., global transcriptional shutdown, where the majority of genes genuinely decrease), size factor normalization will inflate expression in the treatment group and produce artifactually upregulated genes. Always check the MA plot: the fold-change cloud should be centered on zero across all expression levels. A systematic upward or downward shift indicates a normalization problem, not biology. ## LOOK UP DON'T GUESS - Gene identifiers and annotations: use ToolUniverse annotation tools (`MyGene_query_genes`, UniProt); do not recall gene function or pathway from memory. - Enriched pathways: run gseapy or equivalent on the actual DEG list; do not list expected pathways. - Design formula factors: inspect `metadata.columns` and `metadata[factor].unique()` from the actual data; do not assume metadata structure. - DEG thresholds: apply the values specified by the user (padj, log2FC, baseMean); do not substitute defaults without checking the question. --- ## Core Principles 1. **Data-first** - Load and validate count data and metadata BEFORE any analysis 2. **Statistical rigor** - Proper normalization, dispersion estimation, multiple testing correction 3. **Flexible design** - Single-factor, multi-factor, and interaction designs 4. **Threshold awareness** - Apply user-specified thresholds exactly (padj, log2FC, baseMean) 5. **Reproducible** - Set random seeds, document all parameters 6. **Question-driven** - Parse what the user is actually asking; extract the specific answer 7. **Enrichment integration** - Chain DESeq2 results into pathway/GO enrichment when requested ## When to Use - RNA-seq count matrices needing differential expression analysis - DESeq2, DEGs, padj, log2FC questions - Dispersion estimates or diagnostics - GO, KEGG, Reactome enrichment on DEGs - Specific gene expression changes between conditions - Batch effect correction in RNA-seq ## Required Packages ```python import pandas as pd, numpy as np from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats import gseapy as gp # enrichment (optional) from tooluniverse import ToolUniverse # annotation (optional) ``` ## Analysis Workflow ### Step 1: Parse the Question Extract: data files, thresholds (padj/log2FC/baseMean), design factors, contrast, direction, enrichment type, specific genes. See [question_parsing.md](references/question_parsing.md). ### Step 2: Load & Validate Data Load counts + metadata, ensure samples-as-rows/genes-as-columns, verify integer counts, align sample names, remove zero-count genes. See [data_loading.md](references/data_loading.md). ### Step 2.5: Inspect Metadata (REQUIRED) List ALL metadata columns and levels. Categorize as biological interest vs batch/block. Build design formula with covariates first, factor of interest last. See [design_formula_guide.md](references/design_formula_guide.md). ### Step 3: Run PyDESeq2 Set reference level via `pd.Categorical`, create `DeseqDataSet`, call `dds.deseq2()`, extract `DeseqStats` with contrast, run Wald test, optionally apply LFC shrinkage. See [pydeseq2_workflow.md](references/pydeseq2_workflow.md). **Tool boundaries**: - **Python (PyDESeq2)**: ALL DESeq2 analysis - **ToolUniverse**: ONLY gene annotation (ID conversion, pathway context) - **gseapy**: Enrichment analysis (GO/KEGG/Reactome) ### Step 4: Filter Results Apply padj, log2FC, baseMean thresholds. Split by direction if needed. See [result_filtering.md](references/result_filtering.md). ### Step 5: Dispersion Analysis (if asked) Key columns: `genewise_dispersions`, `fitted_dispersions`, `MAP_dispersions`, `dispersions`. See [dispersion_analysis.md](references/dispersion_analysis.md). ### Step 6: Enrichment (optional) Use gseapy `enrich()` with appropriate gene set library. See [enrichment_analysis.md](references/enrichment_analysis.md). ### Step 7: Gene Annotation (optional) Use ToolUniverse for ID conversion and gene context only. See [output_formatting.md](references/output_formatting.md). ## Common Patterns | Pattern | Type | Key Operation | |---------|------|---------------| | 1 | DEG count | `len(results[(padj<0.05) & (abs(lfc)>0.5)])` | | 2 | Gene value | `results.loc['GENE', 'log2FoldChange']` | | 3 | Direction | Filter `log2FoldChange > 0` or `< 0` | | 4 | Set ops | `degs_A - degs_B` for unique DEGs | | 5 | Dispersion | `(dds.var['genewise_dispersions'] < thr).sum()` | See [bixbench_examples.md](references/bixbench_examples.md) for all 10 patterns with examples. ## Error Quick Reference | Error | Fix | |-------|-----| | No matching samples | Transpose counts; strip whitespace | | Dispersion trend no converge | `fit_type='mean'` | | Contrast not found | Check `metadata['factor'].unique()` | | Non-integer counts | Round to int OR use t-test | | NaN in padj | Independent filtering removed genes | See [troubleshooting.md](references/troubleshooting.md) for full debugging guide. ## Interpretation Framework ### DESeq2 Result Interpretation | Metric | Threshold | Interpretation | |--------|-----------|---------------| | **padj** | < 0.05 | Statistically significant after multiple testing correction | | **log2FoldChange** | > 1 or < -1 | Biologically meaningful fold change (2x up or down) | | **baseMean** | > 10 | Gene is expressed at detectable levels | | **lfcSE** | < 1.0 | Fold change estimate is precise | ### Evidence Grading for DEGs | Grade | Criteria | Action | |-------|---------|--------| | **Strong DEG** | padj < 0.01, |LFC| > 1.5, baseMean > 100 | High-confidence; report and annotate | | **Moderate DEG** | padj < 0.05, |LFC| > 1.0, baseMean > 10 | Standard cutoff; include in enrichment | | **Weak DEG** | padj < 0.1 or |LFC| 0.5-1.0 | Suggestive; note but don't prioritize | | **Not significant** | padj >= 0.1 | Do not report as differentially expressed | ### Synthesis Questions 1. **How many DEGs and in which direction?** (up vs down ratio indicates biological response type) 2. **What pathways are enriched?** (GO/KEGG enrichment of DEGs reveals mechanism) 3. **Are the top DEGs biologically plausible?** (known markers for the condition?) 4. **Is the fold change magnitude realistic?** (LFC > 5 is unusual; check for outlier-driven effects) 5. **Are there batch effects?** (PCA should separate by condition, not by batch) --- ## Known Limitations - **PyDESeq2 vs R DESeq2**: Numerical differences exist for very low dispersion genes (<1e-05). For exact R reproducibility, use rpy2. - **gseapy vs R clusterProfiler**: Results may differ. See [r_clusterprofiler_guide.md](references/r_clusterprofiler_guide.md). ## Reference Files - [question_parsing.md](references/question_parsing.md) - Extract parameters from questions - [data_loading.md](references/data_loading.md) - Data loading and validation - [design_formula_guide.md](references/design_formula_guide.md) - Multi-factor design decision tree - [pydeseq2_workflow.md](references/pydeseq2_workflow.md) - Complete PyDESeq2 code examples - [result_filtering.md](references/result_filtering.md) - Advanced filtering and extraction - [dispersion_analysis.md](references/dispersion_analysis.md) - Dispersion diagnostics - [enrichment_analysis.md](references/enrichment_analysis.md) - GO/KEGG/Reactome workflows - [output_formatting.md](references/output_formatting.md) - Format answers correctly - [bixbench_examples.md](references/bixbench_examples.md) - All 10 question patterns - [troubleshooting.md](references/troubleshooting.md) - Common issues and debugging - [r_clusterprofiler_guide.md](references/r_clusterprofiler_guide.md) - R clusterProfiler via rpy2 ## Utility Scripts - [format_deseq2_output.py](scripts/format_deseq2_output.py) - Output formatters - [load_count_matrix.py](scripts/load_count_matrix.py) - Data loading utilities
Signals
Information
- Repository
- mims-harvard/ToolUniverse
- Author
- mims-harvard
- Last Sync
- 5/10/2026
- Repo Updated
- 5/10/2026
- Created
- 2/19/2026
Reviews (0)
No reviews yet. Be the first to review this skill!
Related Skills
upgrade-nodejs
Upgrading Bun's Self-Reported Node.js Version
cursorrules
CrewAI Development Rules
CLAUDE
CLAUDE.md
fastmcp-client-cli
Query and invoke tools on MCP servers using fastmcp list and fastmcp call. Use when you need to discover what tools a server offers, call tools, or integrate MCP servers into workflows.
Related Guides
Mastering the Oracle CLI: A Complete Guide to the Claude Skill for Database Professionals
Learn how to use the oracle Claude skill. Complete guide with installation instructions and examples.
Python Django Best Practices: A Comprehensive Guide to the Claude Skill
Learn how to use the python django best practices Claude skill. Complete guide with installation instructions and examples.
Mastering Python and TypeScript Development with the Claude Skill Guide
Learn how to use the python typescript guide Claude skill. Complete guide with installation instructions and examples.