Development

ab-test-setup - Claude MCP Skill

Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.

SEO Guide: Enhance your AI agent with the ab-test-setup tool. This Model Context Protocol (MCP) server allows Claude Desktop and other LLMs to structured guide for setting up a/b tests with mandatory gates for hypothesis, metrics, and executio... Download and configure this skill to unlock new capabilities for your AI workflow.

🌟1 stars • 0 forks

📥0 downloads

Testing

View on GitHub🔗 Claude Servers

Documentation

SKILL.md

# A/B Test Setup

## 1️⃣ Purpose & Scope

Ensure every A/B test is **valid, rigorous, and safe** before a single line of code is written.

- Prevents "peeking"
- Enforces statistical power
- Blocks invalid hypotheses

---

## 2️⃣ Pre-Requisites

You must have:

- A clear user problem
- Access to an analytics source
- Roughly estimated traffic volume

### Hypothesis Quality Checklist

A valid hypothesis includes:

- Observation or evidence
- Single, specific change
- Directional expectation
- Defined audience
- Measurable success criteria

---

### 3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

- Present the **final hypothesis**
- Specify:
  - Target audience
  - Primary metric
  - Expected direction of effect
  - Minimum Detectable Effect (MDE)

Ask explicitly:

> “Is this the final hypothesis we are committing to for this test?”

**Do NOT proceed until confirmed.**

---

### 4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

- Traffic stability
- User independence
- Metric reliability
- Randomization quality
- External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:

- Warn the user
- Recommend delaying or redesigning the test

---

### 5️⃣ Test Type Selection

Choose the simplest valid test:

- **A/B Test** – single change, two variants
- **A/B/n Test** – multiple variants, higher traffic required
- **Multivariate Test (MVT)** – interaction effects, very high traffic
- **Split URL Test** – major structural changes

Default to **A/B** unless there is a clear reason otherwise.

---

### 6️⃣ Metrics Definition

#### Primary Metric (Mandatory)

- Single metric used to evaluate success
- Directly tied to the hypothesis
- Pre-defined and frozen before launch

#### Secondary Metrics

- Provide context
- Explain _why_ results occurred
- Must not override the primary metric

#### Guardrail Metrics

- Metrics that must not degrade
- Used to prevent harmful wins
- Trigger test stop if significantly negative

---

### 7️⃣ Sample Size & Duration

Define upfront:

- Baseline rate
- MDE
- Significance level (typically 95%)
- Statistical power (typically 80%)

Estimate:

- Required sample size per variant
- Expected test duration

**Do NOT proceed without a realistic sample size estimate.**

---

### 8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation **only if all are true**:

- Hypothesis is locked
- Primary metric is frozen
- Sample size is calculated
- Test duration is defined
- Guardrails are set
- Tracking is verified

If any item is missing, stop and resolve it.

---

## Running the Test

### During the Test

**DO:**

- Monitor technical health
- Document external factors

**DO NOT:**

- Stop early due to “good-looking” results
- Change variants mid-test
- Add new traffic sources
- Redefine success criteria

---

## Analyzing Results

### Analysis Discipline

When interpreting results:

- Do NOT generalize beyond the tested population
- Do NOT claim causality beyond the tested change
- Do NOT override guardrail failures
- Separate statistical significance from business judgment

### Interpretation Outcomes

| Result               | Action                                 |
| -------------------- | -------------------------------------- |
| Significant positive | Consider rollout                       |
| Significant negative | Reject variant, document learning      |
| Inconclusive         | Consider more traffic or bolder change |
| Guardrail failure    | Do not ship, even if primary wins      |

---

## Documentation & Learning

### Test Record (Mandatory)

Document:

- Hypothesis
- Variants
- Metrics
- Sample size vs achieved
- Results
- Decision
- Learnings
- Follow-up ideas

Store records in a shared, searchable location to avoid repeated failures.

---

## Refusal Conditions (Safety)

Refuse to proceed if:

- Baseline rate is unknown and cannot be estimated
- Traffic is insufficient to detect the MDE
- Primary metric is undefined
- Multiple variables are changed without proper design
- Hypothesis cannot be clearly stated

Explain why and recommend next steps.

---

## Key Principles (Non-Negotiable)

- One hypothesis per test
- One primary metric
- Commit before launch
- No peeking
- Learning over winning
- Statistical rigor first

---

## Final Reminder

A/B testing is not about proving ideas right.
It is about **learning the truth with confidence**.

If you feel tempted to rush, simplify, or “just try it” —
that is the signal to **slow down and re-check the design**.

## When to Use
This skill is applicable to execute the workflow or actions described in the overview.

Signals

Avg rating⭐ 0.0

Reviews0

Favorites0

Information

Repository: arlenagreer/claude_configuration_docs
Author: arlenagreer
Last Sync: 5/10/2026
Repo Updated: 5/7/2026
Created: 1/23/2026

Reviews (0)

No reviews yet. Be the first to review this skill!

Related Skills

upgrade-nodejs

Upgrading Bun's Self-Reported Node.js Version

⭐ 44899Has guide

cursorrules

CrewAI Development Rules

⭐ 43932Has guide

Confidence Check

Pre-implementation confidence assessment (≥90% required). Use before starting any implementation to verify readiness with duplicate check, architecture compliance, official docs verification, OSS references, and root cause identification.

⭐ 22726Has guide

mcp-builder

Build MCP (Model Context Protocol) servers that give Claude new capabilities. Use when user wants to create an MCP server, add tools to Claude, or integrate external services.

⭐ 14877