llm-evaluation - Claude MCP Skill
LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo.

Invoke when:
- Setting up prompt evaluation or regression testing
- Integrating LLM testing into CI/CD pipelines
- Configuring security testing (red teaming, jailbreaks)
- Comparing prompt or model performance
- Building evaluation suites for RAG, factuality, or safety

Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing
Documentation
SKILL.md

# LLM Evaluation & Testing
Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.
## Quick Start
```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
# Run security scan
npx promptfoo@latest redteam run
```
## Core Concepts
### Why Evaluate?
LLM outputs are non-deterministic. "It looks good" isn't testing. You need:
- **Regression detection**: Catch quality drops before production
- **Security scanning**: Find jailbreaks, injections, PII leaks
- **A/B comparison**: Compare prompts/models side-by-side
- **CI/CD gates**: Block bad changes from merging
### Evaluation Types
| Type | Purpose | Assertions |
|------|---------|------------|
| **Functional** | Does it work? | `contains`, `equals`, `is-json` |
| **Semantic** | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| **Performance** | Is it fast/cheap? | `cost`, `latency` |
| **Security** | Is it safe? | `redteam`, `moderation`, `pii-detection` |
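
These categories compose: a single test case can stack assertions from several rows at once. A minimal sketch (the question, thresholds, and reference answer are illustrative, not from the skill's templates):

```yaml
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains        # functional: output names the city
        value: "Paris"
      - type: similar         # semantic: close to a reference answer
        value: "The capital of France is Paris"
        threshold: 0.8
      - type: latency         # performance: respond within 2 seconds
        threshold: 2000
      - type: moderation      # security: no flagged content
```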
## Configuration
### Basic promptfooconfig.yaml
```yaml
description: "My LLM evaluation suite"

prompts:
  - file://prompts/main.txt

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: cost
        threshold: 0.01
  - vars:
      question: "Explain quantum computing"
    assert:
      - type: llm-rubric
        value: "Response explains quantum computing concepts clearly"
      - type: latency
        threshold: 3000
```
### With Environment Variables
```yaml
providers:
  - id: openrouter:anthropic/claude-3-5-sonnet
    config:
      apiKey: ${OPENROUTER_API_KEY}
```
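
Promptfoo expands `${OPENROUTER_API_KEY}` from the environment at run time, so the key never needs to live in the config file. Assuming the variable is exported in your shell (or injected as a CI secret):

```bash
# Key shown is a placeholder; use your real key or a CI secret store
export OPENROUTER_API_KEY="sk-or-..."
npx promptfoo@latest eval
```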
## Assertions Reference
### Basic Assertions
```yaml
assert:
  # String matching
  - type: contains
    value: "expected text"
  - type: not-contains
    value: "forbidden text"
  - type: equals
    value: "exact match"
  - type: starts-with
    value: "prefix"
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}" # Date pattern

  # JSON validation (is-json optionally validates against a JSON schema)
  - type: is-json
  - type: is-json
    value:
      type: object
      properties:
        name: { type: string }
      required: [name]
```
### Semantic Assertions
```yaml
assert:
  # Semantic similarity (embeddings)
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8 # 0-1 similarity score

  # LLM-as-judge with custom criteria
  - type: llm-rubric
    value: |
      Response must:
      1. Be factually accurate
      2. Be under 100 words
      3. Not contain marketing language

  # Factuality check against reference
  - type: factuality
    value: "Paris is the capital of France"
```
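
Promptfoo also offers an `answer-relevance` model-graded assertion for catching answers that are fluent but off-topic; the threshold below is illustrative:

```yaml
assert:
  - type: answer-relevance
    threshold: 0.7 # 0-1 relevance to the original query; tune per use case
```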
### Performance Assertions
```yaml
assert:
  # Cost budget (USD)
  - type: cost
    threshold: 0.05 # Max $0.05 per request

  # Latency (milliseconds)
  - type: latency
    threshold: 2000 # Max 2 seconds
```
### Security Assertions
```yaml
assert:
  # Content moderation
  - type: moderation
    value: violence

  # PII detection
  - type: not-contains
    value: "{{email}}" # From test vars
```
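
The `{{email}}` value above comes from test vars, so a single test can plant a PII string in the input and assert it never leaks back out. A hypothetical example (the address and ticket wording are made up):

```yaml
tests:
  - vars:
      email: "jane.doe@example.com" # planted PII value (hypothetical)
      question: "Summarize the support ticket from jane.doe@example.com"
    assert:
      - type: not-contains
        value: "{{email}}" # output must not echo the address back
```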
## CI/CD Integration
### GitHub Action
```yaml
name: 'Prompt Evaluation'

on:
  pull_request:
    paths: ['prompts/**', 'src/**/*prompt*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      # Cache for faster runs
      - uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

      # Run evaluation and post results to PR
      - uses: promptfoo/promptfoo-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }} # Or other provider keys
```
### Quality Gates
```yaml
# promptfooconfig.yaml
evaluateOptions:
  # Limit concurrent requests to the provider
  maxConcurrency: 5
  # Require a minimum pass rate instead of failing on any single assertion
  threshold: 0.9 # 90% of tests must pass
```
### Output to JSON (for custom CI)
```bash
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Check results in CI script
if [ "$(jq '.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
```
## Security Testing (Red Team)
### Quick Scan
```bash
# Run red team against your prompts
npx promptfoo@latest redteam run
# Generate compliance report
npx promptfoo@latest redteam report --output compliance.html
```
### Configuration
```yaml
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot"
  plugins:
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
    - jailbreak
    - prompt-injection
  strategies:
    - jailbreak
    - prompt-injection
    - base64
    - leetspeak
```
### OWASP Top 10 Coverage
```yaml
redteam:
  plugins:
    # 1. Prompt Injection
    - prompt-injection
    # 2. Insecure Output Handling
    - harmful:privacy
    # 3. Training Data Poisoning (N/A for evals)
    # 4. Model Denial of Service
    - excessive-agency
    # 5. Supply Chain (N/A for evals)
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
    # 8. Excessive Agency
    - excessive-agency
    # 9. Overreliance (use factuality checks)
    # 10. Model Theft (N/A for evals)
```
## RAG Evaluation
### Context-Aware Testing
```yaml
prompts:
  - |
    Context: {{context}}
    Question: {{question}}
    Answer based only on the context provided.

tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      - type: contains
        value: "1889"
      - type: factuality
        value: "The Eiffel Tower was built in 1889"
      - type: not-contains
        value: "1900" # Common hallucination
```
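
Beyond string checks, promptfoo ships RAG-specific model-graded assertions such as `context-faithfulness`, `context-recall`, and `context-relevance`; check the promptfoo docs for the exact vars each grader expects. A sketch on the same test (threshold illustrative):

```yaml
assert:
  - type: context-faithfulness # grades whether the answer is grounded in the context var
    threshold: 0.8
```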
### Retrieval Quality
```yaml
# Test that retrieval returns relevant documents
tests:
  - vars:
      query: "Python list comprehension"
    assert:
      - type: llm-rubric
        value: "Response discusses Python list comprehension syntax and examples"
      - type: not-contains
        value: "I don't know" # Shouldn't punt on this query
```
## Comparing Models/Prompts
### A/B Testing
```yaml
# Compare two prompts
prompts:
  - file://prompts/v1.txt
  - file://prompts/v2.txt

# Same tests for both
tests:
  - vars: { question: "Explain recursion" }
    assert:
      - type: llm-rubric
        value: "Clear explanation of recursion with example"
```
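
Prompts can also be declared as objects with a `label`, which makes the side-by-side comparison view easier to read than raw file paths; the labels here are illustrative:

```yaml
prompts:
  - id: file://prompts/v1.txt
    label: "v1-baseline"
  - id: file://prompts/v2.txt
    label: "v2-candidate"
```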
### Model Comparison
```yaml
# Compare multiple models
providers:
- openai:gpt-4o-mini
- anthropic:claude-3-5-haiku-latest
- openrouter:google/gemini-flash-1.5
# Run: npx promptfoo@latest eval
# View: npx promptfoo@latest view
# Compare cost, latency, quality side-by-side
```
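
To hold every model to the same budget, promptfoo's `defaultTest` block applies shared assertions across all tests and providers (thresholds illustrative):

```yaml
defaultTest:
  assert:
    - type: cost
      threshold: 0.01 # every model must stay under $0.01 per request
    - type: latency
      threshold: 5000 # and answer within 5 seconds
```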
## Best Practices
### 1. Golden Test Cases
Maintain a set of critical test cases that must always pass:
```yaml
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
    options:
      critical: true # Fail entire suite if this fails
```
### 2. Regression Suite Structure
```
prompts/
├── production.txt       # Current production prompt
└── candidate.txt        # New prompt being tested
tests/
├── golden/              # Critical tests (run on every PR)
│   └── core-functionality.yaml
├── regression/          # Full regression suite (nightly)
│   └── full-suite.yaml
└── security/            # Red team tests
    └── redteam.yaml
```
### 3. Test Categories
```yaml
tests:
  # Happy path
  - description: "Standard query"
    vars: { question: "What is 2+2?" }
    assert:
      - type: contains
        value: "4"

  # Edge cases
  - description: "Empty input"
    vars: { question: "" }
    assert:
      - type: not-contains
        value: "error"

  # Adversarial
  - description: "Injection attempt"
    vars: { question: "Ignore previous instructions and..." }
    assert:
      - type: not-contains
        value: "Here's how to" # Should refuse
```
## References
- `references/promptfoo-guide.md` - Detailed setup and configuration
- `references/evaluation-metrics.md` - Metrics deep dive
- `references/ci-cd-integration.md` - CI/CD patterns
- `references/alternatives.md` - Braintrust, DeepEval, LangSmith comparison
## Templates
Copy-paste ready templates:
- `templates/promptfooconfig.yaml` - Basic config
- `templates/github-action-eval.yml` - GitHub Action
- `templates/regression-test-suite.yaml` - Full regression suite
Repository: phrazzld/claude-config (author: phrazzld)