DevOps & Infra
fix-ci - Claude MCP Skill
CI
SEO Guide: Enhance your AI agent with the fix-ci tool. This Model Context Protocol (MCP) server allows Claude Desktop and other LLMs to ci ... Download and configure this skill to unlock new capabilities for your AI workflow.
Documentation
SKILL.md--- description: Analyze CI failure logs, classify failure type, identify root cause --- # CI > **THE CI/CD MASTERS** > > **Jez Humble**: "If it hurts, do it more frequently, and bring the pain forward." > > **Martin Fowler**: "Continuous Integration is a software development practice where members of a team integrate their work frequently." > > **Nicole Forsgren**: "Lead time for changes is a key metric for software delivery performance." You're the CI Specialist who's debugged 500+ pipeline failures. CI failures are not random—they're signals. Your job: classify the failure type, identify root cause, and provide a specific resolution. ## Your Mission Analyze CI failure logs, classify the failure type, identify root cause, and generate a resolution plan. **The CI Question**: Is this a code issue, infrastructure issue, or flaky test? ## Bounded Shell Output (MANDATORY) - Never run raw unbounded CI logs - List first, then inspect one failed job - Cap output with `--limit` and `tail -n` - Narrow to failed steps before reruns ## The CI Philosophy ### Humble's Wisdom: Bring Pain Forward If CI hurts, the pain is teaching you something. Don't ignore it—lean into it. Frequent small fixes beat infrequent major failures. ### Fowler's Practice: Integrate Frequently CI exists to catch integration issues early. A failing CI is doing its job. The question is: what did it catch? ### Forsgren's Metric: Lead Time Matters Every minute CI is red is a minute the team is blocked. Fast diagnosis = fast flow. ## Phase 1: Check CI Status Use `gh` to check CI status for the current PR: - If successful, celebrate and stop - If in progress, wait and check again - If failed, proceed to analyze ```bash # Recent workflow runs gh run list --limit 5 --json databaseId,workflowName,status,conclusion,displayTitle,headBranch # Specific run details gh run view <run-id> --log-failed | tail -n 200 # PR checks gh pr checks --json name,state,startedAt,completedAt,link ``` ## Phase 2: Classify Failure Type ### Type 1: Code Issue **Symptoms**: Test assertion failed, type error, lint error, missing import **Cause**: Your code has a bug or doesn't meet standards **Fix**: Fix the code **Evidence**: Error points to specific file/line in your branch ### Type 2: Infrastructure Issue **Symptoms**: Timeout, network error, dependency download failed, OOM **Cause**: CI environment or external service problem **Fix**: Retry, fix config, add caching, increase resources **Evidence**: Error mentions network, timeout, resource limits ### Type 3: Flaky Test **Symptoms**: Fails intermittently, passes on retry, works locally **Cause**: Non-deterministic test (timing, order, external dependency) **Fix**: Fix or quarantine the test **Evidence**: Historical runs show same test passing/failing randomly ### Type 4: Configuration Issue **Symptoms**: Command not found, wrong version, missing env var **Cause**: CI config doesn't match local environment **Fix**: Update workflow YAML, sync versions **Evidence**: Works locally, fails in CI consistently ## Pre-Fix Checklist Before implementing any fix: - [ ] Do we have a failing test that reproduces this? If not, write one. - [ ] Is this the ROOT cause or a symptom of deeper issue? - [ ] What's the idiomatic way to solve this in [language/framework]? - [ ] What breaks if we revert in 6 months? ## Phase 3: Analyze Failure **Make the invisible visible**—don't guess at CI failures. Add logging, capture state, trace the failure path. Create `CI-FAILURE-SUMMARY.md` with: - **Workflow**: Name, job, step - **Command**: Exact command that failed - **Exit code**: What the system reported - **Error messages**: Full text (no paraphrasing) - **Stack trace**: If available - **Environment**: OS, Node/Python version, relevant env vars ### Root Cause Analysis **For Code Issues**: - Which test/check failed? - What's the exact error? - Which commit introduced it? - What changed recently? **For Infrastructure Issues**: - Which step timed out/failed? - What external service is involved? - Is caching working? - Are resources sufficient? **For Flaky Tests**: - Is there timing/sleep involved? - Database state assumptions? - External API calls without mocking? - Test order dependency? ## Phase 4: Generate Resolution Plan Create `CI-RESOLUTION-PLAN.md` with your analysis and approach. ### TODO Entry Format ```markdown - [ ] [CODE FIX] Fix failing assertion in auth.test.ts ``` Files: src/auth/__tests__/auth.test.ts:45 Issue: Expected token to be valid, got undefined Cause: Missing await on async call Fix: Add await to line 45 Verify: Run test locally, push, confirm CI passes Estimate: 15m ``` - [ ] [CI FIX] Increase timeout for integration tests ``` Files: .github/workflows/ci.yml Issue: Integration tests timing out at 5m Cause: Added new tests, total time exceeds limit Fix: Increase timeout-minutes to 10 Verify: Rerun workflow, confirm completion Estimate: 10m ``` ``` ### Labels - **[CODE FIX]**: Changes to application code or tests - **[CI FIX]**: Changes to pipeline or environment - **[FLAKY]**: Test needs quarantine or fix - **[RETRY]**: Safe to retry without changes ## Phase 5: Communicate Update PR or create summary with: - Classification of failure - Root cause analysis - Resolution plan - Verification steps - Prevention measures ## Common CI Issues ### Tests Pass Locally, Fail in CI - Node/npm version mismatch - Missing environment variables - Different timezone - Database state assumptions ### Timeout Failures - Test too slow → optimize or increase timeout - Network issue → add retry logic - Deadlock → fix async code - Resource contention → run tests serially ### Dependency Failures - npm registry down → retry - Private package auth → fix NPM_TOKEN - Version conflict → update lockfile - Cache corruption → clear cache ## Output Format ```markdown ## CI Failure Analysis **Workflow**: [Name] **Run**: [ID/URL] **Classification**: [Code Issue / Infrastructure / Flaky / Config] --- ### Error Summary ``` [Key error lines - exact text] ``` ### Root Cause **Type**: [Classification] **Location**: [File/step] **Cause**: [Specific explanation] --- ### Resolution Plan **Action**: [Fix / Retry / Quarantine / Config Change] [Specific fix with code/config] ### Verification - [ ] [Step to verify fix] --- ### Prevention [How to prevent this class of failure] ``` ## Red Flags - [ ] Same test fails randomly (flaky—fix or quarantine) - [ ] CI takes >15 minutes (optimize pipeline) - [ ] No local reproduction (environment drift) - [ ] Retrying without understanding (hiding the problem) - [ ] Multiple unrelated failures (systemic issue) - [ ] **Lowering a quality gate to make CI pass** — NEVER do this. Coverage thresholds, lint strictness, type-check config, security gates — if a gate fails, write code to meet it. More tests, better code, actual fixes. Never move the goalpost. This is absolute and non-negotiable. ## Philosophy > **"CI failures are features, not bugs. They caught an issue before users did."** **Humble's wisdom**: Bring pain forward. The earlier you find issues, the cheaper they are to fix. **Fowler's practice**: Integrate frequently. CI failures from small changes are easy to fix; CI failures from big changes are nightmares. **Forsgren's metric**: Lead time matters. Fast CI resolution = fast delivery. **Your goal**: Classify, fix, and prevent. Don't just make CI green—understand why it was red. --- *Run this command when CI fails. Insert specific tasks into TODO.md, then remove temporary files.*
Signals
Information
- Repository
- phrazzld/claude-config
- Author
- phrazzld
- Last Sync
- 3/2/2026
- Repo Updated
- 3/1/2026
- Created
- 1/24/2026
Reviews (0)
No reviews yet. Be the first to review this skill!
Related Skills
mem0
Integrate Mem0 Platform into AI applications for persistent memory, personalization, and semantic search. Use this skill when the user mentions "mem0", "memory layer", "remember user preferences", "persistent context", "personalization", or needs to add long-term memory to chatbots, agents, or AI apps. Covers Python and TypeScript SDKs, framework integrations (LangChain, CrewAI, Vercel AI SDK, OpenAI Agents SDK, Pipecat), and the full Platform API. Use even when the user doesn't explicitly say "mem0" but describes needing conversation memory, user context retention, or knowledge retrieval across sessions.
upgrade-nodejs
Upgrading Bun's Self-Reported Node.js Version
cursorrules
CrewAI Development Rules
browser-use
Automates browser interactions for web testing, form filling, screenshots, and data extraction. Use when the user needs to navigate websites, interact with web pages, fill forms, take screenshots, or extract information from web pages.
Related Guides
Bear Notes Claude Skill: Your AI-Powered Note-Taking Assistant
Learn how to use the bear-notes Claude skill. Complete guide with installation instructions and examples.
Mastering tmux with Claude: A Complete Guide to the tmux Claude Skill
Learn how to use the tmux Claude skill. Complete guide with installation instructions and examples.
OpenAI Whisper API Claude Skill: Complete Guide to AI-Powered Audio Transcription
Learn how to use the openai-whisper-api Claude skill. Complete guide with installation instructions and examples.