    chaos-engineering

    by Frank Brsrk

    Design rigorous chaos engineering experiments and resilience audits to verify production system reliability.

    Updated May 2026

    Free

    One-time purchase

    ⚡ Also available via Agensi MCP: your AI agent can load this skill on demand via MCP.

    Included in download

    • Downloadable skill package
    • Works with Cursor, Windsurf
    • Instant install

    See it in action

    Hypothesis: P99 latency for /checkout remains <1.2s while the payment gateway responds slowly.
    Perturbation: Inject 300ms latency on the 'payment-v2' service for 5% of traffic for 10 minutes.
    Abort Condition: Error rate > 2% for 120s.
    Targeted Amplifier: Retry storm and thread-pool exhaustion.
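
A minimal sketch of how the abort condition in the example above could be evaluated, assuming error-rate samples arrive as (timestamp in seconds, error rate) pairs in time order. The names `AbortCondition` and `should_abort` are hypothetical and are not part of the skill package; "for 120s" is interpreted here as a continuous breach, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortCondition:
    """Abort when the error rate stays above `threshold` for `sustain_s` seconds."""
    threshold: float  # 0.02 means 2%
    sustain_s: float  # 120 in the example above


def should_abort(samples, cond):
    """Return True once the error rate has been continuously above the
    threshold for at least cond.sustain_s seconds.

    samples: iterable of (timestamp_s, error_rate) pairs in time order.
    """
    breach_start = None
    for ts, rate in samples:
        if rate > cond.threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            if ts - breach_start >= cond.sustain_s:
                return True
        else:
            breach_start = None  # any recovery resets the clock
    return False


cond = AbortCondition(threshold=0.02, sustain_s=120)

# A short spike that recovers at t=90 does not trigger the abort.
spike = [(0, 0.01), (30, 0.05), (60, 0.05), (90, 0.01), (120, 0.05)]
# A breach that lasts from t=0 to t=120 does.
sustained = [(0, 0.03), (60, 0.04), (120, 0.03)]

print(should_abort(spike, cond))      # False
print(should_abort(sustained, cond))  # True
```

Requiring a sustained breach rather than a single bad sample keeps a transient blip from ending the experiment early, while still bounding how long users are exposed to elevated errors.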

    About This Skill

    The Science of Controlled Failure

    Moving beyond generic checklists, this skill transforms your AI agent into a senior Chaos Engineer. It addresses the fundamental problem of "theoretical resilience" by replacing vague recommendations with falsifiable, evidence-based experiments. Instead of suggesting you "add retries," it helps you design the exact stress test needed to prove your system won't collapse under a retry storm.

    What it does

    • Experiment Design: Drafts specific chaos experiments with measurable hypotheses, single-variable perturbations, and defined blast radii.
    • Resilience Auditing: Identifies hidden architectural amplifiers like thundering herds, gray failures, and synchronized backoffs.
    • Operational Rigor: Defines the human roles (Lead, Observer, Abort Authority) and readiness flags required to run experiments safely in production.
    • Post-Mortem Conversion: Analyzes past incidents to create "never again" experiments that verify fixes.
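
To make two of the amplifiers named above concrete, here is a rough sketch of retry amplification across layers and of synchronized versus jittered backoff. The function names and the 0.1 s base delay are illustrative assumptions, not taken from the skill itself.

```python
import random


def worst_case_attempts(attempts_per_layer, layers):
    """Upper bound on calls reaching the deepest dependency for one user
    request, if every layer independently makes up to `attempts_per_layer`
    attempts (the first try plus its retries)."""
    return attempts_per_layer ** layers


def fixed_backoff(attempt, base=0.1):
    """Deterministic exponential backoff: every client that failed at the
    same moment waits the same time, so their retries arrive together."""
    return base * 2 ** attempt


def full_jitter_backoff(attempt, base=0.1, rng=random):
    """Exponential backoff with full jitter: each client waits a random
    time between 0 and the exponential cap, spreading retries out."""
    return rng.uniform(0, base * 2 ** attempt)


# Three layers, each trying up to three times.
print(worst_case_attempts(3, 3))  # 27

# 100 clients that failed together, all on their third retry (attempt=2).
rng = random.Random(0)  # seeded only to make the sketch repeatable
fixed = {fixed_backoff(2) for _ in range(100)}
jittered = {full_jitter_backoff(2, rng=rng) for _ in range(100)}
print(len(fixed))     # 1: every client retries at the same instant
print(len(jittered))  # many distinct delays
```

An experiment that targets these amplifiers would measure how close the real system gets to the worst-case bound, rather than assuming the retry policy behaves as documented.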

    Why use this skill?

    Standard AI prompting often produces "best practice" lists that are difficult to act on. This skill enforces a rigorous four-phase procedure (Hypothesize, Perturb, Minimize, Learn) that treats infrastructure as a laboratory. It focuses on tail risk (P99/P99.9) rather than averages, so that your systems are hardened against the worst-case scenarios that actually cause outages.
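
As a small worked illustration of why the tail matters more than the average, the sketch below compares the mean and P99 of a made-up latency sample. The data and the helper `percentile_nearest_rank` are assumptions for illustration; the nearest-rank definition is one common way to compute a percentile, not necessarily what a given monitoring tool uses.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [100] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p99_ms = percentile_nearest_rank(latencies_ms, 99)

print(mean_ms)  # 198.0
print(p50_ms)   # 100
print(p99_ms)   # 5000
```

Here the mean of 198 ms looks healthy, yet one request in fifty takes 5 s, which a hypothesis stated in terms of P99 would catch and an average would hide.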

    Use Cases

    • Design controlled fault-injection experiments for production environments.
    • Identify single points of failure in distributed microservices architectures.
    • Plan high-stakes 'Game Day' simulations for engineering teams.
    • Audit architecture for 'gray failures' and hidden system-coupling amplifiers.
    • Specify measurable safety bounds and abort conditions for reliability tests.

    Security Scanned

    Passed automated security review

    Permissions

    No special permissions declared or detected

    Works with any SKILL.md-compatible agent: Claude Code, Cursor, Windsurf, Codex CLI, Gemini CLI, GitHub Copilot. No external API key required. No Node setup. No MCP configuration.

    Creator

    Building Ejentum, a cognitive harness API for AI agents: small structured pieces of context retrieved at inference time, so the agent reasons through a task instead of pattern-matching to a generic answer. Adversarial systems thinking is the other thing I'm into: chaos engineering, pre-mortems, blast-radius design. Those skills sit alongside the Ejentum harness wrappers on this profile. Solo builder; I open-source most of what I make.
