Chaos Engineering
by Frank Brsrk
Design rigorous chaos engineering experiments and resilience audits to verify production system reliability.
- Design controlled fault-injection experiments for production environments.
- Identify single points of failure in distributed microservices architectures.
- Plan high-stakes 'Game Day' simulations for engineering teams.
Free
Included in download
- Downloadable skill package
- Works with Cursor, Windsurf
See it in action
You say
Our payment gateway has been slow lately. Can you design a chaos experiment to see how the checkout process handles a latency spike in the payment service?
Your agent does
Hypothesis: P99 latency for /checkout remains <1.2s during payment gateway latency.
Perturbation: Inject 300ms latency on the 'payment-v2' service for 5% of traffic for 10 mins.
Abort Condition: Error rate > 2% for 120s.
Targeted Amplifier: Retry storm and thread-pool exhaustion.
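An abort condition such as "Error rate > 2% for 120s" only helps if it can be checked mechanically while the experiment runs. The following is a minimal sketch of one way to evaluate it, assuming error-rate samples arrive as (timestamp, rate) pairs in time order. The names AbortCondition and should_abort are hypothetical and are not part of the skill or of any chaos tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class AbortCondition:
    """Abort when the error rate stays above `threshold` for `sustain_s` seconds."""
    threshold: float = 0.02   # 2% error rate, taken from the example above
    sustain_s: float = 120.0  # 120 s, taken from the example above


def should_abort(samples: Iterable[Tuple[float, float]], cond: AbortCondition) -> bool:
    """Return True once the error rate has been continuously above the
    threshold for at least `cond.sustain_s` seconds.

    `samples` holds (timestamp_seconds, error_rate) pairs in ascending
    time order. This input shape is an assumption of the sketch.
    """
    breach_start: Optional[float] = None
    for ts, error_rate in samples:
        if error_rate > cond.threshold:
            if breach_start is None:
                breach_start = ts          # first sample of a new breach
            if ts - breach_start >= cond.sustain_s:
                return True                # breach has lasted long enough
        else:
            breach_start = None            # recovery resets the timer
    return False


# A sustained breach from t=30 to t=150 trips the condition.
sustained = [(0, 0.01), (30, 0.03), (60, 0.03), (90, 0.03), (120, 0.03), (150, 0.03)]
# A brief recovery at t=90 resets the timer, so this run does not abort.
interrupted = [(0, 0.03), (60, 0.03), (90, 0.01), (120, 0.03), (200, 0.03)]
print(should_abort(sustained, AbortCondition()))
print(should_abort(interrupted, AbortCondition()))
```

Resetting the timer on any recovery is a deliberate choice in this sketch; a stricter team might instead abort on the cumulative time spent above the threshold.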
About This Skill
The Science of Controlled Failure
Moving beyond generic checklists, this skill transforms your AI agent into a senior Chaos Engineer. It addresses the fundamental problem of "theoretical resilience" by replacing vague recommendations with falsifiable, evidence-based experiments. Instead of suggesting you "add retries," it helps you design the exact stress test needed to prove your system won't collapse under a retry storm.
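To see why "add retries" needs a stress test, consider the worst case for stacked retries: if every layer in a call chain makes the same number of attempts and every call fails, the attempts multiply at each layer. The sketch below works through that arithmetic. The function name and the three-layer example are illustrative assumptions, and the figure is an upper bound that ignores retry budgets, backoff, and circuit breakers.

```python
def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Upper bound on calls reaching the deepest dependency for one user
    request, assuming every layer makes `attempts_per_layer` attempts
    (1 initial call + retries) and every attempt fails.
    """
    if attempts_per_layer < 1 or layers < 1:
        raise ValueError("attempts_per_layer and layers must be >= 1")
    return attempts_per_layer ** layers


# Hypothetical chain: gateway -> checkout -> payment, each with
# 1 initial call + 2 retries = 3 attempts.
print(worst_case_attempts(attempts_per_layer=3, layers=3))  # 3 * 3 * 3 = 27
```

A chaos experiment can then test whether the real system stays well below this bound when the deepest dependency is degraded.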
What it does
- Experiment Design: Drafts specific chaos experiments with measurable hypotheses, single-variable perturbations, and defined blast radii.
- Resilience Auditing: Identifies hidden architectural amplifiers like thundering herds, gray failures, and synchronized backoffs.
- Operational Rigor: Defines the human roles (Lead, Observer, Abort Authority) and readiness flags required to run experiments safely in production.
- Post-Mortem Conversion: Analyzes past incidents to create "never again" experiments that verify fixes.
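The elements listed above (hypothesis, single-variable perturbation, blast radius, human roles) can be captured in a small record and checked for completeness before anyone touches production. This is a sketch only: the ChaosExperiment class and readiness_gaps function are hypothetical names, not the skill's actual output format, and the people assigned to roles are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Role names taken from the "Operational Rigor" item above.
REQUIRED_ROLES = ("Lead", "Observer", "Abort Authority")


@dataclass
class ChaosExperiment:
    hypothesis: str
    perturbation: str
    blast_radius: str
    abort_condition: str
    roles: Dict[str, str] = field(default_factory=dict)  # role -> person


def readiness_gaps(exp: ChaosExperiment) -> List[str]:
    """List everything that is still missing before the experiment may run."""
    gaps = []
    for name in ("hypothesis", "perturbation", "blast_radius", "abort_condition"):
        if not getattr(exp, name).strip():
            gaps.append(f"missing {name}")
    for role in REQUIRED_ROLES:
        if not exp.roles.get(role, "").strip():
            gaps.append(f"unassigned role: {role}")
    return gaps


# Values below reuse the payment-latency example; the names are placeholders.
plan = ChaosExperiment(
    hypothesis="P99 latency for /checkout remains <1.2s",
    perturbation="Inject 300ms latency on payment-v2",
    blast_radius="5% of traffic for 10 mins",
    abort_condition="Error rate > 2% for 120s",
    roles={"Lead": "alice", "Abort Authority": "bob"},
)
print(readiness_gaps(plan))  # the Observer role has not been assigned yet
```

An empty list from readiness_gaps is one possible form of the "readiness flag" mentioned above.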
Why use this skill?
Standard AI prompting often results in "best practice" lists that are hard to act on. This skill enforces a rigorous four-phase procedure (Hypothesize, Perturb, Minimize, Learn) that treats infrastructure as a laboratory. It focuses on tail risk (P99/P99.9) rather than averages, so that hardening targets the worst-case scenarios that actually cause outages.
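The difference between tail latency and the average is easy to show with numbers. The sketch below uses made-up latency data and a nearest-rank percentile; the percentile helper is a hypothetical function written for this illustration, and real telemetry systems may use other percentile definitions.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up sample: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [100] * 98 + [3000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                       # 158.0, looks healthy
print(percentile(latencies_ms, 50))  # 100, the median hides the problem too
print(percentile(latencies_ms, 99))  # 3000, the tail exposes it
```

In this sample the mean would pass a 1.2 s budget comfortably while the P99 fails it, which is why the steady-state hypothesis is stated in terms of the tail.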
Use Cases
- Design controlled fault-injection experiments for production environments.
- Identify single points of failure in distributed microservices architectures.
- Plan high-stakes 'Game Day' simulations for engineering teams.
- Audit architecture for 'gray failures' and hidden system-coupling amplifiers.
- Specify measurable safety bounds and abort conditions for reliability tests.
Known Limitations
- Planner only: the skill designs experiments but does not execute them. You run the experiments using your own tools (Gremlin, Litmus, Chaos Mesh, AWS FIS, custom tooling).
- Garbage in, garbage out on system context: the agent does not know your specific architecture. You describe the system and dependencies; the agent designs experiments against what you describe. Undocumented dependencies will not be caught.
- Best for systems with observable telemetry. Architectures lacking dashboards, P99 latency tracking, or error-rate alerting will hit a wall at the steady-state hypothesis phase.
- Not a substitute for post-mortem culture. The skill plans experiments and learns from outcomes; it does not run retrospectives or write incident reports.
- Single-experiment focus: the skill designs one experiment at a time. Continuous chaos automation strategy (Chaos Monkey-style ongoing fleet experiments) requires additional tooling and program design beyond what the skill teaches.
- Vocabulary assumes mainstream distributed-systems patterns (Kubernetes, cloud, microservices, retries, circuit breakers). Less directly applicable to highly proprietary or unusual architectures without translation.
How to Install
mkdir -p ~/.claude/skills && curl -sL https://www.agensi.io/api/install/chaos-engineering -o /tmp/chaos-engineering.zip && unzip -o /tmp/chaos-engineering.zip -d ~/.claude/skills && rm /tmp/chaos-engineering.zip
Free skills install directly. Paid skills require purchase; use the download button above after buying.
Security Scanned
Passed automated security review
Permissions
No special permissions declared or detected
Works with any SKILL.md-compatible agent: Claude Code, Cursor, Windsurf, Codex CLI, Gemini CLI, GitHub Copilot. No external API key required. No Node setup. No MCP configuration.
Creator
Building Ejentum, a cognitive harness API for AI agents. Small structured pieces of context retrieved at inference time, so the agent reasons through a task instead of pattern-matching to a generic answer. Adversarial systems thinking is the other thing I'm into: chaos engineering, pre-mortems, blast-radius design. Those skills sit alongside the Ejentum harness wrappers on this profile. Solo builder, open source most of what I make.