Cloud Resilience and Dependency Auditor — Find Your Single Points of Failure Before the Next Outage Does
Map your real cloud dependency tree — data plane and control plane — find the single points of failure ranked by blast radius, catch the classic traps like monitoring that dies with the region it watches, and get a prioritized resilience backlog. The audit every outage post-mortem says to do first.
- Map data plane and control plane dependencies to find hidden couplings.
- Rank infrastructure failure points by blast radius and recovery impact.
- Identify fate-sharing risks in observability and deployment tooling.
$14 · or 70 credits
Secure checkout via Stripe
Included in download
- Map data plane and control plane dependencies to find hidden couplings.
- Rank infrastructure failure points by blast radius and recovery impact.
- file_read, file_write automation included
- Ready for Cursor
See it in action
You say
Audit our resilience. We run on AWS with us-east-1 as primary and a warm standby in us-west-2. Core services: EKS for compute, RDS Postgres (primary in us-east-1), Secrets Manager in us-east-1, Auth0 for identity, Cloudflare for DNS and CDN, SQS for queues, and Datadog for observability. CI/CD is GitHub Actions. Map our real dependency tree, find the single points of failure ranked by blast radius, and give me a prioritized resilience backlog.
Your agent does
Dependency map (excerpt): Control plane — Auth0 (global identity), AWS Secrets Manager (us-east-1), Cloudflare DNS (global). Data plane — EKS, RDS Postgres, SQS. Shared-fate risk: Datadog and your recovery runbooks both live in us-east-1, the region most likely to be failing.
Ranked single-point-of-failure register:
- Secrets Manager (us-east-1) — Blast radius: HIGH. Your us-west-2 standby reads secrets from us-east-1 to rotate credentials and restart pods, so it cannot recover while us-east-1 is down. Recover-while-down? No.
- Auth0 (global) — Blast radius: HIGH. Universal gatekeeper; if identity is unavailable, no service authenticates, including your admin tooling. Recover-while-down? Partial.
- Observability in us-east-1 — Blast radius: MEDIUM. You go blind during the exact outage you need to see. Recover-while-down? No.
Resilience backlog (blast radius × likelihood):
- P1: Enable cross-region secret replication to us-west-2. Fix effort: ~M.
- P1: Add a break-glass auth path independent of Auth0 for operators. Fix effort: ~M.
- P2: Move observability/alerting to a region or provider that does not share fate with primary. Fix effort: ~L.
Accepted-risk register: Cloudflare DNS treated as accepted global dependency this quarter; revisit with secondary DNS provider next planning cycle.
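The backlog above is ordered by blast radius times likelihood. A minimal Python sketch of that ordering follows. It is an illustration under stated assumptions, not the skill's actual scoring: the numeric scales, the `Finding` class, the `prioritize` helper, and every likelihood value are hypothetical, since the sample output shows only blast-radius labels.

```python
from dataclasses import dataclass

# Hypothetical numeric scales: the sample output only uses labels such as HIGH.
BLAST_RADIUS = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}
LIKELIHOOD = {"RARE": 1, "OCCASIONAL": 2, "LIKELY": 3}


@dataclass(frozen=True)
class Finding:
    name: str
    blast_radius: str  # label taken from the sample register
    likelihood: str    # assumed value, not stated in the sample output

    @property
    def score(self) -> int:
        # Priority score = blast radius x likelihood.
        return BLAST_RADIUS[self.blast_radius] * LIKELIHOOD[self.likelihood]


def prioritize(findings):
    """Sort findings by descending score; ties keep their input order."""
    return sorted(findings, key=lambda f: f.score, reverse=True)


findings = [
    Finding("Observability in us-east-1", "MEDIUM", "OCCASIONAL"),
    Finding("Secrets Manager (us-east-1)", "HIGH", "OCCASIONAL"),
    Finding("Auth0 (global)", "HIGH", "OCCASIONAL"),
]

backlog = prioritize(findings)
for f in backlog:
    print(f"{f.score:>2}  {f.name}")
```

With these assumed likelihoods, the two HIGH findings score 6 and land ahead of the MEDIUM finding at 4, which matches the P1/P1/P2 split in the sample backlog.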
About This Skill
Every outage post-mortem recommends the same first step, and almost nobody has done it: map every critical service, API, and route your uptime depends on, covering the control plane as well as the data plane. The dependency tree is no longer obvious from the invoice, and "we're multi-region" is not the same as "we're resilient": what takes you down is usually a shared service (DNS, identity, a control plane) that is itself a single point of failure across all your regions.
Cloud Resilience and Dependency Auditor does the audit. Describe your stack (providers and regions, plus the services you lean on for DNS, identity, CDN, queues, data, secrets, observability, and CI/CD) and it maps the real dependency tree, including the control-plane and hidden shared-service couplings people miss. It ranks the single points of failure by blast radius and checks the classic traps: monitoring and recovery tooling that share fate with the failing environment (so you go blind exactly when you need clarity), auth as a universal gatekeeper, the default-region hub, and vendors whose dependency tree you inherit.
It returns four outputs: a dependency map; a ranked single-point-of-failure register noting whether you could even recover while each is down; a resilience backlog prioritized by blast radius times likelihood, with the specific fix and rough effort for each item; and an accepted-risk register for what to consciously live with. The download includes three reference files: the dependency-inventory worksheet, a single-point-of-failure pattern guide, and a worked sample audit.
It audits from what you describe. It doesn't scan your infrastructure or test failover, and a plan you haven't rehearsed is a hypothesis. Works with Claude Code, Cursor, Codex CLI, Gemini CLI, and any SKILL.md agent.
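One way to make "ranked by blast radius" concrete is to treat the stack as a dependency graph and count how many components transitively depend on each node. The Python sketch below does that. It is an assumption about how such a ranking could be computed, not a description of the skill's internals. The `DEPENDS_ON` edges, the node names (including `public-api`), and both helper functions are hypothetical.

```python
from collections import deque

# Hypothetical dependency edges: each key depends on every item in its list.
DEPENDS_ON = {
    "standby-us-west-2": ["secrets-manager-us-east-1", "auth0", "cloudflare-dns"],
    "eks-us-east-1": ["secrets-manager-us-east-1", "rds-us-east-1", "sqs-us-east-1"],
    "admin-tooling": ["auth0"],
    "public-api": ["eks-us-east-1", "auth0", "cloudflare-dns"],
}


def blast_radius(depends_on, failed):
    """Return every node that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(failed)  # a cycle must not count the failed node itself
    return seen


def rank_by_blast_radius(depends_on):
    """Rank all nodes by number of transitive dependents, largest first."""
    nodes = set(depends_on) | {d for deps in depends_on.values() for d in deps}
    sizes = {n: len(blast_radius(depends_on, n)) for n in nodes}
    return sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0]))


for name, size in rank_by_blast_radius(DEPENDS_ON):
    print(f"{size}  {name}")
```

In this toy graph the shared services (identity and secrets) rank highest even though neither runs a workload. A raw dependent count is only a proxy: a real ranking would also weigh how critical each dependent is.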
Use Cases
- Map data plane and control plane dependencies to find hidden couplings.
- Rank infrastructure failure points by blast radius and recovery impact.
- Identify fate-sharing risks in observability and deployment tooling.
- Develop a prioritized backlog of resilience improvements and failover plans.
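The fate-sharing use case can be sketched as a simple inventory check: flag any recovery-critical component hosted in the same location as the environment assumed to be failing. The Python below is a minimal illustration under assumptions. The `Component` class, the role names, the `shared_fate` helper, and the `github` location label are hypothetical. The placement of Datadog and the runbooks in us-east-1 follows the sample output on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    role: str      # hypothetical role label, e.g. "workload" or "observability"
    location: str  # region or provider, exactly as the stack description states it


# Roles you need working in order to see and fix an outage (assumed set).
RECOVERY_ROLES = {"observability", "runbooks", "ci_cd"}


def shared_fate(components, failing_location):
    """Return recovery-critical components hosted in the failing location."""
    return [
        c for c in components
        if c.role in RECOVERY_ROLES and c.location == failing_location
    ]


inventory = [
    Component("eks", "workload", "us-east-1"),
    Component("datadog", "observability", "us-east-1"),
    Component("runbooks", "runbooks", "us-east-1"),
    Component("github-actions", "ci_cd", "github"),
]

at_risk = shared_fate(inventory, "us-east-1")
for c in at_risk:
    print(f"{c.name} shares fate with us-east-1")
```

The check only compares declared locations, so it inherits the same limit as the audit itself: a component left out of the inventory, or a vendor's own hidden regional dependency, will not be flagged.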
Known Limitations
This is an audit that works from the stack you describe — it does not scan live infrastructure, auto-discover resources, connect to your cloud accounts, or test failover. Its output is only as complete as the description you provide, so services you leave out won't be assessed. It identifies single points of failure and prescribes prioritized fixes, but it cannot guarantee uptime or resilience: a recovery plan you haven't actually rehearsed is a hypothesis, not a proven capability. Findings should be validated with real game-day and failover testing.
How to install
Drop the file into your AI tool. Works with Claude, Cursor, ChatGPT, and 20+ more.
Security Scanned
Passed automated security review
Permissions
File Scopes
This skill only reads the stack description and reference files you provide and writes its audit outputs (dependency map, single-point-of-failure register, resilience backlog, and accepted-risk register) back as local files. It does not use a terminal, browser, network access, or environment variables, and it never connects to your cloud accounts or scans live infrastructure. File scope references/** covers the three bundled reference files: dependency-inventory-worksheet.md, spof-pattern-guide.md, and sample-resilience-audit.md.
Tags
Works with any agent that supports the open SKILL.md standard — including Claude Code, Cursor, Codex CLI, and Gemini CLI. No special runtime, cloud credentials, or network access required; the skill only needs to read the stack description and reference files you provide and write its audit outputs as local files.