Cloud Resilience and Dependency Auditor — Find Your Single Points of Failure Before the Next Outage Does

Map your real cloud dependency tree — data plane and control plane — find the single points of failure ranked by blast radius, catch the classic traps like monitoring that dies with the region it watches, and get a prioritized resilience backlog. The audit every outage post-mortem says to do first.

Updated Jul 2026

Map data plane and control plane dependencies to find hidden couplings.
Rank infrastructure failure points by blast radius and recovery impact.
Identify fate-sharing risks in observability and deployment tooling.

Compatible with Cursor, Codex CLI, and Gemini CLI. No special runtime or cloud credentials required.

Security scanned

Instant install

$14 or 70 credits

30-day refund guarantee

Secure checkout via Stripe

Included in download

Map data plane and control plane dependencies to find hidden couplings.
Rank infrastructure failure points by blast radius and recovery impact.
file_read, file_write automation included
Ready for Cursor

See it in action

You say

Audit our resilience. We run on AWS with us-east-1 as primary and a warm standby in us-west-2. Core services: EKS for compute, RDS Postgres (primary in us-east-1), Secrets Manager in us-east-1, Auth0 for identity, Cloudflare for DNS and CDN, SQS for queues, and Datadog for observability. CI/CD is GitHub Actions. Map our real dependency tree, find the single points of failure ranked by blast radius, and give me a prioritized resilience backlog.

Your agent does

Dependency map (excerpt): Control plane — Auth0 (global identity), AWS Secrets Manager (us-east-1), Cloudflare DNS (global). Data plane — EKS, RDS Postgres, SQS. Shared-fate risk: Datadog and your recovery runbooks both live in us-east-1, the region most likely to be failing.

Ranked single-point-of-failure register:

Secrets Manager (us-east-1) — Blast radius: HIGH. Your us-west-2 standby reads secrets from us-east-1 to rotate credentials and restart pods, so it cannot recover while us-east-1 is down. Recover-while-down? No.
Auth0 (global) — Blast radius: HIGH. Universal gatekeeper; if identity is unavailable, no service authenticates, including your admin tooling. Recover-while-down? Partial.
Observability in us-east-1 — Blast radius: MEDIUM. You go blind during the exact outage you need to see. Recover-while-down? No.

Resilience backlog (blast radius x likelihood):

P1: Enable cross-region secret replication to us-west-2. Fix effort: ~M.
P1: Add a break-glass auth path independent of Auth0 for operators. Fix effort: ~M.
P2: Move observability/alerting to a region or provider that does not share fate with primary. Fix effort: ~L.

Accepted-risk register: Cloudflare DNS treated as accepted global dependency this quarter; revisit with secondary DNS provider next planning cycle.
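The "blast radius x likelihood" ordering used in the backlog above can be sketched as follows. This is a minimal illustration, not the skill's actual scoring: the numeric scales, the likelihood values, the tie-break rule, and the names `Finding` and `prioritize` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical ordinal scale; the listing does not publish numeric weights.
BLAST = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}
# Assumed tie-break: findings you cannot recover from while down sort first.
RECOVER_RANK = {"No": 0, "Partial": 1, "Yes": 2}


@dataclass(frozen=True)
class Finding:
    name: str
    blast_radius: str        # "LOW" | "MEDIUM" | "HIGH"
    likelihood: int          # assumed scale: 1 (rare) to 3 (likely)
    recover_while_down: str  # "No" | "Partial" | "Yes"

    @property
    def score(self) -> int:
        # Priority score = blast radius times likelihood.
        return BLAST[self.blast_radius] * self.likelihood


def prioritize(findings):
    """Sort by score (highest first), then by recoverability, then by name."""
    return sorted(
        findings,
        key=lambda f: (-f.score, RECOVER_RANK[f.recover_while_down], f.name),
    )


# Likelihood values here are invented for illustration.
findings = [
    Finding("Observability in us-east-1", "MEDIUM", 2, "No"),
    Finding("Auth0 (global)", "HIGH", 2, "Partial"),
    Finding("Secrets Manager (us-east-1)", "HIGH", 2, "No"),
]

ranked = prioritize(findings)
for f in ranked:
    print(f"{f.score:>2}  {f.name}  (recover while down: {f.recover_while_down})")
```

With these assumed inputs the two HIGH findings tie on score, and the one that cannot be recovered while down is listed first, which matches the order of the sample register.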

Name: Cloud Resilience and Dependency Auditor — Find Your Single Points of Failure Before the Next Outage Does
Price: 14 USD
Availability: InStock
Author: Agensi

Also available via Agensi MCP: your AI agent can load this skill on demand.

About This Skill

Every outage post-mortem recommends the same first step, and almost nobody has done it: map every critical service, API, and route your uptime depends on, in the control plane as well as the data plane. The dependency tree is no longer obvious from the invoice, and "we're multi-region" is not the same as "we're resilient," because what takes you down is usually a shared service (DNS, identity, a control plane) that is itself a single point of failure across all your regions.

Cloud Resilience and Dependency Auditor does the audit. Describe your stack: providers and regions, and the services you lean on for DNS, identity, CDN, queues, data, secrets, observability, and CI/CD. It maps the real dependency tree, including the control-plane and hidden shared-service couplings people miss, ranks the single points of failure by blast radius, and checks the classic traps: monitoring and recovery tooling that share fate with the environment that's failing (so you go blind exactly when you need clarity), auth as a universal gatekeeper, the default-region hub, and vendors whose dependency tree you inherit.

It returns a dependency map; a ranked single-point-of-failure register noting whether you could even recover while each is down; a resilience backlog prioritized by blast radius times likelihood, with the specific fix and rough effort for each; and an accepted-risk register for what to consciously live with. The download includes three reference files: the dependency-inventory worksheet, a single-point-of-failure pattern guide, and a worked sample audit.

It audits from what you describe. It doesn't scan your infrastructure or test failover, and a plan you haven't rehearsed is a hypothesis. Works with Claude Code, Cursor, Codex CLI, Gemini CLI, and any SKILL.md agent.
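To make the idea of a dependency map concrete, here is a minimal sketch of how a described stack could be modeled as a graph and queried for blast radius and cross-region couplings. It is an illustration under stated assumptions, not the skill's implementation: the `STACK` contents, the service names, the `"global"` region label, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical stack description: service -> (region, direct dependencies).
STACK = {
    "eks-east": ("us-east-1", ["secrets", "auth0", "rds-east"]),
    "eks-west": ("us-west-2", ["secrets", "auth0"]),
    "rds-east": ("us-east-1", []),
    "secrets": ("us-east-1", []),
    "auth0": ("global", []),
    "cloudflare-dns": ("global", []),
}


def dependents(stack, target):
    """Return every service that depends on target, directly or transitively."""
    reverse = defaultdict(set)
    for svc, (_, deps) in stack.items():
        for dep in deps:
            reverse[dep].add(svc)
    seen, todo = set(), [target]
    while todo:
        current = todo.pop()
        for svc in reverse[current]:
            if svc not in seen:
                seen.add(svc)
                todo.append(svc)
    return seen


def cross_region_couplings(stack):
    """List (service, dependency) pairs where the dependency is pinned
    to a single region different from the service's own region."""
    pairs = []
    for svc, (region, deps) in stack.items():
        for dep in deps:
            dep_region = stack[dep][0]
            if dep_region != "global" and dep_region != region:
                pairs.append((svc, dep))
    return sorted(pairs)


print(sorted(dependents(STACK, "secrets")))
print(cross_region_couplings(STACK))
```

In this toy stack the standby compute in us-west-2 depends on a secrets store pinned to us-east-1, which is the kind of hidden coupling the audit is meant to surface. Dependencies labeled `"global"` are skipped by the cross-region check but still show up in `dependents`, since a shared global service can be a single point of failure for every region.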

Use Cases

Map data plane and control plane dependencies to find hidden couplings.
Rank infrastructure failure points by blast radius and recovery impact.
Identify fate-sharing risks in observability and deployment tooling.
Develop a prioritized backlog of resilience improvements and failover plans.

Known Limitations

This is an audit that works from the stack you describe — it does not scan live infrastructure, auto-discover resources, connect to your cloud accounts, or test failover. Its output is only as complete as the description you provide, so services you leave out won't be assessed. It identifies single points of failure and prescribes prioritized fixes, but it cannot guarantee uptime or resilience: a recovery plan you haven't actually rehearsed is a hypothesis, not a proven capability. Findings should be validated with real game-day and failover testing.

How to install

Drop the file into your AI tool. Works with Claude, Cursor, ChatGPT, and 20+ more.

Reviews

No reviews yet - be the first to share your experience.

Only users who have downloaded or purchased this skill can leave a review.

Early access skill

Works with any agent that supports the open SKILL.md stan…

Security Scanned

Passed automated security review

Permissions

Read Files

Write Files

File Scopes

references/**

This skill only reads the stack description and reference files you provide and writes its audit outputs (dependency map, single-point-of-failure register, resilience backlog, and accepted-risk register) back as local files. It does not use a terminal, browser, network access, or environment variables, and it never connects to your cloud accounts or scans live infrastructure. File scope references/** covers the three bundled reference files: dependency-inventory-worksheet.md, spof-pattern-guide.md, and sample-resilience-audit.md.