Observability SLO Architect

Design SLOs/SLIs for any service. Generates Prometheus rules, Grafana dashboards, burn-rate alerts. Google's 2%/50% model.

Updated Jun 2026

Stop fighting fires without measuring. Define SLOs/SLIs, calculate error budgets, generate Prometheus rules + Grafana dashboards + PagerDuty alerts + runbook templates. With 8-framework SLO selection guide and burn-rate alerts.

Security scanned · Instant install

Free

Included in download

Downloadable skill package
1 permission declared

Kaymue


Name: Observability SLO Architect
Availability: InStock
Author: Agensi


0 installs


⚡ Also available via Agensi MCP - your AI agent can load this skill on demand via MCP. Learn more →


About This Skill

# Observability SLO Architect

You don't have a reliability problem, you have a measurement problem. This skill turns "is the service up?" into quantifiable SLOs, error budgets, and alerts that wake the right people at the right time.

## What it does

End-to-end SLO engineering:

- **SLI selection**: 12 indicators across availability, latency, throughput, durability, correctness
- **SLO framework**: pick the right SLO for your service (8 patterns)
- **Error budget calculation**: convert an SLO to a downtime budget per window
- **Burn-rate alerts**: fast burn (1h, 6h) vs slow burn (24h, 3d)
- **Prometheus rules**: ready-to-use recording + alerting rules
- **Grafana dashboards**: SLO compliance, burn rate, error budget remaining
- **PagerDuty / Opsgenie policies**: escalation + paging based on burn rate
- **Runbook templates**: what to do when an SLO is at risk
- **Review process**: monthly SLO review template

## When to use it

- You're launching a new service and need to define SLOs
- Your on-call is exhausted from alerts that don't matter
- Customers complain about reliability but you can't quantify it
- You want to introduce error budgets but don't know how
- Your SLOs are aspirational ("99.9%") but no one is held to them
- You need to migrate from "uptime" to SLO-based thinking

## Why it's better than ad-hoc prompting

Most "define SLOs" prompts give textbook answers. This skill is different:

- **8 specific SLO frameworks**: pick the right one for your service type
- **Ready-to-deploy rules**: Prometheus + Grafana, not pseudocode
- **Burn-rate alerts done right**: Google's 2% / 50% burn model
- **Concrete tradeoffs**: "user-facing latency" vs "internal availability"
- **Templates**: copy-paste runbooks, review docs

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│  Agent (Claude/Cursor)                                  │
│  - Reads service description                            │
│  - Asks about user journey, criticality                 │
│  - Generates SLO spec + rules + dashboards              │
└───────────────┬─────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│  skills/observability-slo-architect/                    │
│  scripts/                                               │
│  ├── design_slo.py      # SLI selection + SLO targets   │
│  ├── error_budget.py    # Budget calculation            │
│  ├── gen_prometheus.py  # Recording + alerting rules    │
│  ├── gen_grafana.py     # Dashboard JSON                │
│  ├── gen_pagerduty.py   # Escalation policy             │
│  ├── gen_runbook.py     # Markdown runbook              │
│  └── review.py          # Monthly SLO review template   │
│  references/                                            │
│  ├── 8-frameworks.md     # Which SLO pattern when       │
│  ├── sli-catalog.md      # 12 SLI definitions           │
│  ├── burn-rate-guide.md  # Fast vs slow burn            │
│  └── google-sre-book-notes.md                           │
│  templates/                                             │
│  ├── slo-spec.yaml                                      │
│  ├── prometheus-rules.yaml                              │
│  ├── grafana-dashboard.json                             │
│  ├── pagerduty-policy.yaml                              │
│  └── runbook.md.tmpl                                    │
└─────────────────────────────────────────────────────────┘
```

## Quick start

```bash
# 1. Install
pip install pyyaml

# 2. Design SLOs for a service
python scripts/design_slo.py --service "checkout-api" --user-facing --criticality high --out slo.yaml

# 3. Calculate error budget
python scripts/error_budget.py --slo 99.9 --window 30d

# 4. Generate Prometheus rules
python scripts/gen_prometheus.py --slo slo.yaml --out prometheus-rules.yaml

# 5. Generate Grafana dashboard
python scripts/gen_grafana.py --slo slo.yaml --out dashboard.json

# 6. Generate PagerDuty escalation
python scripts/gen_pagerduty.py --slo slo.yaml --out pagerduty.yaml

# 7. Generate runbook
python scripts/gen_runbook.py --slo slo.yaml --service "checkout-api" --out runbook.md

# 8. Monthly review template
python scripts/review.py --slo slo.yaml --month 2026-06 --out review-2026-06.md
```

## The 8 SLO frameworks (which to use)

| # | Service type | Primary SLO | Secondary SLO |
|---|--------------|-------------|---------------|
| 1 | **User-facing API** | Availability (99.9%) | Latency p99 (300ms) |
| 2 | **Read-heavy DB** | Query success rate | Read latency p95 |
| 3 | **Write-heavy DB** | Durability (zero loss) | Write latency p99 |
| 4 | **Async worker** | Throughput (events/min) | End-to-end latency p99 |
| 5 | **Batch job** | Job success rate | Duration p95 |
| 6 | **CDN** | Cache hit rate | Edge latency p95 |
| 7 | **Auth service** | Auth success rate | Token issuance latency |
| 8 | **Internal API** | Request success rate | Internal latency p95 |

## The 12 SLIs (catalog)

### Availability (3)

- `http_requests_total{status!~"5.."} / http_requests_total`
- `uptime_seconds / total_seconds`
- `successful_requests / total_requests`

### Latency (4)

- `histogram_quantile(0.99, ...)` (p99)
- `histogram_quantile(0.95, ...)` (p95)
- `histogram_quantile(0.50, ...)` (p50)
- `requests under SLO / total requests` (good / total)

### Throughput (2)

- `events_processed_total / time_window`
- `bytes_processed_total / time_window`

### Durability (2)

- `data_loss_bytes / total_bytes_stored`
- `successful_writes / total_write_attempts`

### Correctness (1)

- `correct_responses / total_responses` (where "correct" is defined per service)

## Burn-rate alerts (Google SRE model)

For a 30-day SLO with a 99.9% target (43.2 min budget):

| Alert | Window | Burn rate | Action | Page? |
|-------|--------|-----------|--------|-------|
| **Fast burn (1h)** | 1h | 14.4x | Page on-call | Yes |
| **Fast burn (6h)** | 6h | 6x | Page on-call | Yes |
| **Slow burn (24h)** | 24h | 3x | Slack | No |
| **Slow burn (3d)** | 3d | 1x | Slack | No |

The math: a burn rate of 1x spends exactly the whole budget over the 30-day (720h) window. At 14.4x sustained for 1h, you consume 14.4 * 1h / 720h = 2% of the 30-day budget. So this alert fires when 2% of the budget is consumed in 1h.

## Sample output (Prometheus rules)

```yaml
groups:
  - name: checkout-api-slo
    interval: 30s
    rules:
      # Recording: success ratio
      - record: slo:service_success_ratio:5m
        expr: |
          sum(rate(http_requests_total{service="checkout-api",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="checkout-api"}[5m]))

      # Recording: error budget remaining
      - record: slo:error_budget_remaining:30d
        expr: |
          1 - (
            (1 - avg_over_time(slo:service_success_ratio:5m[30d]))
            /
            (1 - 0.999)
          )

      # Fast burn (1h, 14.4x); error ratio averaged over 1h to match the alert window
      - alert: SLO_FastBurn_1h
        expr: |
          (1 - avg_over_time(slo:service_success_ratio:5m[1h])) > (1 - 0.999) * 14.4
        for: 2m
        labels:
          severity: critical
          slo: checkout-api
        annotations:
          summary: "checkout-api SLO burning fast (1h)"
          runbook: "https://wiki/runbooks/checkout-api-slo"

      # Slow burn (24h, 3x)
      - alert: SLO_SlowBurn_24h
        expr: |
          (1 - avg_over_time(slo:service_success_ratio:5m[24h])) > (1 - 0.999) * 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "checkout-api SLO burning slow (24h)"
```

## Pricing

Single-purchase, lifetime access. $9.00.

Includes:

- 7 Python scripts (design, budget, prometheus, grafana, pagerduty, runbook, review)
- 4 reference docs (8 frameworks, 12 SLIs, burn rate, SRE book notes)
- 5 templates (SLO spec, Prometheus rules, Grafana dashboard, PagerDuty, runbook)
- Future updates for the same major version

## Example usage

> "We just launched a new checkout API. Define SLOs, generate Prometheus rules, and a Grafana dashboard for the team."

The skill will:

1. Ask 5-6 questions (user-facing? criticality? tier?)
2. Recommend SLOs (e.g. 99.9% availability, p99 < 300ms)
3. Calculate error budget (43.2 min/month)
4. Generate Prometheus rules (recording + burn-rate alerts)
5. Generate Grafana dashboard JSON
6. Generate runbook
7. Output `slo-package/` ready to commit

## Compatibility

Works with any agent that supports the SKILL.md standard and can execute Python: Claude Code, OpenClaw, Codex CLI, Cursor, Gemini CLI, Cline, Windsurf, Aider. Tested on Linux, macOS, Windows. Outputs standard Prometheus + Grafana formats; PagerDuty YAML imports directly.

## Tags

observability, slo, sli, sre, prometheus, grafana, monitoring, devops, reliability
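
The error-budget and burn-rate arithmetic quoted in the burn-rate section above (43.2 min for 99.9% over 30 days, 2% of budget at 14.4x for 1h) can be checked with a few lines of Python. This is a minimal sketch for illustration, not the skill's own `error_budget.py`; the function names `error_budget_minutes`, `budget_consumed`, and `error_ratio_threshold` are hypothetical.

```python
# Illustrative sketch only: function names are assumptions, not the skill's API.

def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of allowed unavailability in the window for an SLO target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_days: int = 30) -> float:
    """Fraction of the error budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


def error_ratio_threshold(slo_percent: float, burn_rate: float) -> float:
    """Error ratio at which an alert with this burn rate should fire."""
    return (1 - slo_percent / 100) * burn_rate


if __name__ == "__main__":
    print(f"{error_budget_minutes(99.9, 30):.1f} min")  # 43.2 min

    # (alert window in hours, burn rate) pairs from the burn-rate table
    for hours, rate in [(1, 14.4), (6, 6), (24, 3), (72, 1)]:
        spent = budget_consumed(rate, hours)
        limit = error_ratio_threshold(99.9, rate)
        print(f"{hours:>2}h at {rate}x: {spent:.0%} of budget, "
              f"fires above {limit:.4%} errors")
```

For the 1h row this prints 2% of budget, which matches the `14.4 * 1h / 720h` calculation; the same formula gives 5% for the 6h row and 10% for both slow-burn rows.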


How to Install

mkdir -p ~/.claude/skills && curl -sL https://www.agensi.io/api/install/observability-slo-architect -o /tmp/observability-slo-architect.zip && unzip -o /tmp/observability-slo-architect.zip -d ~/.claude/skills && rm /tmp/observability-slo-architect.zip

Free skills install directly. Paid skills require purchase - use the download button above after buying.

Reviews

No reviews yet - be the first to share your experience.

Only users who have downloaded or purchased this skill can leave a review.

Security Scanned

Passed automated security review

Permissions

Terminal / Shell

Allowed Hosts

wiki

File Scopes

scripts/**

Creator

Kaymue
