# Observability SLO Architect
You don't have a reliability problem; you have a measurement problem. This skill turns "is the service up?" into quantifiable SLOs, error budgets, and alerts that wake the right people at the right time.
## What it does
End-to-end SLO engineering:
- **SLI selection** — 12 indicators across availability, latency, throughput, durability, correctness
- **SLO framework** — pick the right SLO for your service (8 patterns)
- **Error budget calculation** — convert SLO to downtime budget per window
- **Burn-rate alerts** — fast burn (1h, 6h) vs slow burn (24h, 3d)
- **Prometheus rules** — ready-to-use recording + alerting rules
- **Grafana dashboards** — SLO compliance, burn rate, error budget remaining
- **PagerDuty / Opsgenie policies** — escalation + paging based on burn rate
- **Runbook templates** — what to do when SLO is at risk
- **Review process** — monthly SLO review template
## When to use it
- You're launching a new service and need to define SLOs
- Your on-call is exhausted from alerts that don't matter
- Customers complain about reliability but you can't quantify it
- You want to introduce error budgets but don't know how
- Your SLOs are aspirational ("99.9%") but no one's held to them
- You need to migrate from "uptime" to SLO-based thinking
## Why it's better than ad-hoc prompting
Most "define SLOs" prompts give textbook answers. This skill is different:
- **8 specific SLO frameworks** — pick the right one for your service type
- **Ready-to-deploy rules** — Prometheus + Grafana, not pseudocode
- **Burn-rate alerts done right** — Google's multiwindow burn-rate model (2% of budget in 1h, 5% in 6h)
- **Concrete tradeoffs** — "user-facing latency" vs "internal availability"
- **Templates** — copy-paste runbooks, review docs
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│  Agent (Claude/Cursor)                                  │
│  - Reads service description                            │
│  - Asks about user journey, criticality                 │
│  - Generates SLO spec + rules + dashboards              │
└───────────────┬─────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│  skills/observability-slo-architect/                    │
│    scripts/                                             │
│    ├── design_slo.py      # SLI selection + SLO targets │
│    ├── error_budget.py    # Budget calculation          │
│    ├── gen_prometheus.py  # Recording + alerting rules  │
│    ├── gen_grafana.py     # Dashboard JSON              │
│    ├── gen_pagerduty.py   # Escalation policy           │
│    ├── gen_runbook.py     # Markdown runbook            │
│    └── review.py          # Monthly SLO review template │
│    references/                                          │
│    ├── 8-frameworks.md     # Which SLO pattern when     │
│    ├── sli-catalog.md      # 12 SLI definitions         │
│    ├── burn-rate-guide.md  # Fast vs slow burn          │
│    └── google-sre-book-notes.md                         │
│    templates/                                           │
│    ├── slo-spec.yaml                                    │
│    ├── prometheus-rules.yaml                            │
│    ├── grafana-dashboard.json                           │
│    ├── pagerduty-policy.yaml                            │
│    └── runbook.md.tmpl                                  │
└─────────────────────────────────────────────────────────┘
```
## Quick start
```bash
# 1. Install
pip install pyyaml
# 2. Design SLOs for a service
python scripts/design_slo.py --service "checkout-api" --user-facing --criticality high --out slo.yaml
# 3. Calculate error budget
python scripts/error_budget.py --slo 99.9 --window 30d
# 4. Generate Prometheus rules
python scripts/gen_prometheus.py --slo slo.yaml --out prometheus-rules.yaml
# 5. Generate Grafana dashboard
python scripts/gen_grafana.py --slo slo.yaml --out dashboard.json
# 6. Generate PagerDuty escalation
python scripts/gen_pagerduty.py --slo slo.yaml --out pagerduty.yaml
# 7. Generate runbook
python scripts/gen_runbook.py --slo slo.yaml --service "checkout-api" --out runbook.md
# 8. Monthly review template
python scripts/review.py --slo slo.yaml --month 2026-06 --out review-2026-06.md
```
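Step 3 comes down to simple arithmetic. The sketch below is not the bundled `error_budget.py` (its internals aren't shown here); `window_minutes` and `error_budget_minutes` are hypothetical helpers that only illustrate how an availability SLO converts into a downtime budget, assuming windows are written like `30d` or `6h`.

```python
import re

# Hypothetical helpers for illustration; not the bundled error_budget.py.
_UNIT_MINUTES = {"m": 1, "h": 60, "d": 24 * 60, "w": 7 * 24 * 60}


def window_minutes(window: str) -> int:
    """Parse a window such as '30d' or '6h' into minutes."""
    match = re.fullmatch(r"(\d+)([mhdw])", window)
    if match is None:
        raise ValueError(f"unsupported window: {window!r}")
    return int(match.group(1)) * _UNIT_MINUTES[match.group(2)]


def error_budget_minutes(slo_percent: float, window: str) -> float:
    """Minutes of full outage an availability SLO allows over the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    return (1 - slo_percent / 100) * window_minutes(window)


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"{slo}% over 30d -> {error_budget_minutes(slo, '30d'):.2f} min")
```

For a 30-day window this prints 432.00, 43.20, and 4.32 minutes: each extra nine cuts the budget by a factor of ten.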
## The 8 SLO frameworks (which to use)
| # | Service type | Primary SLO | Secondary SLO |
|---|--------------|-------------|---------------|
| 1 | **User-facing API** | Availability (99.9%) | Latency p99 (300ms) |
| 2 | **Read-heavy DB** | Query success rate | Read latency p95 |
| 3 | **Write-heavy DB** | Durability (zero loss) | Write latency p99 |
| 4 | **Async worker** | Throughput (events/min) | End-to-end latency p99 |
| 5 | **Batch job** | Job success rate | Duration p95 |
| 6 | **CDN** | Cache hit rate | Edge latency p95 |
| 7 | **Auth service** | Auth success rate | Token issuance latency |
| 8 | **Internal API** | Request success rate | Internal latency p95 |
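
If you want the table above in machine-readable form, a plain lookup is enough. This is a minimal sketch: the dictionary keys and the `recommend` function are illustrative names, not part of the skill's scripts, and the values are transcribed directly from the table.

```python
from typing import NamedTuple


class Framework(NamedTuple):
    primary: str
    secondary: str


# Values transcribed from the table above; the keys are illustrative identifiers.
FRAMEWORKS = {
    "user-facing-api": Framework("Availability (99.9%)", "Latency p99 (300ms)"),
    "read-heavy-db": Framework("Query success rate", "Read latency p95"),
    "write-heavy-db": Framework("Durability (zero loss)", "Write latency p99"),
    "async-worker": Framework("Throughput (events/min)", "End-to-end latency p99"),
    "batch-job": Framework("Job success rate", "Duration p95"),
    "cdn": Framework("Cache hit rate", "Edge latency p95"),
    "auth-service": Framework("Auth success rate", "Token issuance latency"),
    "internal-api": Framework("Request success rate", "Internal latency p95"),
}


def recommend(service_type: str) -> Framework:
    """Return the primary/secondary SLO pair for a known service type."""
    try:
        return FRAMEWORKS[service_type]
    except KeyError:
        known = ", ".join(sorted(FRAMEWORKS))
        raise ValueError(f"unknown service type {service_type!r}; expected one of: {known}") from None


if __name__ == "__main__":
    choice = recommend("async-worker")
    print(f"primary: {choice.primary}, secondary: {choice.secondary}")
```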
## The 12 SLIs (catalog)
### Availability (3)
- `sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
- `uptime_seconds / total_seconds`
- `successful_requests / total_requests`
### Latency (4)
- `histogram_quantile(0.99, ...)` (p99)
- `histogram_quantile(0.95, ...)` (p95)
- `histogram_quantile(0.50, ...)` (p50)
- `requests_under_threshold / total_requests` (good / total, where the threshold is the latency target)
### Throughput (2)
- `events_processed_total / time_window`
- `bytes_processed_total / time_window`
### Durability (2)
- `1 - (data_loss_bytes / total_bytes_stored)`
- `successful_writes / total_write_attempts`
### Correctness (1)
- `correct_responses / total_responses` (where "correct" is defined per service)
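
Most of the SLIs in this catalog share one shape: good events divided by total events. The sketch below shows that shape for availability and for the good / total latency SLI, computed from raw request records. The `Request` type and both function names are assumptions for illustration. In production you would normally compute these in Prometheus rather than in Python.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if not requests:
        raise ValueError("no requests in window; the SLI is undefined")
    good = sum(1 for r in requests if not 500 <= r.status <= 599)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Fraction of requests served at or under the latency threshold."""
    if not requests:
        raise ValueError("no requests in window; the SLI is undefined")
    good = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return good / len(requests)


if __name__ == "__main__":
    sample = [
        Request(200, 120.0),
        Request(200, 450.0),
        Request(200, 310.0),
        Request(404, 30.0),   # client error: still counts as available
        Request(503, 90.0),   # server error: counts against availability
    ]
    print(f"availability: {availability_sli(sample):.2f}")
    print(f"latency (<= 300ms): {latency_sli(sample, 300):.2f}")
```

Raising on an empty window is a design choice here. Some teams treat "no traffic" as 100% instead, so pick one convention and document it in the SLO spec.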
## Burn-rate alerts (Google SRE model)
For a 30-day SLO with 99.9% target (43.2 min budget):
| Alert | Window | Burn rate | Notify | Page? |
|-------|--------|-----------|----------|-------|
| **Fast burn (1h)** | 1h | 14.4x | Page on-call | Yes |
| **Fast burn (6h)** | 6h | 6x | Page on-call | Yes |
| **Slow burn (24h)** | 24h | 3x | Slack | No |
| **Slow burn (3d)** | 3d | 1x | Slack | No |
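The math: a burn rate of 14.4x sustained for 1h consumes 14.4 * 1h / 720h = 2% of the 30-day budget, so this alert fires when 2% of the budget is consumed in 1h. The same formula gives 5% for the 6h window at 6x, and 10% for both the 24h window at 3x and the 3d window at 1x.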
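
To check the arithmetic for every tier in the table, here is a small sketch. The function names are illustrative, and the 720h period assumes the 30-day SLO window used above.

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO period = 720h

# (alert, window in hours, burn rate), copied from the table above
TIERS = [
    ("Fast burn (1h)", 1, 14.4),
    ("Fast burn (6h)", 6, 6.0),
    ("Slow burn (24h)", 24, 3.0),
    ("Slow burn (3d)", 72, 1.0),
]


def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = PERIOD_HOURS) -> float:
    """Fraction of the period's error budget spent if burn_rate holds for the window."""
    return burn_rate * window_hours / period_hours


def hours_to_exhaustion(burn_rate: float,
                        period_hours: float = PERIOD_HOURS) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return period_hours / burn_rate


if __name__ == "__main__":
    for name, window, rate in TIERS:
        print(f"{name}: {budget_consumed(rate, window):.0%} of budget, "
              f"exhausted in {hours_to_exhaustion(rate):.0f}h at this rate")
```

At 14.4x the whole 30-day budget is gone in 50 hours, which is why the fast-burn tiers page instead of posting to Slack.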
## Sample output (Prometheus rules)
```yaml
groups:
  - name: checkout-api-slo
    interval: 30s
    rules:
      # Recording: success ratio
      - record: slo:service_success_ratio:5m
        expr: |
          sum(rate(http_requests_total{service="checkout-api",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="checkout-api"}[5m]))
      # Recording: error budget remaining
      - record: slo:error_budget_remaining:30d
        expr: |
          1 - (
            (1 - avg_over_time(slo:service_success_ratio:5m[30d]))
            / (1 - 0.999)
          )
      # Fast burn (1h, 14.4x): average the 5m ratio over the 1h window
      - alert: SLO_FastBurn_1h
        expr: |
          (1 - avg_over_time(slo:service_success_ratio:5m[1h]))
            > (1 - 0.999) * 14.4
        for: 2m
        labels:
          severity: critical
          slo: checkout-api
        annotations:
          summary: "checkout-api SLO burning fast (1h)"
          runbook: "https://wiki/runbooks/checkout-api-slo"
      # Slow burn (24h, 3x)
      - alert: SLO_SlowBurn_24h
        expr: |
          (1 - avg_over_time(slo:service_success_ratio:5m[24h]))
            > (1 - 0.999) * 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "checkout-api SLO burning slow (24h)"
```
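
To sanity-check thresholds before deploying rules like these, you can mirror the expressions in plain Python. This is a rough sketch: `error_budget_remaining` and `burn_alert_fires` are hypothetical names that restate the PromQL arithmetic, and the inputs stand in for the averaged success ratio Prometheus would compute over each window.

```python
SLO_TARGET = 0.999  # same target as the sample rules


def error_budget_remaining(avg_success_ratio: float,
                           slo: float = SLO_TARGET) -> float:
    """Mirror of slo:error_budget_remaining:30d (1.0 = untouched, below 0 = overspent)."""
    return 1 - (1 - avg_success_ratio) / (1 - slo)


def burn_alert_fires(avg_success_ratio: float, burn_rate: float,
                     slo: float = SLO_TARGET) -> bool:
    """True when the observed error ratio exceeds burn_rate times the allowed error ratio."""
    return (1 - avg_success_ratio) > (1 - slo) * burn_rate


if __name__ == "__main__":
    # 99.95% success averaged over 30 days leaves about half the budget.
    print(f"budget remaining: {error_budget_remaining(0.9995):.2f}")
    # 98% success over the last hour is a 2% error ratio, above the 1.44% fast-burn line.
    print("fast burn fires:", burn_alert_fires(0.98, 14.4))
    # 99% success over 24h is a 1% error ratio, above the 0.3% slow-burn line.
    print("slow burn fires:", burn_alert_fires(0.99, 3))
```

A negative result from `error_budget_remaining` means the budget is already overspent for the window. That is a useful input for the monthly review.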
## Pricing
Single-purchase, lifetime access. $9.00.
Includes:
- 7 Python scripts (design, budget, prometheus, grafana, pagerduty, runbook, review)
- 4 reference docs (8 frameworks, 12 SLIs, burn rate, SRE book notes)
- 5 templates (SLO spec, Prometheus rules, Grafana dashboard, PagerDuty, runbook)
- Future updates for the same major version
## Example usage
> "We just launched a new checkout API. Define SLOs, generate Prometheus rules, and a Grafana dashboard for the team."
The skill will:
1. Ask 5-6 questions (user-facing? criticality? tier?)
2. Recommend SLOs (e.g. 99.9% availability, p99 < 300ms)
3. Calculate error budget (43.2 min/month)
4. Generate Prometheus rules (recording + burn-rate alerts)
5. Generate Grafana dashboard JSON
6. Generate runbook
7. Output `slo-package/` ready to commit
## Compatibility
Works with any agent that supports the SKILL.md standard and can execute Python: Claude Code, OpenClaw, Codex CLI, Cursor, Gemini CLI, Cline, Windsurf, Aider. Tested on Linux, macOS, Windows. Outputs standard Prometheus + Grafana formats; PagerDuty YAML imports directly.
## Tags
observability, slo, sli, sre, prometheus, grafana, monitoring, devops, reliability