Error Budgets
SLI/SLO definitions and tracking for Hummingbird.
Terminology
- SLI (Service Level Indicator): A quantitative measure of some aspect of the service. For example, latency from CVE fix availability to delivery.
- SLO (Service Level Objective): A target value for an SLI. For example, “less than 24 hours”.
- Error budget: The amount of allowed failure: 1 - SLO target. An SLO target of 80% gives an error budget of 20%.
- Burn rate: How quickly the error budget is consumed. A burn rate of 1 means the budget is consumed evenly over the reporting window; 2 means twice as fast.
For background on multi-window burn-rate alerting, see the Google SRE workbook.
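The arithmetic behind the error budget and burn rate definitions above can be sketched as follows. This is a minimal illustration, not part of the Hummingbird tooling: the function names are hypothetical, and burn rate is taken to be the observed failure fraction divided by the error budget, as in the SRE workbook.

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure fraction for an SLO target given as a fraction (0.80 for 80%)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Observed failure fraction divided by the error budget."""
    return bad_fraction / error_budget(slo_target)


# An 80% target leaves a 20% budget.
print(round(error_budget(0.80), 2))      # 0.2
# 20% of events failing consumes the budget exactly over the window.
print(round(burn_rate(0.20, 0.80), 2))   # 1.0
# 40% of events failing consumes it twice as fast.
print(round(burn_rate(0.40, 0.80), 2))   # 2.0
```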
UE1: Updated images include all upstream CVE fixes
Updated images are available for download that include all CVE fixes present upstream.
SLI-UE1-A: CVE Fix Cycle Time
The latency (cycle time) from a CVE being first detected on a non-superseded image tag to being remediated (image rebuilt without the CVE).
| Percentile | SLO | Alarm |
|---|---|---|
| p80 | 24h | Yes (Hummingbird only; Rawhide informational) |
| p95 | 3d | Dashboard only |
| p99 | 7d | Dashboard only |
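As a rough illustration of how the percentile targets in the table are read, the sketch below checks a list of durations (in hours) against the three thresholds. It uses a nearest-rank percentile, which is an assumption; CloudWatch computes percentiles with its own method and may give slightly different values. The sample data and function names are hypothetical.

```python
# SLO thresholds in hours, taken from the table above (3d = 72h, 7d = 168h).
SLO_HOURS = {80: 24, 95: 72, 99: 168}


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def evaluate_slo(durations_hours):
    """Return {percentile: (observed_hours, threshold_hours, met)}."""
    result = {}
    for pct, limit in SLO_HOURS.items():
        observed = percentile(durations_hours, pct)
        result[pct] = (observed, limit, observed <= limit)
    return result


# Hypothetical durations in hours for ten CVEs.
sample = [2, 5, 6, 8, 12, 18, 20, 23, 60, 200]
for pct, (observed, limit, met) in evaluate_slo(sample).items():
    print(f"p{pct}: {observed}h (limit {limit}h) -> {'met' if met else 'missed'}")
```

With this sample, p80 is within its 24h target while the single 200h outlier pushes both p95 and p99 past theirs, which is why the higher percentiles are useful for spotting long-tail CVEs on the dashboard.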
Metric
The container-catalog metrics Lambda
emits the CveExposureDuration distribution metric to CloudWatch. Each data
point is the number of hours since first_seen for an active CVE on a
non-superseded canonical tag. The SLO tracks all CVEs regardless of fix
availability.
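The data point described above can be sketched as a simple time difference. This is only an illustration of the calculation, not the Lambda's actual code: the record shape (the `first_seen` and `superseded` keys) and the function names are assumptions.

```python
from datetime import datetime, timezone


def exposure_hours(first_seen: datetime, now: datetime) -> float:
    """Hours elapsed between first_seen and now; both must be timezone-aware."""
    return (now - first_seen).total_seconds() / 3600.0


def data_points(active_cves, now):
    """One exposure value per active CVE that sits on a non-superseded tag."""
    return [
        exposure_hours(cve["first_seen"], now)
        for cve in active_cves
        if not cve["superseded"]
    ]


# Hypothetical records; real records come from the container catalog.
now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
cves = [
    {"first_seen": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), "superseded": False},
    {"first_seen": datetime(2024, 1, 3, 6, 0, tzinfo=timezone.utc), "superseded": False},
    {"first_seen": datetime(2023, 12, 1, 0, 0, tzinfo=timezone.utc), "superseded": True},
]
print(data_points(cves, now))  # [48.0, 6.0]
```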
| Metric | Type | Dimensions |
|---|---|---|
| CveExposureDuration | Distribution | [Distro], [Distro, Severity] |
| ActiveCveCount | Count | [Distro, Severity] |
Dashboard
A per-distro {prefix} CloudWatch dashboard shows:
- SLO gauge: p80/p95/p99 CveExposureDuration vs per-percentile thresholds
- Severity breakdown: CVE counts by severity (single-value)
- Trends: CVE count over time by severity
- Open CVE inventory: Logs Insights table of all active CVEs
Alarm
A CloudWatch Alarm fires when p80 CveExposureDuration exceeds 24 hours over
a 1-hour evaluation period, publishing to the {prefix}-slo-alarm SNS topic.
The alarm is only deployed when AlertingEnabled=true (Hummingbird); Rawhide
dashboards are informational only. When an AlertEmail is configured, an SNS
email subscription is created (the recipient must confirm via email).
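The alarm itself is defined in the SAM stack, but its shape can be illustrated as the parameters one would pass to the CloudWatch PutMetricAlarm API. The sketch below is an approximation under several assumptions: the alarm name and namespace are hypothetical, and "1-hour evaluation period" is interpreted as a single 3600-second period.

```python
def slo_alarm_params(prefix: str, distro: str, topic_arn: str) -> dict:
    """Build PutMetricAlarm parameters for the p80 CveExposureDuration alarm."""
    return {
        "AlarmName": f"{prefix}-cve-exposure-p80",  # hypothetical name
        "Namespace": "ContainerCatalog",            # hypothetical namespace
        "MetricName": "CveExposureDuration",
        "Dimensions": [{"Name": "Distro", "Value": distro}],
        "ExtendedStatistic": "p80",                 # percentile statistic
        "Period": 3600,                             # assumed: one 1-hour period
        "EvaluationPeriods": 1,
        "Threshold": 24.0,                          # hours, from the SLO table
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],                # the {prefix}-slo-alarm SNS topic
    }


# Placeholder values for illustration only.
params = slo_alarm_params(
    prefix="example",
    distro="hummingbird",
    topic_arn="arn:aws:sns:us-east-1:123456789012:example-slo-alarm",
)
print(params["ExtendedStatistic"], params["Threshold"])
```

With boto3 installed, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`; the deployed stack remains the source of truth for the real values.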
Open CVE Inventory
The structured logs emitted by the metrics Lambda can be queried via CloudWatch Logs Insights to get a live inventory of all open CVEs:
filter message = "active_cve"
| sort @timestamp desc
| dedup cve, repository, stream, variant
| fields cve, severity, exposure_hours, repository, stream, variant, components, run_ts
| sort exposure_hours desc
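To make the intent of the query concrete, the sketch below applies the same steps to an in-memory list of log events: keep only active_cve events, retain the newest event per (cve, repository, stream, variant), and sort by exposure. The event dictionaries, the `timestamp` key, and the CVE identifiers are hypothetical stand-ins for the real structured logs.

```python
def open_cve_inventory(events):
    """Newest active_cve event per CVE/repository/stream/variant, longest exposure first."""
    key_fields = ("cve", "repository", "stream", "variant")
    latest = {}
    # Iterate newest first, so the first event seen for each key is the one kept.
    for event in sorted(events, key=lambda e: e["timestamp"], reverse=True):
        if event.get("message") != "active_cve":
            continue
        key = tuple(event[field] for field in key_fields)
        latest.setdefault(key, event)
    return sorted(latest.values(), key=lambda e: e["exposure_hours"], reverse=True)


def make_event(ts, cve, hours):
    """Helper that builds a hypothetical active_cve log event."""
    return {
        "timestamp": ts, "message": "active_cve", "cve": cve,
        "repository": "repo", "stream": "latest", "variant": "default",
        "exposure_hours": hours,
    }


events = [
    make_event(1, "CVE-0000-0001", 10),
    make_event(2, "CVE-0000-0001", 11),   # newer report of the same CVE
    make_event(2, "CVE-0000-0002", 30),
    {"timestamp": 2, "message": "run_complete"},  # ignored by the filter
]
for row in open_cve_inventory(events):
    print(row["cve"], row["exposure_hours"])
```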
Infrastructure
The dashboard and alarm live in the error-budgets SAM stack, deployed per-distro. The stack consumes CloudWatch metrics and log groups produced by the container-catalog stack.