Error Budgets
SLI/SLO definitions and tracking for Hummingbird.
Terminology
- SLI (Service Level Indicator): A quantitative measure of some aspect of the service. For example, latency from CVE fix availability to delivery.
- SLO (Service Level Objective): A target value for an SLI. For example, “less than 24 hours”.
- Error budget: The amount of allowed failure: 1 - SLO target. An SLO target of 80% gives an error budget of 20%.
- Burn rate: How quickly the error budget is consumed. A burn rate of 1 means the budget is consumed evenly over the reporting window; 2 means twice as fast.
For background on multi-window burn-rate alerting, see the Google SRE workbook.
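As a rough illustration of the two definitions above, the following Python sketch computes an error budget and a burn rate. The function names and the event counts are hypothetical and chosen only for this example; it assumes burn rate is the observed failure fraction divided by the error budget, which is the usual reading of the term.

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure fraction: 1 - SLO target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed failure fraction divided by the error budget.

    1.0 means the budget is being used exactly as fast as the SLO allows;
    2.0 means twice as fast.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (bad_events / total_events) / error_budget(slo_target)


# Made-up numbers: an 80% SLO target and 40 of 100 events outside the objective.
budget = error_budget(0.80)   # approximately 0.2
rate = burn_rate(bad_events=40, total_events=100, slo_target=0.80)  # approximately 2.0
```

With these numbers the failure fraction (40%) is double the 20% budget, so the budget would run out halfway through the reporting window.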
UE1: Updated images include all upstream CVE fixes
Updated images are available for download that include all CVE fixes present upstream.
SLI-UE1-A: CVE Fix Cycle Time
The latency (cycle time) from a CVE fix being available upstream to being delivered in a Hummingbird image.
| Percentile | SLO | Alarm |
|---|---|---|
| p80 | 24h | Yes (Hummingbird only; Rawhide informational) |
| p95 | 3d | Dashboard only |
| p99 | 7d | Dashboard only |
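The table can be read as three independent checks on the same distribution of cycle times. The Python sketch below shows one way to evaluate them offline. It assumes a nearest-rank percentile (CloudWatch's own percentile calculation may differ slightly) and treats a value equal to the threshold as within the SLO; the sample data and function names are hypothetical.

```python
# Thresholds in hours, from the table above: p80 24h, p95 3d, p99 7d.
SLO_HOURS = {80: 24, 95: 72, 99: 168}


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def evaluate_slo(cycle_times_hours):
    """Return {percentile: (observed_hours, within_slo)} for each target."""
    result = {}
    for pct, limit in SLO_HOURS.items():
        observed = percentile(cycle_times_hours, pct)
        result[pct] = (observed, observed <= limit)
    return result


# Made-up sample: ten fix cycle times in hours.
sample = [2, 5, 8, 12, 16, 20, 22, 23, 60, 200]
result = evaluate_slo(sample)
```

In this sample the p80 target is met (23 hours), while the single 200-hour outlier pushes both p95 and p99 past their thresholds.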
Metric
The container-catalog metrics Lambda
emits the CveExposureDuration distribution metric to CloudWatch. Each data
point is the number of hours since first_seen for an active CVE on a
non-superseded canonical tag. Metrics include a Fixable dimension (true or
false) based on whether a fix version is known upstream. The SLO applies to
fixable CVEs only; non-fixable CVEs are tracked for visibility.
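To make the data point concrete, here is a minimal Python sketch of how one value and its Fixable dimension could be derived. It is not the Lambda's actual code: the function name, the return shape, and the use of a fixed_in value to decide fixability are assumptions made for illustration.

```python
from datetime import datetime, timezone


def exposure_data_point(first_seen, fixed_in, now=None):
    """Build one hypothetical CveExposureDuration data point.

    first_seen: timezone-aware datetime when the CVE was first observed.
    fixed_in:   upstream fix version, or None/"" when no fix is known.
    """
    now = now or datetime.now(timezone.utc)
    hours = (now - first_seen).total_seconds() / 3600.0
    return {
        "CveExposureDuration": hours,
        # Assumption: a CVE counts as fixable when a fix version is known.
        "Fixable": "true" if fixed_in else "false",
    }


# Made-up timestamps and version string.
now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
first_seen = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
point = exposure_data_point(first_seen, fixed_in="1.2.3-4", now=now)
unfixable = exposure_data_point(first_seen, fixed_in=None, now=now)
```

Passing `now` explicitly keeps the calculation deterministic, which is convenient when testing.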
| Metric | Type | Dimensions |
|---|---|---|
| CveExposureDuration | Distribution | [Distro, Fixable], [Distro, Sev, Fix] |
| ActiveCveCount | Count | [Distro, Severity, Fixable] |
Dashboard
A per-distro {prefix} CloudWatch dashboard shows:
- SLO gauge: p80/p95/p99 fixable CveExposureDuration vs per-percentile thresholds
- Severity breakdown: Fixable CVE counts by severity (single-value)
- Trends: Fixable CVE count over time by severity
- Open fixable CVE inventory: Logs Insights table of fixable CVEs with fix versions
Alarm
A CloudWatch Alarm fires when the p80 of CveExposureDuration for fixable CVEs
(Fixable=true) exceeds 24 hours over a 1-hour evaluation period, publishing to
the {prefix}-slo-alarm SNS topic.
The alarm is only deployed when AlertingEnabled=true (Hummingbird); Rawhide
dashboards are informational only. When an AlertEmail is configured, an SNS
email subscription is created (the recipient must confirm via email).
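The alarm is defined in the stack, not in Python, but its settings can be restated as arguments shaped for boto3's `put_metric_alarm` to show how the pieces fit together. In this sketch the namespace, the alarm name, the Distro dimension value, and the reading of "1-hour evaluation period" as one 3600-second period are all assumptions; only the metric name, the p80 statistic, the 24-hour threshold, the Fixable=true dimension, and the AlertingEnabled switch come from the description above.

```python
def slo_alarm_params(prefix, distro, topic_arn, alerting_enabled,
                     namespace="Example/Namespace"):
    """Return keyword arguments for cloudwatch.put_metric_alarm, or None
    when alerting is disabled (dashboards only)."""
    if not alerting_enabled:
        return None
    return {
        "AlarmName": f"{prefix}-slo-alarm",   # hypothetical alarm name
        "Namespace": namespace,               # hypothetical namespace
        "MetricName": "CveExposureDuration",
        "Dimensions": [
            {"Name": "Distro", "Value": distro},
            {"Name": "Fixable", "Value": "true"},
        ],
        "ExtendedStatistic": "p80",           # percentile statistic
        "Period": 3600,                       # assumed: one 1-hour period
        "EvaluationPeriods": 1,
        "Threshold": 24.0,                    # hours
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],          # SNS topic to notify
    }


# Placeholder values; the ARN is not a real topic.
params = slo_alarm_params(
    prefix="example",
    distro="hummingbird",
    topic_arn="arn:aws:sns:us-east-1:123456789012:example-slo-alarm",
    alerting_enabled=True,
)
disabled = slo_alarm_params("example", "rawhide", "unused", alerting_enabled=False)
```

If you wanted to create the alarm from Python, the resulting dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.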
Open CVE Inventory
The metrics Lambda emits structured logs that can be queried with CloudWatch Logs Insights for a live inventory of open CVEs. The following query lists the fixable ones, longest exposure first:
filter message = "active_cve" and fixable = true
| fields cve, severity, exposure_hours, fixed_in, repository, stream, variant, component
| sort exposure_hours desc
Infrastructure
The dashboard and alarm live in the error-budgets SAM stack, deployed per-distro. The stack consumes CloudWatch metrics and log groups produced by the container-catalog stack.