Error Budgets

SLI/SLO definitions and tracking for Hummingbird.

Terminology

  • SLI (Service Level Indicator): A quantitative measure of some aspect of the service. For example, latency from CVE fix availability to delivery.
  • SLO (Service Level Objective): A target value for an SLI. For example, “less than 24 hours”.
  • Error budget: The fraction of allowed failure, computed as 1 - SLO target. For example, an SLO target of 80% gives an error budget of 20%.
  • Burn rate: How quickly the error budget is consumed. A burn rate of 1 means the budget is consumed evenly over the reporting window; 2 means twice as fast.

For background on multi-window burn-rate alerting, see the Google SRE workbook.
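The error budget and burn rate definitions above reduce to simple arithmetic. The following is a minimal sketch, not code from the Hummingbird repository; the function names and the sample numbers are hypothetical and chosen only for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to miss the objective: 1 - SLO target."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return 1.0 - slo_target


def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Observed failure fraction divided by the budgeted failure fraction.

    1.0 means the budget is consumed evenly over the reporting window;
    2.0 means it is consumed twice as fast.
    """
    budget = error_budget(slo_target)
    if budget == 0.0:
        raise ValueError("a 100% SLO target leaves no error budget")
    return bad_fraction / budget


# Hypothetical sample: 30 of 100 CVEs missed the objective under an 80% target.
print(round(error_budget(0.80), 2))          # 0.2
print(round(burn_rate(30 / 100, 0.80), 2))   # 1.5
```

A burn rate of 1.5 in this sample means the budget would run out after two thirds of the reporting window if the failure fraction stayed constant.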

UE1: Updated images include all upstream CVE fixes

Updated images that include all CVE fixes present upstream are available for download.

SLI-UE1-A: CVE Fix Cycle Time

The latency (cycle time) from the first detection of a CVE on a non-superseded image tag to its remediation (the image is rebuilt without the CVE).

Percentile  SLO  Alarm
p80         24h  Yes (Hummingbird only; Rawhide informational)
p95         3d   Dashboard only
p99         7d   Dashboard only
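To make the percentile objectives concrete, the sketch below checks a list of exposure durations (in hours) against the three thresholds in the table above. It uses a nearest-rank percentile, which is an assumption made for illustration; CloudWatch computes percentile statistics with its own method, so results near a threshold may differ. The function names and sample values are hypothetical.

```python
import math

# Thresholds from the table above, expressed in hours.
SLO_HOURS = {80: 24.0, 95: 3 * 24.0, 99: 7 * 24.0}


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(exposure_hours):
    """Map each percentile to (observed hours, threshold hours, within objective)."""
    report = {}
    for pct, limit in SLO_HOURS.items():
        observed = percentile(exposure_hours, pct)
        report[pct] = (observed, limit, observed <= limit)
    return report


# Hypothetical exposure durations in hours.
sample = [2, 5, 8, 12, 20, 22, 23, 30, 60, 100]
for pct, (observed, limit, ok) in slo_report(sample).items():
    print(f"p{pct}: {observed}h vs {limit}h -> {'ok' if ok else 'breached'}")
# p80: 30h vs 24.0h -> breached
# p95: 100h vs 72.0h -> breached
# p99: 100h vs 168.0h -> ok
```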

Metric

The container-catalog metrics Lambda emits the CveExposureDuration distribution metric to CloudWatch. Each data point is the number of hours since first_seen for an active CVE on a non-superseded canonical tag. The SLO tracks all CVEs regardless of fix availability.

Metric               Type          Dimensions
CveExposureDuration  Distribution  [Distro], [Distro, Severity]
ActiveCveCount       Count         [Distro, Severity]
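As an illustration of how these data points could be assembled, the sketch below computes hours since first_seen for each active CVE and builds entries in the shape accepted by the CloudWatch PutMetricData API (MetricName, Dimensions, Values or Value, Unit). This is not the Lambda's actual code: the input record shape, the function names, and the namespace mentioned in the closing comment are assumptions. CloudWatch has no hours unit, so the sketch uses "None" for the duration metric.

```python
from datetime import datetime, timezone


def exposure_hours(first_seen: datetime, now: datetime) -> float:
    """Hours elapsed since the CVE was first seen on the tag."""
    return (now - first_seen).total_seconds() / 3600.0


def build_metric_data(distro, active_cves, now):
    """Build entries for [Distro] and [Distro, Severity] as listed in the table above."""
    by_severity = {}
    for cve in active_cves:  # each record: {"first_seen": datetime, "severity": str} (assumed shape)
        hours = exposure_hours(cve["first_seen"], now)
        by_severity.setdefault(cve["severity"], []).append(hours)

    entries = []
    all_hours = [h for hours in by_severity.values() for h in hours]
    if all_hours:
        entries.append({
            "MetricName": "CveExposureDuration",
            "Dimensions": [{"Name": "Distro", "Value": distro}],
            "Values": all_hours,
            "Unit": "None",
        })
    for severity, hours in sorted(by_severity.items()):
        dims = [{"Name": "Distro", "Value": distro},
                {"Name": "Severity", "Value": severity}]
        entries.append({"MetricName": "CveExposureDuration", "Dimensions": dims,
                        "Values": hours, "Unit": "None"})
        entries.append({"MetricName": "ActiveCveCount", "Dimensions": dims,
                        "Value": float(len(hours)), "Unit": "Count"})
    return entries


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
cves = [{"first_seen": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), "severity": "High"}]
print(build_metric_data("hummingbird", cves, now)[0]["Values"])  # [24.0]

# With boto3, the entries could be sent with something like:
#   boto3.client("cloudwatch").put_metric_data(Namespace="<namespace>", MetricData=entries)
# where the namespace is whatever the container-catalog stack uses (not stated here).
```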

Dashboard

A per-distro {prefix} CloudWatch dashboard shows:

  • SLO gauge: p80/p95/p99 CveExposureDuration vs per-percentile thresholds
  • Severity breakdown: CVE counts by severity (single-value)
  • Trends: CVE count over time by severity
  • Open CVE inventory: Logs Insights table of all active CVEs

Alarm

A CloudWatch alarm fires when the p80 of CveExposureDuration exceeds 24 hours over a 1-hour evaluation period, and publishes to the {prefix}-slo-alarm SNS topic. The alarm is deployed only when AlertingEnabled=true (Hummingbird); Rawhide dashboards are informational only. When an AlertEmail is configured, an SNS email subscription is created, and the recipient must confirm it via email.
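The alarm described here is defined in the SAM stack, but its settings can be sketched as the keyword arguments one would pass to the CloudWatch PutMetricAlarm API. The snippet is an illustration under assumptions: the alarm name, the namespace default, and the TreatMissingData choice are hypothetical and not taken from the stack. Only the p80 statistic, the 24-hour threshold, the 1-hour period, and the AlertingEnabled gate come from the description above.

```python
def slo_alarm_kwargs(prefix, distro, topic_arn, alerting_enabled,
                     namespace="ExampleNamespace"):  # namespace is a placeholder
    """Return put_metric_alarm keyword arguments, or None when alerting is disabled."""
    if not alerting_enabled:
        return None  # mirrors AlertingEnabled=false: dashboards only, no alarm
    return {
        "AlarmName": f"{prefix}-cve-exposure-p80",  # hypothetical alarm name
        "Namespace": namespace,
        "MetricName": "CveExposureDuration",
        "Dimensions": [{"Name": "Distro", "Value": distro}],
        "ExtendedStatistic": "p80",        # percentile statistic
        "Period": 3600,                    # 1-hour evaluation period, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 24.0,                 # hours
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption, not stated in the source
        "AlarmActions": [topic_arn],       # the {prefix}-slo-alarm SNS topic
    }


kwargs = slo_alarm_kwargs(
    prefix="example",
    distro="hummingbird",
    topic_arn="arn:aws:sns:us-east-1:123456789012:example-slo-alarm",
    alerting_enabled=True,
)
print(kwargs["ExtendedStatistic"], kwargs["Threshold"])  # p80 24.0

# With boto3: boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```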

Open CVE Inventory

The structured logs emitted by the metrics Lambda can be queried via CloudWatch Logs Insights to get a live inventory of all open CVEs:

filter message = "active_cve"
| sort @timestamp desc
| dedup cve, repository, stream, variant
| fields cve, severity, exposure_hours, repository, stream, variant, components, run_ts
| sort exposure_hours desc

Infrastructure

The dashboard and alarm live in the error-budgets SAM stack, deployed per-distro. The stack consumes CloudWatch metrics and log groups produced by the container-catalog stack.