Error Budgets

SLI/SLO definitions and tracking for Hummingbird.

Terminology

  • SLI (Service Level Indicator): A quantitative measure of some aspect of the service. For example, latency from CVE fix availability to delivery.
  • SLO (Service Level Objective): A target value for an SLI. For example, “less than 24 hours”.
  • Error budget: The amount of allowed failure: 1 - SLO target. An SLO target of 80% gives an error budget of 20%.
  • Burn rate: How quickly the error budget is consumed relative to the reporting window. A burn rate of 1 means the budget is exhausted exactly at the end of the window; 2 means it is consumed twice as fast and exhausted halfway through.
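
The two definitions above reduce to simple arithmetic. The following is a minimal sketch, not code from the Hummingbird repositories; the function names and the notion of a "bad event" (for example, a CVE fix delivered later than the SLO threshold) are assumptions made for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure fraction: 1 - SLO target (target given as a fraction)."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed failure fraction divided by the error budget.

    1.0 means the budget is used up exactly over the window;
    2.0 means it is being used twice as fast.
    """
    budget = error_budget(slo_target)
    if total_events <= 0 or budget == 0.0:
        raise ValueError("need at least one event and a non-zero error budget")
    return (bad_events / total_events) / budget
```

With an 80% target, 20 late fixes out of 100 give a burn rate of about 1, and 40 out of 100 give about 2.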

For background on multi-window burn-rate alerting, see the Google SRE workbook.

UE1: Updated images include all upstream CVE fixes

Updated images are available for download that include all CVE fixes present upstream.

SLI-UE1-A: CVE Fix Cycle Time

The latency (cycle time) from a CVE fix being available upstream to being delivered in a Hummingbird image.

Percentile  SLO  Alarm
p80         24h  Yes (Hummingbird only; Rawhide informational)
p95         3d   Dashboard only
p99         7d   Dashboard only
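
As an illustration of how the per-percentile targets in the table can be checked against a set of cycle-time samples, here is a small sketch. It uses the nearest-rank percentile method, which is an assumption; CloudWatch computes percentile statistics its own way, so values near a threshold may differ slightly. The names THRESHOLD_HOURS, percentile, and evaluate are hypothetical.

```python
import math

# Thresholds from the table above, in hours (3d = 72h, 7d = 168h).
THRESHOLD_HOURS = {80: 24, 95: 72, 99: 168}


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def evaluate(samples):
    """Map each percentile label to True if it is within its threshold."""
    return {
        f"p{pct}": percentile(samples, pct) <= limit
        for pct, limit in THRESHOLD_HOURS.items()
    }
```

For ten samples where eight fixes land within a few hours and two take much longer, p80 can be met while p95 and p99 are missed, which is why the higher percentiles are tracked on the dashboard.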

Metric

The container-catalog metrics Lambda emits the CveExposureDuration distribution metric to CloudWatch. Each data point is the number of hours since first_seen for an active CVE on a non-superseded canonical tag. Metrics include a Fixable dimension (true or false) based on whether a fix version is known upstream. The SLO applies to fixable CVEs only; non-fixable CVEs are tracked for visibility.

Metric               Type          Dimensions
CveExposureDuration  Distribution  [Distro, Fixable], [Distro, Severity, Fixable]
ActiveCveCount       Count         [Distro, Severity, Fixable]
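
To make the shape of a data point concrete, the sketch below builds one CveExposureDuration entry in the format accepted by the CloudWatch PutMetricData API. It is not the Lambda's actual code: the helper names are hypothetical, and the unit "None" is an assumption (CloudWatch has no "Hours" unit).

```python
from datetime import datetime, timezone


def exposure_hours(first_seen: datetime, now: datetime) -> float:
    """Hours elapsed since the CVE was first seen."""
    return (now - first_seen).total_seconds() / 3600


def build_datum(first_seen: datetime, now: datetime, distro: str, fixable: bool) -> dict:
    """One MetricData entry using the [Distro, Fixable] dimension set."""
    return {
        "MetricName": "CveExposureDuration",
        "Dimensions": [
            {"Name": "Distro", "Value": distro},
            {"Name": "Fixable", "Value": "true" if fixable else "false"},
        ],
        "Value": exposure_hours(first_seen, now),
        "Unit": "None",  # assumption: hours are carried as a plain number
    }


datum = build_datum(
    first_seen=datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    now=datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc),
    distro="hummingbird",
    fixable=True,
)
```

A list of such entries could then be passed as MetricData to boto3's put_metric_data, together with whatever namespace the container-catalog stack uses.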

Dashboard

A per-distro {prefix} CloudWatch dashboard shows:

  • SLO gauge: p80/p95/p99 fixable CveExposureDuration vs per-percentile thresholds
  • Severity breakdown: Fixable CVE counts by severity (single-value)
  • Trends: Fixable CVE count over time by severity
  • Open fixable CVE inventory: Logs Insights table of fixable CVEs with fix versions

Alarm

A CloudWatch alarm fires when the p80 of CveExposureDuration for fixable CVEs (Fixable=true) exceeds 24 hours over a 1-hour evaluation period, and publishes to the {prefix}-slo-alarm SNS topic. The alarm is deployed only when AlertingEnabled=true (Hummingbird); Rawhide dashboards are informational only. When AlertEmail is configured, an SNS email subscription is also created; the recipient must confirm it via email.
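
The alarm itself is defined in the SAM stack, but its settings can be pictured as the keyword arguments one would pass to boto3's put_metric_alarm. The sketch below is an approximation under stated assumptions: the alarm name, the namespace argument, the split of the 1-hour evaluation into one 3600-second period, and the missing-data behavior are all hypothetical and not taken from the stack.

```python
def slo_alarm_params(prefix: str, distro: str, namespace: str, topic_arn: str) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm (illustrative only)."""
    return {
        "AlarmName": f"{prefix}-cve-fix-cycle-time-p80",  # hypothetical name
        "Namespace": namespace,                           # hypothetical namespace
        "MetricName": "CveExposureDuration",
        "Dimensions": [
            {"Name": "Distro", "Value": distro},
            {"Name": "Fixable", "Value": "true"},
        ],
        "ExtendedStatistic": "p80",
        "Period": 3600,              # assumption: one 1-hour period
        "EvaluationPeriods": 1,
        "Threshold": 24,             # hours
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption
        "AlarmActions": [topic_arn],         # the {prefix}-slo-alarm SNS topic
    }
```

Percentile statistics go in ExtendedStatistic rather than Statistic, which only accepts the basic aggregates such as Average or Maximum.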

Open CVE Inventory

The structured logs emitted by the metrics Lambda can be queried via CloudWatch Logs Insights to get a live inventory of all open fixable CVEs:

filter message = "active_cve" and fixable = true
| fields cve, severity, exposure_hours, fixed_in, repository, stream, variant, component
| sort exposure_hours desc
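
The same query can also be run programmatically. The sketch below assumes a boto3 CloudWatch Logs client is passed in (for example boto3.client("logs")) and uses its start_query and get_query_results calls; the function name, the log group argument, and the polling interval are hypothetical.

```python
import time

QUERY = """filter message = "active_cve" and fixable = true
| fields cve, severity, exposure_hours, fixed_in, repository, stream, variant, component
| sort exposure_hours desc"""

TERMINAL_STATUSES = ("Complete", "Failed", "Cancelled", "Timeout")


def open_fixable_cves(logs_client, log_group, start, end, poll_seconds=1.0):
    """Run the inventory query and return one dict per result row.

    start and end are epoch seconds bounding the query time range.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=QUERY,
    )["queryId"]

    # Logs Insights queries are asynchronous, so poll until a terminal status.
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")

    # Each row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
```

Passing the client in as an argument keeps the function easy to exercise with a stub in tests, without needing AWS credentials.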

Infrastructure

The dashboard and alarm live in the error-budgets SAM stack, deployed per-distro. The stack consumes CloudWatch metrics and log groups produced by the container-catalog stack.