stable URL: /l/hummingbird-monitoring

Hummingbird Monitoring

Shared observability resources for all Hummingbird AWS stacks. Provides a central SNS alerts topic, PromQL-based CloudWatch alarms, operational dashboards for metrics ingested via the OTLP endpoint, and IAM resources for the OpenTelemetry Collector.

This stack handles infrastructure alerting (Lambda errors, DLQ depth, API 5xx). For SLO alerting (CVE exposure burn-rate), see Error Budgets.

Features

Central Alerts Topic: Single SNS topic shared by all Hummingbird stacks
PromQL Alarms: 14 alarms evaluating Prometheus metrics in CloudWatch (credential expiry, service health, K8s workload, agent health)
Operational Dashboards: 4 dashboards visualizing OTLP metrics via PromQL (service health, agent operations, K8s health, credential expiry)
Email Subscription: Optional email subscription; the recipient confirms it via a link in the SNS confirmation email
Opt-in per Stack: Consumer stacks receive the topic ARN as a parameter; alarms are wired only when the ARN is non-empty

Architecture

Metrics flow from Kubernetes services to CloudWatch via an OpenTelemetry Collector, where PromQL alarms and dashboards evaluate them:

flowchart TD
  subgraph k8s [OCP Clusters]
    dashboard["hummingbird-dashboard\n/metrics :8080"]
    status["hummingbird-status worker\n/metrics :9090"]
    agent["hummingbird-agent\n/metrics :9090"]
    ksm["kube-state-metrics\n/metrics :8080"]
    creds["credential-metrics sidecar\n/credentials.prom :8080"]
    otel["OTel Collector\nprometheus receiver\n+ sigv4auth"]
  end

  subgraph aws [AWS us-east-1]
    cwOtlp["CloudWatch OTLP Endpoint"]
    cwAlarm["PromQL Alarms"]
    cwDash["PromQL Dashboards"]
    sns["SNS Topic"]
    email["email alert group"]
  end

  dashboard --> otel
  status --> otel
  agent --> otel
  ksm --> otel
  creds --> otel
  otel -->|"OTLP/HTTP + SigV4"| cwOtlp
  cwOtlp --> cwAlarm
  cwOtlp --> cwDash
  cwAlarm --> sns
  sns --> email

Key properties:

Pull-based: the OTel Collector scrapes Prometheus /metrics endpoints every 60 seconds using Kubernetes service discovery
Services opt in via prometheus.io/* annotations on their K8s Service
Metrics retain their original Prometheus names in CloudWatch (no normalization)
Each cluster's collector adds an external label named cluster so metrics from different clusters are distinguishable
All 6 OCP clusters run the collector; credential metrics are emitted only from mpp-prod, because that data is repo-wide rather than per-cluster
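
The annotation-based opt-in can be pictured with a small Python sketch. Only the three prometheus.io/* annotation keys come from this document; the function name `scrape_url`, the fallback port and path, and the example address are illustrative assumptions, not the collector's actual relabeling rules.

```python
from typing import Mapping, Optional


def scrape_url(annotations: Mapping[str, str], address: str) -> Optional[str]:
    """Return the URL a scraper would poll, or None if the Service has not opted in."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Assumption: these fallbacks are for illustration only.
    port = annotations.get("prometheus.io/port", "8080")
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{address}:{port}{path}"


annotated = {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "9090",
    "prometheus.io/path": "/metrics",
}
print(scrape_url(annotated, "10.0.0.12"))  # opted in
print(scrape_url({}, "10.0.0.12"))         # no annotations, so not scraped
```

A Service without the scrape annotation is simply skipped, which is what makes the mechanism opt-in.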

Metrics Pipeline

Application Metrics

Source	Metrics	Port	Key series
hummingbird-dashboard	HTTP requests, latency, DB pool	8080	`http_requests_total`, `http_request_duration_seconds`, `db_pool_connections`
hummingbird-status worker	SQS processing, ingestion	9090	`sqs_messages_processed_total`, `ingest_records_total`
hummingbird-agent	LLM cost/tokens, workflows, events, SQS	9090	`hummingbird_agent_llm_*`, `hummingbird_agent_workflow_*`, `hummingbird_agent_events_total`

Infrastructure Metrics

Source	Metrics	Port	Key series
kube-state-metrics	K8s object state	8080	`kube_deployment_*`, `kube_pod_*`, `kube_statefulset_*`, `kube_job_*`
credential-metrics sidecar	Token expiry/age	8080	`cki_token_expires_at`, `cki_token_created_at`

On-Cluster Components

OTel Collector (kubernetes/otel-collector/ in infrastructure repo):

Contrib distribution (ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib)
Uses prometheus receiver with kubernetes_sd_configs (role: endpoints) for annotation-based auto-discovery across all Hummingbird namespaces
sigv4auth extension for IAM authentication to CloudWatch OTLP
metricstarttimeprocessor (true_reset_point) to satisfy CloudWatch’s StartTimeUnixNano requirement
metric_relabel_configs drops *_created gauges (invalid OTLP from prometheus_client)
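
As a rough illustration of the `*_created` drop rule, the Python sketch below filters series names the way an anchored relabel regex would. The helper name `keep_series` and the sample names are assumptions made for the example; the real rule lives in the collector's metric_relabel_configs.

```python
import re

# Prometheus relabel regexes are fully anchored, so fullmatch mirrors that.
CREATED_SERIES = re.compile(r".*_created")


def keep_series(name: str) -> bool:
    """Return False for series that the drop rule would remove."""
    return CREATED_SERIES.fullmatch(name) is None


names = ["http_requests_total", "http_requests_created", "cki_token_created_at"]
kept = [n for n in names if keep_series(n)]
print(kept)
```

Because the match is anchored, `cki_token_created_at` survives: it contains `_created` but does not end with it.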

kube-state-metrics (kubernetes/kube-state-metrics/ in infrastructure repo):

Image: quay.io/cki/mirror/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.17.0
Monitors all non-config Hummingbird namespaces (hummingbird--internal, hummingbird--runner, hummingbird--agent-sandbox, hummingbird--factory-sandbox, hummingbird--k8s-test, hummingbird--runner-jobs)
Scoped via --namespaces flag and per-namespace RBAC (Role + RoleBinding)
--metric-allowlist limits cardinality to deployment, pod, statefulset, job, and container status metrics

Credential metrics sidecar (nginx container in OTel Collector pod):

Serves a static Prometheus text file from a ConfigMap volume
ConfigMap populated by credential-metrics/deploy.sh CI job on merge to main (mpp-prod only)
Generated from secrets.yml metadata via cki_tools.credentials.metrics
Volume uses optional: true so the collector starts even without the ConfigMap
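
For orientation, here is a minimal Python sketch of what generating such a Prometheus text file could look like. It is not the cki_tools.credentials.metrics implementation: the function name, the `name` label, the token record layout, and the use of Unix-second values are all assumptions.

```python
from datetime import datetime, timezone


def render_credential_metrics(tokens):
    """Render token metadata in the Prometheus text exposition format."""
    lines = []
    for metric, field in (
        ("cki_token_expires_at", "expires_at"),
        ("cki_token_created_at", "created_at"),
    ):
        lines.append(f"# TYPE {metric} gauge")
        for token in tokens:
            # Assumption: values are Unix timestamps in seconds.
            value = int(token[field].timestamp())
            lines.append(f'{metric}{{name="{token["name"]}"}} {value}')
    return "\n".join(lines) + "\n"


tokens = [
    {
        "name": "example-token",  # hypothetical token name
        "created_at": datetime(2025, 1, 1, tzinfo=timezone.utc),
        "expires_at": datetime(2026, 1, 1, tzinfo=timezone.utc),
    }
]
print(render_credential_metrics(tokens), end="")
```

Only names and timestamps appear in the output, which matches the point that the ConfigMap carries metadata and no secret values.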

Prerequisites

AWS CLI configured with appropriate credentials (IAM permissions for SNS, CloudFormation)
Podman or Docker (for containerized SAM build/deploy)

Deployment

Build and deploy using containerized AWS SAM CLI:

cd hummingbird-monitoring
make build     # Build SAM application
make deploy    # First deployment (interactive/guided)
make redeploy  # Subsequent deployments (non-interactive)

Deployment outputs:

AlertsTopicArn - SNS topic ARN (for consumer stacks)
AlertsTopicName - SNS topic name
ServiceHealthDashboardUrl - Service Health dashboard URL
AgentOperationsDashboardUrl - Agent Operations dashboard URL
K8sHealthDashboardUrl - Kubernetes Health dashboard URL
CredentialExpiryDashboardUrl - Credential Expiry dashboard URL

Parameters

Parameter	Description	Default
`TopicName`	SNS topic name	`myapp-prod-alerts`
`AlertEmail`	Email for notifications (empty=skip)	`""`

When AlertEmail is non-empty, an SNS email subscription is created. The recipient must confirm via a link in the confirmation email before notifications start flowing.

PromQL Alarms

PromQL alarms evaluate metrics ingested via the CloudWatch OTLP endpoint (pushed by the OTel Collector running in Kubernetes). They use duration-based pending/recovery periods rather than traditional datapoint evaluation.
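
One way to read "duration-based pending/recovery periods" is as a small state machine: the alarm flips only after the opposite condition has held continuously for the configured duration. The Python sketch below models that reading. The class and method names are hypothetical, and the exact CloudWatch semantics during preview may differ (for example at the boundary).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DurationAlarm:
    pending_s: int                 # how long the condition must hold before ALARM
    recovery_s: int                # how long it must be clear before OK
    state: str = "OK"
    _since: Optional[int] = None   # when the state-contradicting condition began

    def observe(self, now: int, breaching: bool) -> str:
        """Feed one evaluation result (timestamp in seconds) and return the state."""
        contradicts = breaching if self.state == "OK" else not breaching
        if not contradicts:
            self._since = None     # streak broken, start over
            return self.state
        if self._since is None:
            self._since = now
        needed = self.pending_s if self.state == "OK" else self.recovery_s
        if now - self._since >= needed:
            self.state = "ALARM" if self.state == "OK" else "OK"
            self._since = None
        return self.state


alarm = DurationAlarm(pending_s=300, recovery_s=300)
states = [alarm.observe(t, breaching=True) for t in range(0, 360, 60)]
print(states)
```

With a 60-second evaluation interval and a 300-second pending duration, the alarm in this model stays OK for the first five evaluations and flips on the sixth; a single clear evaluation in between resets the streak.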

Credential Expiry

Alarm	Condition	Eval Interval
`token-expiry`	Active token expires within 28 days	1 h
`token-age`	Active token older than 337 days	1 h
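
The two thresholds can be checked by hand with a short Python sketch. The function name and the example dates are assumptions, and the sketch ignores the active/inactive distinction. Note that 337 + 28 = 365, so for a token with a one-year lifetime (an assumption, not stated here) both conditions become true at roughly the same time.

```python
from datetime import datetime, timedelta, timezone

EXPIRY_WINDOW = timedelta(days=28)   # token-expiry threshold
MAX_AGE = timedelta(days=337)        # token-age threshold


def token_alarms(created_at, expires_at, now):
    """Return the names of the alarms whose condition holds at `now`."""
    firing = []
    if expires_at - now < EXPIRY_WINDOW:
        firing.append("token-expiry")
    if now - created_at > MAX_AGE:
        firing.append("token-age")
    return firing


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(token_alarms(created, expires, datetime(2025, 6, 1, tzinfo=timezone.utc)))
print(token_alarms(created, expires, datetime(2025, 12, 10, tzinfo=timezone.utc)))
```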

Service Health

Alarm	Condition	Eval Interval
`service-down`	Scrape target down for 5 min	60 s
`high-error-rate`	HTTP 5xx rate > 5% for 5 min	60 s
`sqs-processing-errors`	SQS processing errors for 5 min	60 s
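
To make the 5% threshold concrete, the Python sketch below computes a 5xx ratio from two snapshots of a request counter. It assumes `http_requests_total` carries a status-code label, which this document does not confirm; the function name and sample values are illustrative.

```python
def error_ratio(previous, current):
    """Share of 5xx responses among requests made between two counter snapshots.

    Both arguments map an HTTP status code (as a string) to a cumulative count.
    """
    deltas = {code: count - previous.get(code, 0) for code, count in current.items()}
    total = sum(deltas.values())
    errors = sum(delta for code, delta in deltas.items() if code.startswith("5"))
    return errors / total if total > 0 else 0.0


before = {"200": 1000, "500": 10}
after = {"200": 1190, "500": 30}
ratio = error_ratio(before, after)
print(f"{ratio:.1%}", "breaching" if ratio > 0.05 else "ok")
```

Here 20 of 210 new requests failed, so the ratio is above 5%. The alarm would still wait for the condition to persist for 5 minutes before notifying.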

Kubernetes Workload Health

Alarm	Condition	Eval Interval
`deployment-not-ready`	Zero ready replicas for 10 min	60 s
`statefulset-not-ready`	Zero ready replicas for 10 min	60 s
`pod-crash-looping`	CrashLoopBackOff for 15 min	60 s
`pod-oom-killed`	OOM-killed with restarts in 15 min	60 s
`image-pull-failed`	ImagePullBackOff for 10 min	60 s
`cronjob-failed`	CronJob-owned Job failed	5 min

Agent Health

Alarm	Condition	Eval Interval
`agent-high-llm-error-rate`	LLM error rate > 10% over 15 min	60 s
`agent-high-cost`	Projected daily LLM spend > $50	1 h
`agent-event-backlog`	Event processing errors for 5 min	60 s
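
The "projected daily spend" condition can be illustrated by extrapolating a recent cost delta to 24 hours. How the real alarm projects spend is not specified here, so the linear extrapolation, the function name, and the sample figures below are assumptions.

```python
SECONDS_PER_DAY = 86_400
DAILY_BUDGET_USD = 50.0  # threshold from the agent-high-cost alarm


def projected_daily_spend(cost_delta_usd: float, window_s: int) -> float:
    """Linearly extrapolate the spend seen in one window to a full day."""
    return cost_delta_usd * (SECONDS_PER_DAY / window_s)


projection = projected_daily_spend(cost_delta_usd=2.50, window_s=3600)
print(projection, projection > DAILY_BUDGET_USD)
```

Spending $2.50 in one hour projects to $60 per day, which would exceed the $50 threshold.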

All alarms notify the shared SNS alerts topic on both ALARM and OK transitions.

Dashboards

Four CloudWatch dashboards provide at-a-glance operational visibility using PromQL queries against OTLP-ingested metrics. Dashboards refresh automatically and cost $3/dashboard/month ($12/month total).

Dashboard	Purpose
`${stack}-service-health`	Request rates, error rates, latency, targets
`${stack}-agent-operations`	LLM cost, tokens, workflows, events, errors
`${stack}-k8s-health`	Deployments, StatefulSets, restarts, CronJobs
`${stack}-credential-expiry`	Token expiry timelines and age tracking

Service Health Dashboard

HTTP request rate by service
5xx error rate by service
Request latency percentiles (p50/p95/p99)
Scrape target up/down status
SQS processing rate by status
Database connection pool usage

Agent Operations Dashboard

LLM cost rate by model ($/hr)
Token usage by model and direction (tokens/hr)
Workflow duration p95 by workflow type
Event handling rate by action
LLM request error rate
Sandbox acquire latency p95

Kubernetes Health Dashboard

Deployment ready replica counts
StatefulSet ready replica counts
Pod restart rate per hour
Containers in waiting state (by reason)
Failed CronJobs

Credential Expiry Dashboard

Days until nearest token expiry (minimum)
Count of tokens expiring within 28 days
Oldest token age
Per-token expiry timeline
Per-token age timeline

Operations

Adding a New Scrape Target

To have the OTel Collector scrape a new service:

Ensure the application exposes a Prometheus /metrics endpoint

Add annotations to the K8s Service (in the infrastructure repo):

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"

Deploy — the collector auto-discovers annotated services via kubernetes_sd_configs

Adding a New PromQL Alarm

Add an AWS::CloudWatch::Alarm resource to hummingbird-monitoring/template.yaml:

MyNewAlarm:
  Type: AWS::CloudWatch::Alarm
  Metadata:
    cfn-lint:
      config:
        ignore_checks: [E3002, E3014]
  Properties:
    AlarmName: !Sub ${AWS::StackName}-my-new-alarm
    AlarmDescription: Description of what this detects
    ActionsEnabled: true
    AlarmActions: [!Ref AlertsTopic]
    OKActions: [!Ref AlertsTopic]
    EvaluationCriteria:
      PromQLCriteria:
        PromQLExpression: |
          some_promql_expression > threshold
        EvaluationInterval: 60
        AlertOnMissing: false
        PendingDuration: 300
        RecoveryDuration: 300

The cfn-lint metadata ignore is required because cfn-lint does not yet recognize EvaluationCriteria.PromQLCriteria (CloudWatch preview feature).

Dashboard widgets use the undocumented "type": "chart" format with PromQL:

{
  "type": "chart",
  "x": 0, "y": 0, "width": 12, "height": 6,
  "properties": {
    "title": "Widget Title",
    "view": "line",
    "region": "us-east-1",
    "data": {
      "queries": [{
        "id": "q1",
        "type": "cloudwatch-metrics",
        "language": "PromQL",
        "query": "sum(rate(my_metric_total[5m])) by (label)",
        "step": 60
      }]
    }
  }
}
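
When several widgets share the same shape, it can help to generate them instead of writing each by hand. The Python sketch below builds the structure shown above; the helper names are hypothetical, and the two-column layout assumes the 24-unit-wide CloudWatch dashboard grid.

```python
import json


def grid_position(index: int, width: int = 12, height: int = 6):
    """Place widgets left to right, two per row, on a 24-unit-wide grid."""
    per_row = 24 // width
    return {"x": (index % per_row) * width, "y": (index // per_row) * height}


def promql_widget(index: int, title: str, query: str, region: str = "us-east-1"):
    """Build one PromQL chart widget in the format shown above."""
    return {
        "type": "chart",
        **grid_position(index),
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "view": "line",
            "region": region,
            "data": {
                "queries": [{
                    "id": "q1",
                    "type": "cloudwatch-metrics",
                    "language": "PromQL",
                    "query": query,
                    "step": 60,
                }]
            },
        },
    }


widgets = [
    promql_widget(0, "Request rate", "sum(rate(http_requests_total[5m])) by (service)"),
    promql_widget(1, "Targets up", "up"),
    promql_widget(2, "Ingest rate", "sum(rate(ingest_records_total[5m]))"),
]
print(json.dumps({"widgets": widgets}, indent=2))
```

The `service` label in the first query is an assumption; substitute whatever labels the metric actually carries.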

Troubleshooting

Check target health: Query the up metric in CloudWatch Query Studio. Each scraped target appears as a time series with value 1 (up) or 0 (down)
Check collector logs: oc logs deployment/otel-collector -n hummingbird--internal — look for exporter errors or scrape failures
Verify metric names: Use CloudWatch Query Studio with a simple PromQL query like {__name__=~"http.*"} to discover available metrics
Stale credential metrics: Re-run the credential_metrics CI job (or merge any change to secrets.yml) to regenerate the ConfigMap
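
When writing a discovery query, keep in mind that PromQL regex matchers are fully anchored: `http.*` matches names that start with `http`, while a bare `http` matches only a metric named exactly `http`. The Python sketch below imitates that behaviour offline; the helper name is hypothetical, and the metric names are taken from the tables above.

```python
import re


def matching_names(names, pattern):
    """Return the names a PromQL matcher {__name__=~"pattern"} would select."""
    regex = re.compile(pattern)
    # fullmatch reproduces PromQL's implicit ^...$ anchoring.
    return sorted(name for name in names if regex.fullmatch(name))


known = [
    "http_requests_total",
    "http_request_duration_seconds",
    "db_pool_connections",
    "sqs_messages_processed_total",
]
print(matching_names(known, "http.*"))
print(matching_names(known, "http"))
print(matching_names(known, ".*_total"))
```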

Consumer Stacks

Each consumer stack accepts an AlertTopicArn parameter. When non-empty, CloudWatch alarms publish to the shared topic on state transitions. When empty (the default), alarms still fire but do not send notifications.

Stack	Alarms
`container-catalog`	Lambda errors, API 5xx, DLQ depth, metrics iterator age
`hummingbird-agent`	DLQ depth (events, work)
`hummingbird-status`	DLQ depth (events)

Wiring Pattern

Consumer stacks use this pattern to conditionally wire alarms:

Parameters:
  AlertTopicArn:
    Type: String
    Default: ""

Conditions:
  HasAlertTopicArn: !Not [!Equals [!Ref AlertTopicArn, ""]]

Resources:
  MyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      # ... alarm configuration ...
      AlarmActions: !If [HasAlertTopicArn, [!Ref AlertTopicArn], !Ref "AWS::NoValue"]
      OKActions: !If [HasAlertTopicArn, [!Ref AlertTopicArn], !Ref "AWS::NoValue"]
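
The effect of the `!If` / `AWS::NoValue` pair can be mimicked in plain Python to make the behaviour explicit. This is a conceptual model only, not how CloudFormation is implemented; the function name and the example ARN are made up.

```python
from typing import Optional


def alarm_properties(alert_topic_arn: str) -> dict:
    """Model the conditional wiring: omit the action keys when no ARN is given."""
    actions: Optional[list] = [alert_topic_arn] if alert_topic_arn != "" else None
    properties = {
        "AlarmActions": actions,
        "OKActions": actions,
    }
    # AWS::NoValue removes the property entirely; dropping None keys models that.
    return {key: value for key, value in properties.items() if value is not None}


example_arn = "arn:aws:sns:us-east-1:123456789012:example-alerts"  # placeholder
print(alarm_properties(example_arn))
print(alarm_properties(""))
```

With an empty ARN the alarm is still created and still changes state; it just has no actions attached.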

Adding a New Alarm

Define the AWS::CloudWatch::Alarm resource in the stack’s template.yaml with AlarmActions / OKActions using the pattern above.
The stack must already have the AlertTopicArn parameter and HasAlertTopicArn condition (all three consumer stacks already do).

Adding a New Consumer Stack

Add AlertTopicArn parameter and HasAlertTopicArn condition to the stack’s template.yaml.
Add ALERT_TOPIC_ARN to the stack’s vars.sh in the infrastructure repo.
Add AlertTopicArn to parameter_overrides in the stack’s samconfig.toml.j2.

Decision Records

DR-1: CloudWatch OTLP over self-hosted Prometheus or AMP

We chose the CloudWatch OTLP endpoint (public preview, April 2026) over deploying a full Prometheus + AlertManager + Grafana stack or Amazon Managed Prometheus (AMP). This unifies all alerting — existing CloudWatch Alarms (Lambda errors, DLQ depth, SLO) and new PromQL alarms — under one system with one notification path (SNS). No new managed services to provision. The PromQL subset is sufficient for our alert rules. Preview status is acceptable at our scale (~400-600 series, 6 clusters): if the preview breaks, existing alarms are unaffected and the fallback path (switch to AMP) only requires changing the OTel Collector exporter config.

DR-2: OTel Collector over CloudWatch Agent or Prometheus agent

The OTel Collector (contrib distribution) is the only tool that natively bridges Prometheus scraping to CloudWatch OTLP in a single binary. The CloudWatch Agent drops histograms and is limited to 30 dimensions. A Prometheus agent requires AMP as intermediary (remote_write targets AMP, not CloudWatch OTLP). The OTel Collector handles all scrape targets uniformly and its exporter is the only config that changes if we switch backends.

DR-3: kube-state-metrics over OTel k8s_cluster receiver

kube-state-metrics (KSM) exposes kube_* Prometheus metrics — the industry standard used by every community alert rule (kubernetes-mixin, CKI). The OTel Collector k8s_cluster receiver emits different metric names (k8s.deployment.available etc.), has documented parity gaps, and is still beta. Using KSM means all PromQL alert rules from the community work without translation.

DR-4: Credential metrics via ConfigMap and sidecar

Credential metadata (expiry dates, creation dates from secrets.yml) rarely changes, but PromQL requires recent data points within its lookback window. A CI-direct push approach would need hourly scheduled pipelines for data that changes a few times per year. Instead, CI generates a Prometheus text file on merge-to-main and applies it as a K8s ConfigMap. An nginx sidecar serves it over HTTP. The collector scrapes it every 60 seconds, which eliminates staleness without scheduled pipelines. The ConfigMap contains only metadata (names, timestamps), not secret values.

DR-5: CloudWatch dashboards over Grafana

CloudWatch dashboards are native to AWS, require zero new infrastructure, and are defined as AWS::CloudWatch::Dashboard resources in SAM alongside alarms. This aligns with the “minimize new infrastructure” principle. Grafana (AMG or self-hosted) is documented as an upgrade path — the OTel Collector architecture does not change, only the query consumer differs.

DR-6: Prometheus client library over OTel SDK

All Hummingbird services use prometheus_client (Python) for instrumentation, matching the existing dashboard and status worker pattern. The OTel Collector already scrapes Prometheus endpoints natively. Migration to the OTel Python SDK is mechanical (Counter to Counter, Gauge to ObservableGauge, Histogram to Histogram) and can happen per-service when desired. The collector supports both prometheus and otlp receivers, so the architecture does not change either way.

Future Upgrades

Grafana: If richer visualization is needed, Grafana can use CloudWatch’s PromQL endpoint as a Prometheus datasource with SigV4 authentication. The OTel Collector architecture stays the same — only the dashboard platform changes. Dashboard JSON definitions can be version-controlled in the repo.

OTel SDK migration: New services can use the OpenTelemetry Python SDK, together with instrumentation packages such as opentelemetry-instrumentation-fastapi, for vendor-neutral instrumentation and trace-metric correlation. Existing prometheus_client services continue working unchanged.

Post-preview pricing: CloudWatch OTLP is free during preview. Post-GA pricing is TBD. With ~400-600 series at 60s scrape (~17-26M samples/month), estimated cost at AMP-equivalent rates would be ~$50-80/month. The KSM --metric-allowlist keeps cardinality bounded.
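
The sample-volume range quoted above follows directly from the series count and scrape interval. A short Python check, assuming a 30-day month and one sample per series per scrape (the function name is illustrative):

```python
def samples_per_month(series: int, scrape_interval_s: int = 60, days: int = 30) -> int:
    """Samples ingested per month if every series yields one sample per scrape."""
    scrapes_per_day = 86_400 // scrape_interval_s
    return series * scrapes_per_day * days


low = samples_per_month(400)
high = samples_per_month(600)
print(f"{low / 1e6:.2f}M to {high / 1e6:.2f}M samples/month")
```

Doubling the scrape interval would halve these numbers, and each extra series adds 43,200 samples per month at the current interval.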

Development

This is a pure infrastructure project (no application code). See the main README for SAM build/deploy commands.

License

This project is licensed under the GNU General Public License v3.0 or later - see the LICENSE file for details.