Hummingbird Monitoring

Shared observability resources for all Hummingbird AWS stacks. Provides a central SNS alerts topic, PromQL-based CloudWatch alarms, operational dashboards for metrics ingested via the OTLP endpoint, and IAM resources for the OpenTelemetry Collector.

This stack handles infrastructure alerting (Lambda errors, DLQ depth, API 5xx). For SLO alerting (CVE exposure burn-rate), see Error Budgets.

Features

  • Central Alerts Topic: Single SNS topic shared by all Hummingbird stacks
  • PromQL Alarms: 14 alarms evaluating Prometheus metrics in CloudWatch (credential expiry, service health, K8s workload, agent health)
  • Operational Dashboards: 4 dashboards visualizing OTLP metrics via PromQL (service health, agent operations, K8s health, credential expiry)
  • Email Subscription: Optional email subscription with automatic confirmation workflow
  • Opt-in per Stack: Consumer stacks receive the topic ARN as a parameter; alarms are wired only when the ARN is non-empty

Architecture

Metrics flow from Kubernetes services through an OpenTelemetry Collector to CloudWatch, where PromQL alarms and dashboards evaluate them:

flowchart TD
  subgraph k8s [OCP Clusters]
    dashboard["hummingbird-dashboard\n/metrics :8080"]
    status["hummingbird-status worker\n/metrics :9090"]
    agent["hummingbird-agent\n/metrics :9090"]
    ksm["kube-state-metrics\n/metrics :8080"]
    creds["credential-metrics sidecar\n/credentials.prom :8080"]
    otel["OTel Collector\nprometheus receiver\n+ sigv4auth"]
  end

  subgraph aws [AWS us-east-1]
    cwOtlp["CloudWatch OTLP Endpoint"]
    cwAlarm["PromQL Alarms"]
    cwDash["PromQL Dashboards"]
    sns["SNS Topic"]
    email["email alert group"]
  end

  dashboard --> otel
  status --> otel
  agent --> otel
  ksm --> otel
  creds --> otel
  otel -->|"OTLP/HTTP + SigV4"| cwOtlp
  cwOtlp --> cwAlarm
  cwOtlp --> cwDash
  cwAlarm --> sns
  sns --> email

Key properties:

  • Pull-based: the OTel Collector scrapes Prometheus /metrics endpoints every 60 seconds using Kubernetes service discovery
  • Services opt in via prometheus.io/* annotations on their K8s Service
  • Metrics retain their original Prometheus names in CloudWatch (no normalization)
  • Each cluster adds a cluster external label so metrics are distinguishable
  • All 6 OCP clusters run the collector; credential metrics are emitted only from mpp-prod, because that data is repo-wide rather than per-cluster

Metrics Pipeline

Application Metrics

| Source | Metrics | Port | Key series |
|---|---|---|---|
| hummingbird-dashboard | HTTP requests, latency, DB pool | 8080 | http_requests_total, http_request_duration_seconds, db_pool_connections |
| hummingbird-status worker | SQS processing, ingestion | 9090 | sqs_messages_processed_total, ingest_records_total |
| hummingbird-agent | LLM cost/tokens, workflows, events, SQS | 9090 | hummingbird_agent_llm_*, hummingbird_agent_workflow_*, hummingbird_agent_events_total |

Infrastructure Metrics

| Source | Metrics | Port | Key series |
|---|---|---|---|
| kube-state-metrics | K8s object state | 8080 | kube_deployment_*, kube_pod_*, kube_statefulset_*, kube_job_* |
| credential-metrics sidecar | Token expiry/age | 8080 | cki_token_expires_at, cki_token_created_at |

On-Cluster Components

OTel Collector (kubernetes/otel-collector/ in infrastructure repo):

  • Contrib distribution (ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib)
  • Uses prometheus receiver with kubernetes_sd_configs (role: endpoints) for annotation-based auto-discovery across all Hummingbird namespaces
  • sigv4auth extension for IAM authentication to CloudWatch OTLP
  • metricstarttimeprocessor (true_reset_point) to satisfy CloudWatch’s StartTimeUnixNano requirement
  • metric_relabel_configs drops *_created gauges (invalid OTLP from prometheus_client)

kube-state-metrics (kubernetes/kube-state-metrics/ in infrastructure repo):

  • Image: quay.io/cki/mirror/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.17.0
  • Monitors all non-config Hummingbird namespaces (hummingbird--internal, hummingbird--runner, hummingbird--agent-sandbox, hummingbird--factory-sandbox, hummingbird--k8s-test, hummingbird--runner-jobs)
  • Scoped via --namespaces flag and per-namespace RBAC (Role + RoleBinding)
  • --metric-allowlist limits cardinality to deployment, pod, statefulset, job, and container status metrics

Credential metrics sidecar (nginx container in OTel Collector pod):

  • Serves a static Prometheus text file from a ConfigMap volume
  • ConfigMap populated by credential-metrics/deploy.sh CI job on merge to main (mpp-prod only)
  • Generated from secrets.yml metadata via cki_tools.credentials.metrics
  • Volume uses optional: true so the collector starts even without the ConfigMap
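
To make the sidecar's payload concrete, here is a minimal sketch of rendering token metadata into the Prometheus text format. It is not the cki_tools.credentials.metrics implementation, which is not shown here. The input dictionary keys, the `name` label, and the example token are assumptions for illustration; only the two metric names come from the table above.

```python
from datetime import datetime, timezone


def render_credential_metrics(tokens):
    """Render token metadata as Prometheus text exposition lines.

    `tokens` is a list of dicts with hypothetical keys `name`, `created_at`
    and `expires_at` (timezone-aware datetimes). Only metadata is written,
    never a secret value.
    """
    lines = []
    for metric, key, help_text in (
        ("cki_token_expires_at", "expires_at", "Token expiry as a Unix timestamp."),
        ("cki_token_created_at", "created_at", "Token creation as a Unix timestamp."),
    ):
        lines.append(f"# HELP {metric} {help_text}")
        lines.append(f"# TYPE {metric} gauge")
        for token in tokens:
            name = token["name"]  # assumed label; the real label set may differ
            value = int(token[key].timestamp())
            lines.append(f'{metric}{{name="{name}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical token, used only to show the output shape.
EXAMPLE_TOKENS = [
    {
        "name": "example-deploy-token",
        "created_at": datetime(2025, 1, 1, tzinfo=timezone.utc),
        "expires_at": datetime(2026, 1, 1, tzinfo=timezone.utc),
    }
]

if __name__ == "__main__":
    print(render_credential_metrics(EXAMPLE_TOKENS), end="")
```

Because the values are absolute timestamps, the file only needs regenerating when secrets.yml changes; the remaining lifetime can be derived at query time by subtracting the current time.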

Prerequisites

  • AWS CLI configured with appropriate credentials (IAM permissions for CloudFormation and SNS, plus CloudWatch and IAM, since the stack also creates alarms, dashboards, and IAM resources)
  • Podman or Docker (for containerized SAM build/deploy)

Deployment

Build and deploy using containerized AWS SAM CLI:

cd hummingbird-monitoring
make build     # Build SAM application
make deploy    # First deployment (interactive/guided)
make redeploy  # Subsequent deployments (non-interactive)

Deployment outputs:

  • AlertsTopicArn - SNS topic ARN (for consumer stacks)
  • AlertsTopicName - SNS topic name
  • ServiceHealthDashboardUrl - Service Health dashboard URL
  • AgentOperationsDashboardUrl - Agent Operations dashboard URL
  • K8sHealthDashboardUrl - Kubernetes Health dashboard URL
  • CredentialExpiryDashboardUrl - Credential Expiry dashboard URL

Parameters

| Parameter | Description | Default |
|---|---|---|
| TopicName | SNS topic name | myapp-prod-alerts |
| AlertEmail | Email for notifications (empty=skip) | "" |

When AlertEmail is non-empty, an SNS email subscription is created. The recipient must confirm via a link in the confirmation email before notifications start flowing.

PromQL Alarms

PromQL alarms evaluate metrics ingested via the CloudWatch OTLP endpoint (pushed by the OTel Collector running in Kubernetes). They use duration-based pending/recovery periods rather than traditional datapoint evaluation.
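
The duration-based behaviour described above can be modelled as a small state machine. This is a sketch of one plausible reading, not CloudWatch's documented algorithm: it assumes each evaluation covers one interval, that the alarm enters ALARM once the expression has been breaching continuously for the pending duration, and that it returns to OK once it has been clear continuously for the recovery duration. The function and argument names are illustrative.

```python
def alarm_states(samples, interval, pending, recovery):
    """Return the alarm state after each evaluation.

    samples:  list of booleans, True when the PromQL expression breached
              at that evaluation.
    interval: seconds between evaluations.
    pending:  seconds of continuous breach required to enter ALARM.
    recovery: seconds of continuous clear required to return to OK.
    """
    state = "OK"
    streak = 0  # seconds the condition needed to flip state has held
    states = []
    for breaching in samples:
        if state == "OK":
            # A single clear evaluation resets the pending timer.
            streak = streak + interval if breaching else 0
            if streak >= pending:
                state, streak = "ALARM", 0
        else:
            # A single breaching evaluation resets the recovery timer.
            streak = streak + interval if not breaching else 0
            if streak >= recovery:
                state, streak = "OK", 0
        states.append(state)
    return states


if __name__ == "__main__":
    # Four breaches, one clear (resets the timer), five breaches, five clears.
    history = [True] * 4 + [False] + [True] * 5 + [False] * 5
    print(alarm_states(history, interval=60, pending=300, recovery=300))
```

Under these assumptions a brief blip shorter than the pending duration never notifies, which is the main practical difference from counting breaching datapoints.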

Credential Expiry

| Alarm | Condition | Eval Interval |
|---|---|---|
| token-expiry | Active token expires within 28 days | 1 h |
| token-age | Active token older than 337 days | 1 h |
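
The two credential thresholds above can be expressed as plain timestamp arithmetic. The sketch below is illustrative only: the real alarms are PromQL expressions in template.yaml, and the function name and return shape are assumptions. Note that 337 + 28 = 365, so for a token with a one-year lifetime both conditions become true at about the same moment.

```python
DAY = 24 * 3600

EXPIRY_WINDOW_DAYS = 28   # token-expiry threshold from the table above
MAX_AGE_DAYS = 337        # token-age threshold from the table above


def credential_alerts(expires_at, created_at, now):
    """Return the names of the credential alarms that would be breaching.

    All arguments are Unix timestamps in seconds, matching the values of
    cki_token_expires_at and cki_token_created_at.
    """
    alerts = []
    if expires_at - now < EXPIRY_WINDOW_DAYS * DAY:
        alerts.append("token-expiry")
    if now - created_at > MAX_AGE_DAYS * DAY:
        alerts.append("token-age")
    return alerts


if __name__ == "__main__":
    now = 20_000 * DAY
    # A one-year token created 340 days ago expires in 25 days.
    print(credential_alerts(now + 25 * DAY, now - 340 * DAY, now))
```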

Service Health

| Alarm | Condition | Eval Interval |
|---|---|---|
| service-down | Scrape target down for 5 min | 60 s |
| high-error-rate | HTTP 5xx rate > 5% for 5 min | 60 s |
| sqs-processing-errors | SQS processing errors for 5 min | 60 s |
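
As a worked example of the high-error-rate condition, the sketch below computes the 5xx share of requests from per-status counter increases over a window. It mirrors what a ratio of two rate() sums over http_requests_total would compute, but the exact alarm expression and the presence of a status-code label are assumptions, not taken from the template.

```python
ERROR_RATE_THRESHOLD = 0.05  # "HTTP 5xx rate > 5%" from the table above


def error_ratio(increase_by_status):
    """Fraction of requests in the window that returned a 5xx status.

    `increase_by_status` maps a status code string (assumed label) to the
    increase of the request counter for that status over the window.
    """
    total = sum(increase_by_status.values())
    if total == 0:
        return 0.0  # no traffic is treated as no errors
    errors = sum(
        count for status, count in increase_by_status.items()
        if status.startswith("5")
    )
    return errors / total


def breaches(increase_by_status):
    """True when the error ratio is strictly above the threshold."""
    return error_ratio(increase_by_status) > ERROR_RATE_THRESHOLD


if __name__ == "__main__":
    window = {"200": 930, "404": 10, "500": 45, "503": 15}
    print(error_ratio(window), breaches(window))
```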

Kubernetes Workload Health

| Alarm | Condition | Eval Interval |
|---|---|---|
| deployment-not-ready | Zero ready replicas for 10 min | 60 s |
| statefulset-not-ready | Zero ready replicas for 10 min | 60 s |
| pod-crash-looping | CrashLoopBackOff for 15 min | 60 s |
| pod-oom-killed | OOM-killed with restarts in 15 min | 60 s |
| image-pull-failed | ImagePullBackOff for 10 min | 60 s |
| cronjob-failed | CronJob-owned Job failed | 5 min |

Agent Health

| Alarm | Condition | Eval Interval |
|---|---|---|
| agent-high-llm-error-rate | LLM error rate > 10% over 15 min | 60 s |
| agent-high-cost | Projected daily LLM spend > $50 | 1 h |
| agent-event-backlog | Event processing errors for 5 min | 60 s |
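
The agent-high-cost condition is a projection rather than a measured total. One simple way to read it, sketched below, is to extrapolate the spend observed over the last hour to a full day. The one-hour window is an assumption chosen to match the 1 h evaluation interval; the actual alarm expression may project differently.

```python
DAILY_SPEND_THRESHOLD_USD = 50.0  # from the agent-high-cost row above


def projected_daily_spend(cost_last_hour_usd):
    """Extrapolate the last hour's LLM spend to 24 hours."""
    return cost_last_hour_usd * 24


def high_cost(cost_last_hour_usd):
    """True when the projected daily spend exceeds the threshold."""
    return projected_daily_spend(cost_last_hour_usd) > DAILY_SPEND_THRESHOLD_USD


if __name__ == "__main__":
    for hourly in (2.0, 2.5):
        print(hourly, projected_daily_spend(hourly), high_cost(hourly))
```

With this reading, a sustained spend of a little over $2.08 per hour is enough to cross the $50 threshold.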

All alarms notify the shared SNS alerts topic on both ALARM and OK transitions.

Dashboards

Four CloudWatch dashboards provide at-a-glance operational visibility using PromQL queries against OTLP-ingested metrics. Dashboards refresh automatically and cost $3/dashboard/month ($12/month total).

| Dashboard | Purpose |
|---|---|
| ${stack}-service-health | Request rates, error rates, latency, targets |
| ${stack}-agent-operations | LLM cost, tokens, workflows, events, errors |
| ${stack}-k8s-health | Deployments, StatefulSets, restarts, CronJobs |
| ${stack}-credential-expiry | Token expiry timelines and age tracking |

Service Health Dashboard

  • HTTP request rate by service
  • 5xx error rate by service
  • Request latency percentiles (p50/p95/p99)
  • Scrape target up/down status
  • SQS processing rate by status
  • Database connection pool usage

Agent Operations Dashboard

  • LLM cost rate by model ($/hr)
  • Token usage by model and direction (tokens/hr)
  • Workflow duration p95 by workflow type
  • Event handling rate by action
  • LLM request error rate
  • Sandbox acquire latency p95

Kubernetes Health Dashboard

  • Deployment ready replica counts
  • StatefulSet ready replica counts
  • Pod restart rate per hour
  • Containers in waiting state (by reason)
  • Failed CronJobs

Credential Expiry Dashboard

  • Days until nearest token expiry (minimum)
  • Count of tokens expiring within 28 days
  • Oldest token age
  • Per-token expiry timeline
  • Per-token age timeline

Operations

Adding a New Scrape Target

To have the OTel Collector scrape a new service:

  1. Ensure the application exposes a Prometheus /metrics endpoint

  2. Add annotations to the K8s Service (in the infrastructure repo):

    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    
  3. Deploy — the collector auto-discovers annotated services via kubernetes_sd_configs
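
To sanity-check a Service before deploying, the annotation convention in the steps above can be modelled as follows. This is a sketch of the common prometheus.io/* convention, not the collector's actual relabel rules, which live in the infrastructure repo and are not shown here. Requiring an explicit port and defaulting the path to /metrics are assumptions.

```python
def scrape_url(annotations, host):
    """Return the URL a scraper would use for a Service, or None.

    `annotations` is the Service's metadata.annotations mapping and `host`
    is the address of one endpoint behind it.
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None  # the service has not opted in
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None  # assumed: a port annotation is required
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{host}:{port}{path}"


if __name__ == "__main__":
    annotations = {
        "prometheus.io/scrape": "true",
        "prometheus.io/port": "9090",
        "prometheus.io/path": "/metrics",
    }
    print(scrape_url(annotations, "10.0.0.12"))
```

Annotation values are strings in Kubernetes, which is why the YAML above quotes "true" and "9090".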

Adding a New PromQL Alarm

Add an AWS::CloudWatch::Alarm resource to hummingbird-monitoring/template.yaml:

MyNewAlarm:
  Type: AWS::CloudWatch::Alarm
  Metadata:
    cfn-lint:
      config:
        ignore_checks: [E3002, E3014]
  Properties:
    AlarmName: !Sub ${AWS::StackName}-my-new-alarm
    AlarmDescription: Description of what this detects
    ActionsEnabled: true
    AlarmActions: [!Ref AlertsTopic]
    OKActions: [!Ref AlertsTopic]
    EvaluationCriteria:
      PromQLCriteria:
        PromQLExpression: |
          some_promql_expression > threshold
        EvaluationInterval: 60
        AlertOnMissing: false
        PendingDuration: 300
        RecoveryDuration: 300

The cfn-lint metadata ignore is required because cfn-lint does not yet recognize EvaluationCriteria.PromQLCriteria (CloudWatch preview feature).

Adding a New Dashboard Widget

Dashboard widgets use the undocumented "type": "chart" format with PromQL:

{
  "type": "chart",
  "x": 0, "y": 0, "width": 12, "height": 6,
  "properties": {
    "title": "Widget Title",
    "view": "line",
    "region": "us-east-1",
    "data": {
      "queries": [{
        "id": "q1",
        "type": "cloudwatch-metrics",
        "language": "PromQL",
        "query": "sum(rate(my_metric_total[5m])) by (label)",
        "step": 60
      }]
    }
  }
}
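
When a dashboard has many widgets, generating the JSON can be less error-prone than editing it by hand. The helper below is a sketch that reproduces the widget shape shown above; the function names are hypothetical, and because the "chart" format is undocumented, the field set is copied from the example rather than from a published schema.

```python
import json


def promql_widget(title, query, x=0, y=0, width=12, height=6,
                  region="us-east-1", step=60):
    """Build one PromQL chart widget in the format used by this stack."""
    return {
        "type": "chart",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "view": "line",
            "region": region,
            "data": {
                "queries": [{
                    "id": "q1",
                    "type": "cloudwatch-metrics",
                    "language": "PromQL",
                    "query": query,
                    "step": step,
                }]
            },
        },
    }


def dashboard_body(widgets):
    """Serialize widgets into a CloudWatch dashboard body string."""
    return json.dumps({"widgets": widgets}, indent=2)


if __name__ == "__main__":
    widget = promql_widget(
        "Widget Title", "sum(rate(my_metric_total[5m])) by (label)"
    )
    print(dashboard_body([widget]))
```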

Troubleshooting

  • Check target health: Query up in CloudWatch Query Studio — each scraped service appears as a time series with value 1 (up) or 0 (down)
  • Check collector logs: oc logs deployment/otel-collector -n hummingbird--internal — look for exporter errors or scrape failures
  • Verify metric names: Use CloudWatch Query Studio with a simple PromQL query like {__name__=~"http.*"} to discover available metrics
  • Stale credential metrics: Re-run the credential_metrics CI job (or merge any change to secrets.yml) to regenerate the ConfigMap

Consumer Stacks

Each consumer stack accepts an AlertTopicArn parameter. When it is non-empty, the stack’s CloudWatch alarms publish to the shared topic on state transitions. When it is empty (the default), the alarms still evaluate and change state but send no notifications.

| Stack | Alarms |
|---|---|
| container-catalog | Lambda errors, API 5xx, DLQ depth, metrics iterator age |
| hummingbird-agent | DLQ depth (events, work) |
| hummingbird-status | DLQ depth (events) |

Wiring Pattern

Consumer stacks use this pattern to conditionally wire alarms:

Parameters:
  AlertTopicArn:
    Type: String
    Default: ""

Conditions:
  HasAlertTopicArn: !Not [!Equals [!Ref AlertTopicArn, ""]]

Resources:
  MyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      # ... alarm configuration ...
      AlarmActions: !If [HasAlertTopicArn, [!Ref AlertTopicArn], !Ref "AWS::NoValue"]
      OKActions: !If [HasAlertTopicArn, [!Ref AlertTopicArn], !Ref "AWS::NoValue"]

Adding a New Alarm

  1. Confirm the stack has the AlertTopicArn parameter and HasAlertTopicArn condition (all three consumer stacks already do).
  2. Define the AWS::CloudWatch::Alarm resource in the stack’s template.yaml with AlarmActions / OKActions using the pattern above.

Adding a New Consumer Stack

  1. Add AlertTopicArn parameter and HasAlertTopicArn condition to the stack’s template.yaml.
  2. Add ALERT_TOPIC_ARN to the stack’s vars.sh in the infrastructure repo.
  3. Add AlertTopicArn to parameter_overrides in the stack’s samconfig.toml.j2.

Decision Records

DR-1: CloudWatch OTLP over self-hosted Prometheus or AMP

We chose the CloudWatch OTLP endpoint (public preview, April 2026) over deploying a full Prometheus + AlertManager + Grafana stack or Amazon Managed Prometheus (AMP). This unifies all alerting, both the existing CloudWatch Alarms (Lambda errors, DLQ depth, SLO) and the new PromQL alarms, under one system with one notification path (SNS), and there are no new managed services to provision. The PromQL subset is sufficient for our alert rules. Preview status is acceptable at our scale (~400-600 series, 6 clusters): if the preview breaks, existing alarms are unaffected, and the fallback path (switching to AMP) only requires changing the OTel Collector exporter config.

DR-2: OTel Collector over CloudWatch Agent or Prometheus agent

The OTel Collector (contrib distribution) is the only tool that natively bridges Prometheus scraping to CloudWatch OTLP in a single binary. The CloudWatch Agent drops histograms and is limited to 30 dimensions. A Prometheus agent requires AMP as intermediary (remote_write targets AMP, not CloudWatch OTLP). The OTel Collector handles all scrape targets uniformly and its exporter is the only config that changes if we switch backends.

DR-3: kube-state-metrics over OTel k8s_cluster receiver

kube-state-metrics (KSM) exposes kube_* Prometheus metrics — the industry standard used by every community alert rule (kubernetes-mixin, CKI). The OTel Collector k8s_cluster receiver emits different metric names (k8s.deployment.available etc.), has documented parity gaps, and is still beta. Using KSM means all PromQL alert rules from the community work without translation.

DR-4: Credential metrics via ConfigMap and sidecar

Credential metadata (expiry dates, creation dates from secrets.yml) rarely changes, but PromQL requires recent data points within its lookback window. A CI-direct push approach would need hourly scheduled pipelines for data that changes a few times per year. Instead, CI generates a Prometheus text file on merge to main and applies it as a K8s ConfigMap, and an nginx sidecar serves it over HTTP. The collector scrapes it every 60 seconds, which eliminates staleness without scheduled pipelines. The ConfigMap contains only metadata (names, timestamps), not secret values.

DR-5: CloudWatch dashboards over Grafana

CloudWatch dashboards are native to AWS, require zero new infrastructure, and are defined as AWS::CloudWatch::Dashboard resources in SAM alongside alarms. This aligns with the “minimize new infrastructure” principle. Grafana (AMG or self-hosted) is documented as an upgrade path — the OTel Collector architecture does not change, only the query consumer differs.

DR-6: Prometheus client library over OTel SDK

All Hummingbird services use prometheus_client (Python) for instrumentation, matching the existing dashboard and status worker pattern. The OTel Collector already scrapes Prometheus endpoints natively. Migration to the OTel Python SDK is mechanical (Counter to Counter, Gauge to ObservableGauge, Histogram to Histogram) and can happen per-service when desired. The collector supports both prometheus and otlp receivers, so the architecture does not change either way.

Future Upgrades

Grafana: If richer visualization is needed, Grafana can use CloudWatch’s PromQL endpoint as a Prometheus datasource with SigV4 authentication. The OTel Collector architecture stays the same — only the dashboard platform changes. Dashboard JSON definitions can be version-controlled in the repo.

OTel SDK migration: New services can use the OpenTelemetry Python SDK (opentelemetry-instrumentation-fastapi) for vendor-neutral instrumentation and trace-metric correlation. Existing prometheus_client services continue working unchanged.

Post-preview pricing: CloudWatch OTLP is free during preview. Post-GA pricing is TBD. With ~400-600 series at 60s scrape (~17-26M samples/month), estimated cost at AMP-equivalent rates would be ~$50-80/month. The KSM --metric-allowlist keeps cardinality bounded.
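
The sample-volume figure above follows directly from the series count and scrape interval. The short calculation below reproduces it, assuming a 30-day month; it deliberately stops at sample counts, since post-GA pricing is not yet known.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def samples_per_month(series, scrape_interval_s=60):
    """Samples ingested per month for a given number of active series."""
    scrapes_per_month = SECONDS_PER_MONTH // scrape_interval_s
    return series * scrapes_per_month


if __name__ == "__main__":
    for series in (400, 600):
        print(series, samples_per_month(series))
```

At a 60-second interval each series contributes 43,200 samples per month, so 400 to 600 series give roughly 17.3M to 25.9M samples, matching the estimate above.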

Development

This is a pure infrastructure project (no application code). See the main README for SAM build/deploy commands.

License

This project is licensed under the GNU General Public License v3.0 or later - see the LICENSE file for details.