Hummingbird Monitoring
Shared observability resources for all Hummingbird AWS stacks. Provides a central SNS alerts topic, PromQL-based CloudWatch alarms, operational dashboards for metrics ingested via the OTLP endpoint, and IAM resources for the OpenTelemetry Collector.
This stack handles infrastructure alerting (Lambda errors, DLQ depth, API 5xx). For SLO alerting (CVE exposure burn-rate), see Error Budgets.
Features
- Central Alerts Topic: Single SNS topic shared by all Hummingbird stacks
- PromQL Alarms: 14 alarms evaluating Prometheus metrics in CloudWatch (credential expiry, service health, K8s workload, agent health)
- Operational Dashboards: 4 dashboards visualizing OTLP metrics via PromQL (service health, agent operations, K8s health, credential expiry)
- Email Subscription: Optional email subscription with automatic confirmation workflow
- Opt-in per Stack: Consumer stacks receive the topic ARN as a parameter; alarms are wired only when the ARN is non-empty
Architecture
Metrics flow from Kubernetes services to CloudWatch via an OpenTelemetry Collector, where PromQL alarms and dashboards evaluate them:
flowchart TD
subgraph k8s [OCP Clusters]
dashboard["hummingbird-dashboard\n/metrics :8080"]
status["hummingbird-status worker\n/metrics :9090"]
agent["hummingbird-agent\n/metrics :9090"]
ksm["kube-state-metrics\n/metrics :8080"]
creds["credential-metrics sidecar\n/credentials.prom :8080"]
otel["OTel Collector\nprometheus receiver\n+ sigv4auth"]
end
subgraph aws [AWS us-east-1]
cwOtlp["CloudWatch OTLP Endpoint"]
cwAlarm["PromQL Alarms"]
cwDash["PromQL Dashboards"]
sns["SNS Topic"]
email["email alert group"]
end
dashboard --> otel
status --> otel
agent --> otel
ksm --> otel
creds --> otel
otel -->|"OTLP/HTTP + SigV4"| cwOtlp
cwOtlp --> cwAlarm
cwOtlp --> cwDash
cwAlarm --> sns
sns --> email
Key properties:
- Pull-based: the OTel Collector scrapes Prometheus /metrics endpoints every 60 seconds using Kubernetes service discovery
- Services opt in via prometheus.io/* annotations on their K8s Service
- Metrics retain their original Prometheus names in CloudWatch (no normalization)
- Each cluster adds a cluster external label so metrics are distinguishable
- All 6 OCP clusters run the collector; credential metrics only emit from mpp-prod (data is repo-wide)
Metrics Pipeline
Application Metrics
| Source | Metrics | Port | Key series |
|---|---|---|---|
| hummingbird-dashboard | HTTP requests, latency, DB pool | 8080 | http_requests_total, http_request_duration_seconds, db_pool_connections |
| hummingbird-status worker | SQS processing, ingestion | 9090 | sqs_messages_processed_total, ingest_records_total |
| hummingbird-agent | LLM cost/tokens, workflows, events, SQS | 9090 | hummingbird_agent_llm_*, hummingbird_agent_workflow_*, hummingbird_agent_events_total |
Infrastructure Metrics
| Source | Metrics | Port | Key series |
|---|---|---|---|
| kube-state-metrics | K8s object state | 8080 | kube_deployment_*, kube_pod_*, kube_statefulset_*, kube_job_* |
| credential-metrics sidecar | Token expiry/age | 8080 | cki_token_expires_at, cki_token_created_at |
On-Cluster Components
OTel Collector (kubernetes/otel-collector/ in infrastructure repo):
- Contrib distribution (ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib)
- Uses prometheus receiver with kubernetes_sd_configs (role: endpoints) for annotation-based auto-discovery across all Hummingbird namespaces
- sigv4auth extension for IAM authentication to CloudWatch OTLP
- metricstarttime processor (true_reset_point) to satisfy CloudWatch's StartTimeUnixNano requirement
- metric_relabel_configs drops *_created gauges (invalid OTLP from prometheus_client)
kube-state-metrics (kubernetes/kube-state-metrics/ in infrastructure repo):
- Image: quay.io/cki/mirror/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.17.0
- Monitors all non-config Hummingbird namespaces (hummingbird--internal, hummingbird--runner, hummingbird--agent-sandbox, hummingbird--factory-sandbox, hummingbird--k8s-test, hummingbird--runner-jobs)
- Scoped via --namespaces flag and per-namespace RBAC (Role + RoleBinding)
- --metric-allowlist limits cardinality to deployment, pod, statefulset, job, and container status metrics
Credential metrics sidecar (nginx container in OTel Collector pod):
- Serves a static Prometheus text file from a ConfigMap volume
- ConfigMap populated by credential-metrics/deploy.sh CI job on merge to main (mpp-prod only)
- Generated from secrets.yml metadata via cki_tools.credentials.metrics
- Volume uses optional: true so the collector starts even without the ConfigMap
Prerequisites
- AWS CLI configured with appropriate credentials (IAM permissions for SNS, CloudFormation)
- Podman or Docker (for containerized SAM build/deploy)
Deployment
Build and deploy using containerized AWS SAM CLI:
cd hummingbird-monitoring
make build # Build SAM application
make deploy # First deployment (interactive/guided)
make redeploy # Subsequent deployments (non-interactive)
Deployment outputs:
- AlertsTopicArn - SNS topic ARN (for consumer stacks)
- AlertsTopicName - SNS topic name
- ServiceHealthDashboardUrl - Service Health dashboard URL
- AgentOperationsDashboardUrl - Agent Operations dashboard URL
- K8sHealthDashboardUrl - Kubernetes Health dashboard URL
- CredentialExpiryDashboardUrl - Credential Expiry dashboard URL
Parameters
| Parameter | Description | Default |
|---|---|---|
| TopicName | SNS topic name | myapp-prod-alerts |
| AlertEmail | Email for notifications (empty=skip) | "" |
When AlertEmail is non-empty, an SNS email subscription is created. The
recipient must confirm via a link in the confirmation email before notifications
start flowing.
PromQL Alarms
PromQL alarms evaluate metrics ingested via the CloudWatch OTLP endpoint (pushed by the OTel Collector running in Kubernetes). They use duration-based pending/recovery periods rather than traditional datapoint evaluation.
Credential Expiry
| Alarm | Condition | Eval Interval |
|---|---|---|
| token-expiry | Active token expires within 28 days | 1 h |
| token-age | Active token older than 337 days | 1 h |
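Both thresholds reduce to comparisons on the Unix timestamps carried by cki_token_expires_at and cki_token_created_at. The sketch below mirrors that logic in Python for illustration only; the function names are hypothetical, and the deployed alarms express the same conditions in PromQL, which is not reproduced here.

```python
SECONDS_PER_DAY = 86_400


def expires_soon(expires_at: float, now: float, days: int = 28) -> bool:
    """True when the token expires within `days` of `now` (Unix seconds)."""
    return expires_at - now < days * SECONDS_PER_DAY


def too_old(created_at: float, now: float, days: int = 337) -> bool:
    """True when the token was created more than `days` before `now`."""
    return now - created_at > days * SECONDS_PER_DAY


# Hypothetical token created 340 days ago with a 365-day lifetime.
now = 1_800_000_000
created_at = now - 340 * SECONDS_PER_DAY
expires_at = created_at + 365 * SECONDS_PER_DAY

print(expires_soon(expires_at, now), too_old(created_at, now))
```

Since 337 + 28 = 365, a token with a one-year lifetime (an assumption about rotation policy, not something stated above) would cross both thresholds on the same day.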
Service Health
| Alarm | Condition | Eval Interval |
|---|---|---|
| service-down | Scrape target down for 5 min | 60 s |
| high-error-rate | HTTP 5xx rate > 5% for 5 min | 60 s |
| sqs-processing-errors | SQS processing errors for 5 min | 60 s |
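To make the high-error-rate condition concrete, the sketch below computes a 5xx ratio from counter increases over a window and compares it to the 5% threshold. The helper names are hypothetical, and the guard for a window with no traffic is a choice made for this example rather than documented alarm behavior.

```python
def error_ratio(errors_delta: float, total_delta: float) -> float:
    """Fraction of requests that failed, given counter increases over a window."""
    if total_delta <= 0:
        # No traffic in the window: treat as healthy instead of dividing by zero.
        return 0.0
    return errors_delta / total_delta


def high_error_rate(errors_delta: float, total_delta: float, threshold: float = 0.05) -> bool:
    """True when the 5xx share of requests is strictly above the threshold."""
    return error_ratio(errors_delta, total_delta) > threshold


# 6 errors out of 100 requests breaches 5%; exactly 5 out of 100 does not.
print(high_error_rate(6, 100), high_error_rate(5, 100))
```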
Kubernetes Workload Health
| Alarm | Condition | Eval Interval |
|---|---|---|
| deployment-not-ready | Zero ready replicas for 10 min | 60 s |
| statefulset-not-ready | Zero ready replicas for 10 min | 60 s |
| pod-crash-looping | CrashLoopBackOff for 15 min | 60 s |
| pod-oom-killed | OOM-killed with restarts in 15 min | 60 s |
| image-pull-failed | ImagePullBackOff for 10 min | 60 s |
| cronjob-failed | CronJob-owned Job failed | 5 min |
Agent Health
| Alarm | Condition | Eval Interval |
|---|---|---|
| agent-high-llm-error-rate | LLM error rate > 10% over 15 min | 60 s |
| agent-high-cost | Projected daily LLM spend > $50 | 1 h |
| agent-event-backlog | Event processing errors for 5 min | 60 s |
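One way to read "projected daily LLM spend" is to extrapolate the spend seen in a recent window to 24 hours. The sketch below assumes a monotonically increasing cost counter in dollars and ignores counter resets; the function name and the one-hour window are illustrative, not taken from the alarm definition.

```python
SECONDS_PER_DAY = 86_400


def projected_daily_spend(cost_start: float, cost_end: float, window_seconds: float) -> float:
    """Extrapolate the spend observed in a window to a full day (dollars)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (cost_end - cost_start) * (SECONDS_PER_DAY / window_seconds)


# $2.50 spent in the last hour projects to $60/day, above the $50 threshold.
projection = projected_daily_spend(10.0, 12.5, 3600)
print(projection, projection > 50)
```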
All alarms notify the shared SNS alerts topic on both ALARM and OK transitions.
Dashboards
Four CloudWatch dashboards provide at-a-glance operational visibility using PromQL queries against OTLP-ingested metrics. Dashboards refresh automatically and cost $3/dashboard/month ($12/month total).
| Dashboard | Purpose |
|---|---|
| ${stack}-service-health | Request rates, error rates, latency, targets |
| ${stack}-agent-operations | LLM cost, tokens, workflows, events, errors |
| ${stack}-k8s-health | Deployments, StatefulSets, restarts, CronJobs |
| ${stack}-credential-expiry | Token expiry timelines and age tracking |
Service Health Dashboard
- HTTP request rate by service
- 5xx error rate by service
- Request latency percentiles (p50/p95/p99)
- Scrape target up/down status
- SQS processing rate by status
- Database connection pool usage
Agent Operations Dashboard
- LLM cost rate by model ($/hr)
- Token usage by model and direction (tokens/hr)
- Workflow duration p95 by workflow type
- Event handling rate by action
- LLM request error rate
- Sandbox acquire latency p95
Kubernetes Health Dashboard
- Deployment ready replica counts
- StatefulSet ready replica counts
- Pod restart rate per hour
- Containers in waiting state (by reason)
- Failed CronJobs
Credential Expiry Dashboard
- Days until nearest token expiry (minimum)
- Count of tokens expiring within 28 days
- Oldest token age
- Per-token expiry timeline
- Per-token age timeline
Operations
Adding a New Scrape Target
To have the OTel Collector scrape a new service:
- Ensure the application exposes a Prometheus /metrics endpoint
- Add annotations to the K8s Service (in the infrastructure repo):

  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "9090"
      prometheus.io/path: "/metrics"

- Deploy. The collector auto-discovers annotated services via kubernetes_sd_configs
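The annotation convention can be sketched as a small function that turns a Service's annotations into a scrape URL. This only models the common prometheus.io/* convention; the collector's actual relabel rules are not shown in this document, and scrape_url, the default path, and the example host are assumptions.

```python
from typing import Optional


def scrape_url(annotations: dict[str, str], host: str) -> Optional[str]:
    """Return the URL a scraper would target, or None if the Service has not opted in."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None
    # Falling back to /metrics when no path annotation is set is an assumption.
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{host}:{port}{path}"


service_annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "9090",
    "prometheus.io/path": "/metrics",
}
# "my-service.example.svc" is a placeholder host, not a real Hummingbird service.
print(scrape_url(service_annotations, "my-service.example.svc"))
```

Note that the annotation values are strings, which is why "true" and "9090" are quoted in the manifest.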
Adding a New PromQL Alarm
Add an AWS::CloudWatch::Alarm resource to hummingbird-monitoring/template.yaml:
MyNewAlarm:
Type: AWS::CloudWatch::Alarm
Metadata:
cfn-lint:
config:
ignore_checks: [E3002, E3014]
Properties:
AlarmName: !Sub ${AWS::StackName}-my-new-alarm
AlarmDescription: Description of what this detects
ActionsEnabled: true
AlarmActions: [!Ref AlertsTopic]
OKActions: [!Ref AlertsTopic]
EvaluationCriteria:
PromQLCriteria:
PromQLExpression: |
some_promql_expression > threshold
EvaluationInterval: 60
AlertOnMissing: false
PendingDuration: 300
RecoveryDuration: 300
The cfn-lint metadata ignore is required because cfn-lint does not yet
recognize EvaluationCriteria.PromQLCriteria (CloudWatch preview feature).
Adding a New Dashboard Widget
Dashboard widgets use the undocumented "type": "chart" format with PromQL:
{
  "type": "chart",
  "x": 0, "y": 0, "width": 12, "height": 6,
  "properties": {
    "title": "Widget Title",
    "view": "line",
    "region": "us-east-1",
    "data": {
      "queries": [{
        "id": "q1",
        "type": "cloudwatch-metrics",
        "language": "PromQL",
        "query": "sum(rate(my_metric_total[5m])) by (label)",
        "step": 60
      }]
    }
  }
}
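When several widgets share the same shape, generating them can cut down on copy-paste errors. The sketch below builds the structure shown above with Python's json module. The promql_widget helper is hypothetical, and because the widget format is undocumented, the field names are copied from the example rather than from an official schema.

```python
import json


def promql_widget(title: str, query: str, x: int = 0, y: int = 0,
                  width: int = 12, height: int = 6,
                  region: str = "us-east-1", step: int = 60) -> dict:
    """Build one PromQL chart widget using the field layout shown above."""
    return {
        "type": "chart",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "view": "line",
            "region": region,
            "data": {
                "queries": [{
                    "id": "q1",
                    "type": "cloudwatch-metrics",
                    "language": "PromQL",
                    "query": query,
                    "step": step,
                }]
            },
        },
    }


widget = promql_widget("Widget Title", "sum(rate(my_metric_total[5m])) by (label)")
# A dashboard body is a JSON document with a top-level "widgets" list.
dashboard_body = json.dumps({"widgets": [widget]}, indent=2)
print(dashboard_body)
```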
Troubleshooting
- Check target health: Query up in CloudWatch Query Studio; each scraped service appears as a time series with value 1 (up) or 0 (down)
- Check collector logs: Run oc logs deployment/otel-collector -n hummingbird--internal and look for exporter errors or scrape failures
- Verify metric names: Use CloudWatch Query Studio with a simple PromQL query like {__name__=~"http.*"} to discover available metrics
- Stale credential metrics: Re-run the credential_metrics CI job (or merge any change to secrets.yml) to regenerate the ConfigMap
Consumer Stacks
Each consumer stack accepts an AlertTopicArn parameter. When non-empty,
CloudWatch alarms publish to the shared topic on state transitions. When empty
(the default), alarms still fire but do not send notifications.
| Stack | Alarms |
|---|---|
| container-catalog | Lambda errors, API 5xx, DLQ depth, metrics iterator age |
| hummingbird-agent | DLQ depth (events, work) |
| hummingbird-status | DLQ depth (events) |
Wiring Pattern
Consumer stacks use this pattern to conditionally wire alarms:
Parameters:
  AlertTopicArn:
    Type: String
    Default: ""
Conditions:
  HasAlertTopicArn: !Not [!Equals [!Ref AlertTopicArn, ""]]
Resources:
  MyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      # ... alarm configuration ...
      AlarmActions: !If [HasAlertTopicArn, [!Ref AlertTopicArn], !Ref "AWS::NoValue"]
      OKActions: !If [HasAlertTopicArn, [!Ref AlertTopicArn], !Ref "AWS::NoValue"]
Adding a New Alarm
- Define the AWS::CloudWatch::Alarm resource in the stack's template.yaml with AlarmActions/OKActions using the pattern above.
- The stack must already have the AlertTopicArn parameter and HasAlertTopicArn condition (all three consumer stacks already do).
Adding a New Consumer Stack
- Add AlertTopicArn parameter and HasAlertTopicArn condition to the stack's template.yaml.
- Add ALERT_TOPIC_ARN to the stack's vars.sh in the infrastructure repo.
- Add AlertTopicArn to parameter_overrides in the stack's samconfig.toml.j2.
Decision Records
DR-1: CloudWatch OTLP over self-hosted Prometheus or AMP
We chose the CloudWatch OTLP endpoint (public preview, April 2026) over deploying a full Prometheus + AlertManager + Grafana stack or Amazon Managed Prometheus (AMP). This unifies all alerting — existing CloudWatch Alarms (Lambda errors, DLQ depth, SLO) and new PromQL alarms — under one system with one notification path (SNS). No new managed services to provision. The PromQL subset is sufficient for our alert rules. Preview status is acceptable at our scale (~400-600 series, 6 clusters): if the preview breaks, existing alarms are unaffected and the fallback path (switch to AMP) only requires changing the OTel Collector exporter config.
DR-2: OTel Collector over CloudWatch Agent or Prometheus agent
The OTel Collector (contrib distribution) is the only tool that natively
bridges Prometheus scraping to CloudWatch OTLP in a single binary. The
CloudWatch Agent drops histograms and is limited to 30 dimensions. A
Prometheus agent requires AMP as intermediary (remote_write targets AMP, not
CloudWatch OTLP). The OTel Collector handles all scrape targets uniformly and
its exporter is the only config that changes if we switch backends.
DR-3: kube-state-metrics over OTel k8s_cluster receiver
kube-state-metrics (KSM) exposes kube_* Prometheus metrics — the industry
standard used by every community alert rule (kubernetes-mixin, CKI). The OTel
Collector k8s_cluster receiver emits different metric names
(k8s.deployment.available etc.), has documented parity gaps, and is still
beta. Using KSM means all PromQL alert rules from the community work
without translation.
DR-4: Credential metrics via ConfigMap and sidecar
Credential metadata (expiry dates, creation dates from secrets.yml) rarely
changes but PromQL requires recent data points within its lookback window. A
CI-direct push approach would need hourly scheduled pipelines for data that
changes a few times per year. Instead, CI generates a Prometheus text file on
merge-to-main and applies it as a K8s ConfigMap. An nginx sidecar serves it
over HTTP. The collector scrapes every 60 seconds — eliminating staleness
without scheduled pipelines. The ConfigMap contains only metadata (names,
timestamps), not secret values.
DR-5: CloudWatch dashboards over Grafana
CloudWatch dashboards are native to AWS, require zero new infrastructure, and
are defined as AWS::CloudWatch::Dashboard resources in SAM alongside alarms.
This aligns with the “minimize new infrastructure” principle. Grafana (AMG or
self-hosted) is documented as an upgrade path — the OTel Collector
architecture does not change, only the query consumer differs.
DR-6: Prometheus client library over OTel SDK
All Hummingbird services use prometheus_client (Python) for instrumentation,
matching the existing dashboard and status worker pattern. The OTel Collector
already scrapes Prometheus endpoints natively. Migration to the OTel Python
SDK is mechanical (Counter to Counter, Gauge to ObservableGauge, Histogram to
Histogram) and can happen per-service when desired. The collector supports
both prometheus and otlp receivers, so the architecture does not change
either way.
Future Upgrades
Grafana: If richer visualization is needed, Grafana can use CloudWatch’s PromQL endpoint as a Prometheus datasource with SigV4 authentication. The OTel Collector architecture stays the same — only the dashboard platform changes. Dashboard JSON definitions can be version-controlled in the repo.
OTel SDK migration: New services can use the OpenTelemetry Python SDK
(opentelemetry-instrumentation-fastapi) for vendor-neutral instrumentation
and trace-metric correlation. Existing prometheus_client services continue
working unchanged.
Post-preview pricing: CloudWatch OTLP is free during preview. Post-GA
pricing is TBD. With ~400-600 series at 60s scrape (~17-26M samples/month),
estimated cost at AMP-equivalent rates would be ~$50-80/month. The KSM
--metric-allowlist keeps cardinality bounded.
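The sample-volume figure above follows directly from series count and scrape interval. The short calculation below reproduces it, assuming a 30-day month; it covers only the sample count and deliberately leaves out the dollar estimate, which depends on pricing that is still TBD.

```python
SECONDS_PER_DAY = 86_400


def samples_per_month(series: int, scrape_interval_s: int = 60, days: int = 30) -> int:
    """Samples ingested per month when every series is scraped once per interval."""
    scrapes_per_month = days * SECONDS_PER_DAY // scrape_interval_s
    return series * scrapes_per_month


low = samples_per_month(400)
high = samples_per_month(600)
print(f"{low / 1e6:.1f}M to {high / 1e6:.1f}M samples/month")  # 17.3M to 25.9M samples/month
```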
Development
This is a pure infrastructure project (no application code). See the main README for SAM build/deploy commands.
License
This project is licensed under the GNU General Public License v3.0 or later - see the LICENSE file for details.