Hummingbird Agent

An event-driven LLM agent that investigates CI/CD failures and posts findings as GitLab merge request notes. The agent executes markdown-defined workflows using tool calling, with all data processing running inside an isolated sandbox container.

For the architectural design rationale, module boundaries, security invariants, and design decision registry, see Agent Design. For the model loop wire format, see Agent Model Loop.

flowchart TD
    Pipeline["Pipeline Fails"]
    MREvent["MR Created / Updated"]
    Slash["/hummingbird command"]

    Pipeline -->|"event"| Agent
    MREvent -->|"event"| Agent
    Slash -->|"command"| Agent

    subgraph Agent ["Hummingbird Agent"]
        APIs["GitLab + Konflux +<br/>Testing Farm"]
        Model["LLM<br/>(Gemini / Claude)"]
        subgraph Sandbox ["Isolated Sandbox (no network)"]
            Tools["jq / python3 / yq<br/>data processing"]
        end
        APIs <-->|"data"| Model
        Model <-->|"commands"| Tools
    end

    Note["MR Note<br/>(analysis or review)"]

    Agent -->|"posts"| Note
    Note -->|"reply to continue"| Agent

Features

  • Workflow-driven analysis - Investigation logic lives in .md files, not in code; easy to iterate without redeployment
  • Centralized YAML config - A single config file defines operational settings, workflows, enabled data sources with token env var names, project allowlists, and per-project limits
  • Sandboxed execution - All untrusted commands (jq, python3, shell) run in an isolated container, never on the host
  • Five sandbox backends - Podman for local development (network-isolated), direct K8s pod creation, Deployment-backed pod pool for low-latency production use (restricted-v2 SCC compliant), KubeVirt VM (direct), or VMIRS-backed VM pool for heavier workloads requiring full OS isolation
  • Data source abstraction - GitLab, Konflux, and Testing Farm are registered as tool-calling functions the model invokes directly
  • Auto-spill for large outputs - Stdout/stderr and data source responses exceeding 4 KB are automatically saved to sandbox files with a compact preview returned to the model, keeping context usage bounded
  • Prompt caching - Gemini uses implicit server-side caching automatically; Claude uses explicit sliding-window cache breakpoints that reduce input token costs by ~80% on multi-turn agent runs
  • Token budget management - Per-call context ceiling and iteration-based soft/hard limits prevent runaway sessions
  • Session persistence - Conversation history, transcript, and sandbox files saved to S3 (production) or local directory (development) for debugging and future session resumption. Sessions are stored in a provider-neutral format, enabling model switching between conversations.

Architecture

flowchart LR
    subgraph input [Input]
        SQS["SQS Queue"]
        CLI["CLI --event"]
    end

    subgraph agentLoop [Agent Loop]
        WF["Workflow .md<br/>(system prompt)"]
        LLM["LLM<br/>(Gemini / Claude)"]
        TR["Tool Registry"]
    end

    subgraph tools [Tools]
        SE["sandbox_exec"]
        FTS["fetch_to_sandbox"]
        DS["Data Sources"]
    end

    subgraph sandbox [Sandbox Container]
        JQ["jq / python3 / yq"]
        Files["Spilled files"]
    end

    subgraph external [External APIs]
        GL["GitLab API"]
        KX["Konflux K8s"]
        TF["Testing Farm"]
    end

    subgraph output [Output]
        Note["GitLab MR Note"]
        Session["Session State"]
    end

    SQS --> WF
    CLI --> WF
    WF --> LLM
    LLM -->|"tool calls"| TR
    TR --> SE --> sandbox
    TR --> FTS --> sandbox
    TR --> DS
    DS --> GL
    DS --> KX
    DS --> TF
    DS -->|"auto-spill"| sandbox
    LLM -->|"final text"| Note
    LLM --> Session

An event (CLI --event JSON or SQS message) identifies a GitLab project and MR IID. The config file maps project paths to workflows and provides action, max_iterations, enabled data sources, and token env var names. Both run and serve use this config. The markdown body of the prompt file becomes the LLM system prompt. The agent loop iterates until the model produces a final text response or hits the iteration/context limit. Tool calls are dispatched through the ToolRegistry: sandbox_exec runs shell commands, fetch_to_sandbox pipes data source output into the sandbox, and direct data source calls return results inline (or auto-spill large responses to files).

For a detailed walkthrough of the model loop – what gets sent to the model each iteration, how tool calls flow, the exact wire format, and how user replies integrate for session resumption – see Agent Model Loop.

Event-Driven Triggers

In production, the agent consumes events from an SQS queue subscribed to the central SNS topic. The SNS filter policy delivers three event types:

  • gitlab::pipeline – Fires when a GitLab CI pipeline completes. The agent triggers on status=failed pipelines from merge_request_event sources where the triggering user has at least Developer access on the project. Only workflows with trigger: pipeline are executed. Since the pipeline stays open until all Konflux external stages resolve, this naturally waits for all builds and tests to finish before triggering.

  • gitlab::merge_request – Fires on MR open and update events. Only workflows with trigger: merge_request are executed. The agent triggers in three cases: a new MR is opened (action=open), code is pushed to an existing MR (action=update with oldrev), or a draft MR is marked as ready (action=update with a changes.draft transition from true to false). All other update events (labels, assignees, description changes) are skipped – they carry no new code. Draft MRs are skipped; SHA-based deduplication prevents reviewing the same code revision twice. Access checks and trigger_rules filtering evaluate the push author: for open and push events the webhook user is the pusher; for undraft transitions the agent resolves the actual pusher from MR system notes so that the person undrafting cannot bypass the access check on code pushed by an untrusted user.

  • gitlab::note – MR comment events. Two sub-flows:

    • Slash command (/hummingbird <workflow-name>): triggers a specific workflow. Prefix matching is supported (e.g. /hummingbird analyze matches analyze-failures). /hummingbird or /hummingbird help lists available workflows. The note author must have Developer+ access on the project. Optional runtime overrides can be appended: /hummingbird code-review model=claude-sonnet-4-6 max_iterations=25.
    • Reply to agent note: when a user replies to an existing agent note (which contains a session marker), the agent loads the previous session from S3 (conversation history + sandbox files) and continues the conversation with the user’s reply as input. The system prompt includes a CONTINUATION_PROMPT that prevents the model from re-running the full workflow. If the session is not found (expired/deleted), the agent falls back to a cold start. Replies may include overrides on a separate line (e.g. /hummingbird model=claude-opus-4-6); override lines are stripped from the user message. Overrides persist in the session until explicitly changed.

    Notes generated by the agent itself (containing session markers) are skipped to prevent infinite loops. Reply threading respects the internal_notes config: if the project requires internal notes but the original thread was public, the reply is posted as a new top-level internal note instead.

Both pipeline and merge_request triggers apply per-workflow trigger rules filtering. Each workflow may define an ordered trigger_rules array; the first matching rule decides whether the event is allowed or denied (implicit deny on fallthrough). Rules can match on pipeline status, user_regex/user_regexes, branch_regex/branch_regexes, and title_regex/title_regexes. All conditions within a rule are ANDed; multiple patterns within a regex field are ORed. The ! prefix negates a pattern. If no trigger_rules are specified, merge_request triggers allow all events and pipeline triggers default to allowing only failed status (preserving backward compatibility).
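The rule semantics above (first match wins, AND across fields, OR within a field, `!` negation, implicit deny) can be illustrated with a short sketch. The flat event dict, the `status` key name, and the function names are assumptions for illustration; regexes use full-match semantics, as noted in the example config.

```python
import re


def _field_matches(patterns: list[str], value: str) -> bool:
    """Patterns within one field are ORed; a leading '!' negates a pattern."""
    for pattern in patterns:
        negated = pattern.startswith("!")
        hit = re.fullmatch(pattern[1:] if negated else pattern, value) is not None
        if hit != negated:
            return True
    return False


def rule_matches(rule: dict, event: dict) -> bool:
    """Every condition present in the rule must match (AND)."""
    if "status" in rule and rule["status"] != event.get("status"):
        return False
    for key in ("user", "branch", "title"):
        patterns = list(rule.get(f"{key}_regexes", []))
        if f"{key}_regex" in rule:
            patterns.append(rule[f"{key}_regex"])
        if patterns and not _field_matches(patterns, event.get(key, "")):
            return False
    return True


def evaluate(rules: list[dict], event: dict) -> bool:
    """Return True if the event is allowed by the ordered rule chain."""
    for rule in rules:
        if rule_matches(rule, event):
            return rule["action"] == "allow"
    return False  # implicit deny on fallthrough


rules = [
    {"user_regex": r"renovate\[bot\]", "action": "deny"},
    {"branch_regex": "chore/.*", "action": "deny"},
    {"action": "allow"},  # catch-all
]
allowed = evaluate(rules, {"user": "alice", "branch": "feature/x"})
```

This sketch only covers an explicit rule list; the defaults that apply when `trigger_rules` is absent would be chosen before `evaluate` is called.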

The legacy ignore_users and ignore_branches fields are still accepted and automatically desugared into equivalent trigger_rules (deny + catch-all allow). They cannot be mixed with explicit trigger_rules.

Rate limiting is per-workflow: each workflow’s thread count is tracked independently via JSON session markers that embed the workflow name and commit SHA. The max_runs_per_mr limit applies separately to each workflow on a given MR.

Events flow through a two-stage SQS pipeline. A slim ingress router forwards webhooks from the standard queue to an SQS FIFO queue, grouped by discussion_id for note events (ensuring same-discussion ordering) and by SQS MessageId for pipeline/MR events (no serialization needed). Phase 1 handlers validate the event and resolve the session, then post a continuation message back to the FIFO grouped by session_id. Phase 2 picks up the continuation and runs the workflow. This ensures all work for a given session is serialized even across multiple discussion threads.
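The grouping choice can be sketched as a pair of small functions. The function names, the key prefixes, and the flat `body["discussion_id"]` lookup are assumptions for illustration; the real router may read these values from a different place in the payload.

```python
def ingress_group_id(event_type: str, body: dict, sqs_message_id: str) -> str:
    """Pick the FIFO message group for the router hop (stage one)."""
    if event_type == "gitlab::note":
        # Notes in the same discussion share a group, so they stay ordered.
        return f"discussion:{body['discussion_id']}"
    # Pipeline and MR events need no ordering: one group per message.
    return f"msg:{sqs_message_id}"


def continuation_group_id(session_id: str) -> str:
    """Phase 2 work is serialized per session, across discussion threads."""
    return f"session:{session_id}"


group = ingress_group_id("gitlab::note", {"discussion_id": "abc123"}, "m-1")
```

Because FIFO queues deliver messages within a group in order and one at a time, grouping by session in the second stage is what serializes all work for a session.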

The SQS infrastructure is defined in template.yaml (SAM/CloudFormation): a standard ingress queue (60s visibility timeout for the router hop) and a FIFO work queue (30-minute visibility timeout for workflow execution).

Model Configuration

The agent supports multiple LLM providers via a unified adapter interface.

Gemini

  • API key (local dev) – Set GOOGLE_API_KEY. Calls generativelanguage.googleapis.com directly. No region configuration needed.
  • Vertex AI (production) – Set GOOGLE_CLOUD_PROJECT and configure model_regions in the YAML settings. Uses Application Default Credentials (ADC) via google-auth. For OpenShift, mount a service account key JSON file and set GOOGLE_APPLICATION_CREDENTIALS, or use workload identity.

Gemini uses implicit server-side caching – repeated prefixes are automatically cached by the Vertex AI backend with no opt-in required. Cached input tokens are billed at 25% of the base input rate. The agent tracks cached token counts from API responses for cost estimation.

Anthropic Claude

Claude models are accessed via Vertex AI using the same GCP project (GOOGLE_CLOUD_PROJECT) with the region resolved from model_regions. Enable the desired model in the GCP Model Garden. Use model: claude-sonnet-4-20250514 in workflow config.

Prompt caching is enabled automatically for Claude models. The agent places explicit cache breakpoints on messages so that the full conversation prefix is served from cache on every turn after the first. Cache writes cost 1.25x the base input rate; cache reads cost 0.10x (90% discount). For a typical 20-iteration agent run, this reduces input token costs by approximately 80%. Ephemeral messages (iteration warnings, nudges) are excluded from cache writes to avoid polluting the cache with transient content.
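As a rough check on the ~80% figure, the sketch below models a run in which every turn appends the same number of new input tokens. The equal-sized turns, the 1,000-token turn size, and the function name are assumptions for illustration; the system prompt and ephemeral messages are ignored.

```python
def input_cost_units(turns: int, tokens_per_turn: int, cached: bool) -> float:
    """Total input cost in base-rate token units over a multi-turn run."""
    total = 0.0
    for turn in range(1, turns + 1):
        prefix = (turn - 1) * tokens_per_turn  # tokens already sent on earlier turns
        if cached:
            # Prefix is read from cache (0.10x); new tokens are written (1.25x).
            total += 0.10 * prefix + 1.25 * tokens_per_turn
        else:
            # Without caching, the whole conversation is billed at the base rate.
            total += prefix + tokens_per_turn
    return total


baseline = input_cost_units(20, 1000, cached=False)   # 210,000 units
with_cache = input_cost_units(20, 1000, cached=True)  # about 44,000 units
savings = 1 - with_cache / baseline
print(f"{savings:.1%}")  # prints 79.0%
```

Under these assumptions the cached run costs 25,000 units for writes plus 19,000 for reads, against 210,000 uncached, which is consistent with the approximate 80% reduction quoted above.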

Region configuration

GCP regions are configured via settings.model_regions in the config file – a map of model name prefixes to GCP regions. The agent resolves the region by longest-prefix match on the effective model name (same algorithm as cost estimation). Example:

settings:
  model_regions:
    gemini-2.5-pro: us-east5
    gemini-3.1-pro: global
    claude: global

At least one of GOOGLE_API_KEY or GOOGLE_CLOUD_PROJECT must be set. If both are set, the API key takes precedence for Gemini models. Claude models always require Vertex AI mode (GOOGLE_CLOUD_PROJECT + model_regions); GOOGLE_API_KEY direct mode is for Gemini only.

Prerequisites

  • Python 3.11+
  • Podman (for local sandbox) or kubectl (for K8s sandbox)
  • Authentication: GOOGLE_API_KEY or GOOGLE_CLOUD_PROJECT (see above)
  • For Claude models: enable the desired model in GCP Model Garden
  • GitLab tokens referenced in the config file (model tool tokens, orchestrator tokens)
  • Kubeconfig with access to the Konflux cluster (if using Konflux data sources)

Installation

cd hummingbird-agent
pip install -e .

Usage

Local development (run)

The run command uses the config file for workflow lookup, data source registration, and token resolution – the same code path as serve. By default, results are printed to stdout (dry-run). Use --execute to post the result as a GitLab MR note.

There are two ways to select what to run:

Direct workflow selection (--workflow + --project + --event):

# Run a specific workflow on an MR, Podman sandbox, print to stdout
hummingbird-agent run \
    --workflow analyze-failures \
    --project org/group/project \
    --event '{"iid": 123, "sha": "abc123"}'

# Same but with K8s sandbox
hummingbird-agent run \
    --workflow analyze-failures \
    --project org/group/project \
    --event '{"iid": 123, "sha": "abc123"}' \
    --context my-cluster/my-namespace

# Post the result as a GitLab MR note (also saves session to S3 if configured)
hummingbird-agent run \
    --workflow analyze-failures \
    --project org/group/project \
    --event '{"iid": 123, "sha": "abc123"}' \
    --execute

# Save session locally for debugging (context.json, transcript.md, sandbox.tar.gz)
hummingbird-agent run \
    --workflow analyze-failures \
    --project org/group/project \
    --event '{"iid": 123, "sha": "abc123"}' \
    --save-session /tmp/my-session

# Resume from a saved session with a follow-up question
hummingbird-agent run \
    --workflow analyze-failures \
    --project org/group/project \
    --event '{"iid": 123, "sha": "abc123"}' \
    --resume-session /tmp/my-session \
    --message "Can you look at the clair-scan timeout more closely?"

# Chain: resume and save the new session for another round
hummingbird-agent run \
    --event-file event.json \
    --resume-session /tmp/my-session \
    --message "What layer hash failed?" \
    --save-session /tmp/my-session-2

Event-file replay (--event-file): routes the event through the same project-index lookup as serve, but skips rate limiting and status filtering:

# Replay a real webhook event, dry-run with Podman
hummingbird-agent run \
    --event-file event.json

# Replay with K8s sandbox and post notes
hummingbird-agent run \
    --event-file event.json \
    --context my-cluster/my-namespace \
    --execute

Production (serve)

# Poll SQS queue for events (pool sandbox by default, posts results as MR notes)
CONFIG_PATH=config.yml hummingbird-agent serve

Requires CONFIG_PATH pointing to a config file with settings.sqs_queue_url and settings.sqs_fifo_queue_url set. The config file defines which workflows run on which projects, with per-project limits and data source token mappings. Handles SIGTERM/SIGINT for graceful shutdown. A background router thread forwards events from the standard queue to the FIFO; the main thread consumes the FIFO with a semaphore-gated thread pool (controlled by settings.max_concurrent_agents) so excess messages stay in SQS for other instances.

Config hot-reload: In serve mode a background thread polls the config file for changes (every 5 seconds by default). When the file changes, the new config is validated and atomically swapped in – subsequent event dispatches use the updated config. If the new config is invalid, the previous config is kept and a warning is logged. No restart required for config changes.
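A minimal sketch of the poll, validate, and swap cycle. Detecting changes by file modification time, the `ConfigHolder` class, and the injected `load` callable (which must raise on an invalid config) are assumptions; the actual implementation may detect changes differently.

```python
import os
import threading
from typing import Callable


class ConfigHolder:
    """Holds the active config and swaps it only after a successful reload."""

    def __init__(self, path: str, load: Callable[[str], dict]):
        self._path = path
        self._load = load  # parses and validates; raises on invalid config
        self._lock = threading.Lock()
        self._mtime = os.stat(path).st_mtime_ns
        self._config = load(path)

    def get(self) -> dict:
        with self._lock:
            return self._config

    def check_once(self) -> bool:
        """Reload if the file changed; return True when a new config was swapped in."""
        mtime = os.stat(self._path).st_mtime_ns
        if mtime == self._mtime:
            return False
        self._mtime = mtime  # remember it so a bad file is not re-validated every poll
        try:
            new_config = self._load(self._path)
        except Exception as exc:
            print(f"warning: invalid config, keeping previous one: {exc}")
            return False
        with self._lock:
            self._config = new_config  # single reference swap under the lock
        return True

    def watch(self, stop: threading.Event, interval: float = 5.0) -> None:
        """Poll until stop is set; intended to run in a background thread."""
        while not stop.wait(interval):
            self.check_once()
```

Event handlers would call `get()` once per dispatch and use that snapshot throughout, so a reload in the middle of a run does not change the config under a running workflow.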

The serve command also accepts --sandbox, --context, and --namespace for local development with a different sandbox backend (e.g. Podman).

CLI Options

run subcommand:

| Option | Description | Default |
|---|---|---|
| --event | Inline event JSON string (mutually exclusive with --event-file) | - |
| --event-file | Read event from a JSON file (mutually exclusive with --event) | - |
| --workflow | Workflow name from config file (requires --project) | - |
| --project | GitLab project path (requires --workflow) | - |
| --execute | Post result as GitLab MR note and save session to S3 | - |
| --save-session | Save session artifacts to this directory | - |
| --resume-session | Resume from a saved session directory (requires --message) | - |
| --message | Follow-up message for session resumption (requires --resume-session) | - |
| --sandbox | Sandbox backend (podman, k8s, k8spool, kubevirt, or kubevirtpool) | podman |
| --context | K8s context (implies K8s backend) | - |
| --namespace | K8s namespace | - |
| -v, --verbose | Enable debug logging | - |

serve subcommand:

| Option | Description | Default |
|---|---|---|
| --sandbox | Sandbox backend (podman, k8s, k8spool, kubevirt, or kubevirtpool) | k8spool |
| --context | K8s context (implies K8s backend) | - |
| --namespace | K8s namespace | - |
| -v, --verbose | Enable debug logging | - |

Configuration

Config file

Both run and serve use a single YAML config file. Set the path via CONFIG_PATH (default: config.example.yaml). The config has two sections: settings for operational parameters, and workflows for workflow definitions. For local development, keys in the settings section can be omitted; sensible defaults are used.

settings:
  gitlab_url: https://gitlab.com                # GitLab instance URL
  sandbox:                                      # sandbox pod configuration
    image: quay.io/.../gitlab-ci:latest         #   container image (k8s mode only)
    namespace: default                          #   K8s namespace (required for K8s/pool/kubevirt)
    active_deadline_seconds: 1800               #   pod hard timeout / reap interval
    linger_seconds: 300                         #   keep pod alive after success (pool mode, 0=disable)
    max_lingering_pods: 2                       #   max idle lingering pods before eviction (pool mode)
    vm_image: quay.io/.../vm-disk:latest        #   containerDisk image (kubevirt mode only)
    vm_memory: 1Gi                              #   VM guest memory (kubevirt mode only)
    vm_ssh_private_key: /path/to/key            #   SSH private key for pool VMs (kubevirtpool mode)
    metadata:                                   #   pod/VMI metadata (k8s/kubevirt mode)
      labels:
        app.kubernetes.io/name: hummingbird-agent-sandbox
    resources:                                  #   K8s resource requests/limits (k8s mode)
      requests:
        cpu: "100m"
        memory: "256Mi"
        ephemeral-storage: "256Mi"
      limits:
        cpu: "1"
        memory: "1Gi"
        ephemeral-storage: "2Gi"
  max_concurrent_agents: 4                      # max concurrent workflows (serve)
  sqs_queue_url: ""                             # standard SQS ingress queue URL (serve)
  sqs_fifo_queue_url: ""                        # FIFO work queue URL (serve)
  s3_session_bucket: ""                         # S3 bucket for session persistence
  model: gemini-3.1-pro-preview  # or claude-sonnet-4-20250514
  model_regions:                                 # model prefix -> GCP region
    gemini-2.5-pro: us-east5
    gemini-3.1-pro: global
    claude: global
  max_iterations: 30                             # default iteration limit
  max_runs_per_mr: 5                             # default per-MR rate limit
  internal_notes: true                           # default note visibility
  docs_url: https://gitlab.com/org/group/project/-/blob/main/docs/agent.md
  source_url: https://gitlab.com/org/group/project
  slack_url: https://slack.example.com/archives/C0123456789
  slack_label: "#my-channel"

workflows:
  code-review:
    trigger: merge_request                     # auto-trigger on MR events
    description: Performs AI-powered code review
    sandbox: k8spool                           # per-workflow sandbox override (optional)
    workflow_url: https://gitlab.com/org/group/project/-/blob/main/workflows/code-review.md
    trigger_rules:                             # ordered rule chain, first match wins
      - user_regex: "renovate\\[bot\\]"        # deny bot MRs (regex fullmatch)
        action: deny
      - branch_regex: "chore/.*"               # deny maintenance branches
        action: deny
      - action: allow                          # allow everything else
    prompt: workflows/code-review.md
    action: post_gitlab_note
    model: gemini-3.1-pro-preview  # or claude-sonnet-4-20250514
    max_iterations: 15
    max_inline_size: 200000                    # keep full diffs in context
    context_limit: 500000                      # Gemini 3.1 Pro / Claude Sonnet 4 have large context windows
    data_sources:
      gitlab:
        token_env: GITLAB_TOKEN_RO
    projects:
      org/group/project: {}

  analyze-failures:
    trigger: pipeline                          # auto-trigger on failed pipelines
    description: Investigates CI/CD pipeline failures
    sandbox: kubevirtpool                      # use VM sandbox for heavier workloads
    auto_resolve_on_push: true                 # resolve threads when a new SHA is pushed
    auto_resolve_on_success: true              # resolve threads when pipeline succeeds
    workflow_url: https://gitlab.com/org/group/project/-/blob/main/workflows/analyze-failures.md
    prompt: workflows/analyze-failures.md       # relative to config file dir
    action: post_gitlab_note
    model: gemini-3.1-pro-preview  # or claude-sonnet-4-20250514
    max_iterations: 50                          # per-workflow iteration override

    data_sources:                               # model tool tokens (read-only)
      gitlab:
        token_env: GITLAB_TOKEN_RO              # env var name, not the token
      konflux:
        cluster_url: https://example.com:6443/ns/my-tenant
        kubeconfig_env: KUBECONFIG
        kubearchive_url: https://kubearchive-api-server-product-kubearchive.apps.example.com
      testing_farm: {}

    projects:
      redhat/hummingbird/containers:
        tokens:                                 # per-project model token overrides
          gitlab: GITLAB_TOKEN_CONTAINERS_RO
        action_tokens:                           # per-workflow write tokens
          gitlab: HUMMINGBIRD_AGENT_ACTION_ANALYZE_FAILURES_GITLAB_TOKEN_CONTAINERS

With --workflow/--project, the workflow and project are looked up directly in the config. With --event-file, the project is extracted from the event body and matched against the project index to find applicable workflows.

Discussion threads

All workflow results are posted as discussion threads: a placeholder note starts the discussion and the full result is posted as a reply. The placeholder is never edited, so email notifications include the actual result text. For slash commands, the result is posted as a reply in the triggering discussion.

Auto-resolve

Workflows can opt in to automatic resolution of their discussion threads:

  • auto_resolve_on_push (default false): When a new commit is pushed to the MR (i.e. a merge_request event with action: update), all agent discussion threads for the workflow whose SHA differs from the new HEAD are resolved. This clears stale failure analyses when the developer pushes a fix.
  • auto_resolve_on_success (default false): When the head pipeline succeeds, all agent discussion threads for the workflow on that MR are resolved (regardless of SHA). This handles pipeline reruns on the same SHA where a transient failure is now green.

Both flags are independent and can be combined. Resolution runs before rate limit checks, so threads are resolved even if the workflow’s per-MR run limit has been reached.

Web search

Workflows can enable Anthropic’s built-in web search server tool by listing web_search as a data source:

workflows:
  renovate-babysit:
    model: claude-sonnet-4-20250514
    data_sources:
      gitlab: {}
      web_search: {}
    # ...

When web_search is present in data_sources, the web_search_20250305 server tool is included in API requests to Claude. The model decides autonomously when to search. Search execution happens server-side (no client-side tool dispatch), and results appear as server_tool_use / web_search_tool_result content blocks in the response. These blocks are preserved through session save/resume.

Unlike other data sources, web_search has no configuration options and does not register any orchestrator-side tools – it is handled entirely by the model provider.

If the API returns a pause_turn stop reason (server-side search loop hit its iteration limit), the adapter automatically re-sends the conversation to continue, up to 5 continuations per generate() call.

Web search is only supported with Claude models. The setting is ignored for Gemini.

Note footer

The first agent-authored note (placeholder) includes a footer with links to documentation, source code, the Slack channel, the workflow prompt, and a continuation prompt. These links are configured via global settings:

| Setting | Description |
|---|---|
| settings.docs_url | Link to agent documentation |
| settings.source_url | Link to agent source repository |
| settings.slack_url | Link to support Slack channel |
| settings.slack_label | Display text for Slack link (default: “Slack”) |

Per-workflow, set workflow_url to link to the workflow’s prompt file. The footer stays on the placeholder and the result is a separate reply.

Token separation

Tokens are split into three tiers that never mix:

  • Model tool tokens (in YAML data_sources / tokens): read-only tokens passed to the LLM’s tool calls. Declared in the config file as env var names. These are user-defined and resolved at runtime from the referenced env vars. Create as project access tokens with Reporter role and read_api scope. Reporter is the minimum role required to see internal (confidential) notes in the discussions tool.
  • Workflow action tokens (in YAML action_tokens, per-project): write tokens used by the orchestrator for per-workflow GitLab writes (notes, thread resolution). Each workflow gets a dedicated bot user per project, providing clear audit trails for which agent produced each note. Declared in the project config as env var names. Create as project access tokens with Developer role and api scope. These tokens are never exposed to the LLM and never enter the ToolRegistry. Naming convention: HUMMINGBIRD_AGENT_ACTION_<WORKFLOW>_GITLAB_TOKEN_<PROJECT>.
  • Orchestrator tokens (ORCHESTRATOR_* env vars, NOT in YAML): tokens used by the runner for operational reads (member access checks, push author lookup, head pipeline queries) and infrastructure notes (access-denied replies, rate-limit notices). Create as project access tokens with Developer role and read_api scope (or api for infrastructure notes). Developer role is required because the discussions tool trust-filters notes by author access level (>= Developer); if the orchestrator bot has only Reporter access, its own notes are redacted. Resolved by convention: ORCHESTRATOR_GITLAB_TOKEN_<MANGLED_PROJECT> (per-project) or ORCHESTRATOR_GITLAB_TOKEN (fallback). The ORCHESTRATOR_ prefix makes these impossible to confuse with model tokens.

Environment Variables

The agent reads only secrets and authentication from environment variables. All operational settings come from the config file’s settings section.

| Variable | Required | Default | Description |
|---|---|---|---|
| CONFIG_PATH | no | config.example.yaml | Path to config YAML |
| GOOGLE_API_KEY | yes* | - | Gemini API key; Gemini direct mode only (not for Claude) |
| GOOGLE_CLOUD_PROJECT | yes* | - | GCP project ID (Vertex AI mode); required for Claude models |
| ORCHESTRATOR_GITLAB_TOKEN | serve | - | Orchestrator GitLab token (global fallback) |
| ORCHESTRATOR_GITLAB_TOKEN_<PROJECT> | no | - | Per-project orchestrator token |
| SENTRY_DSN | no | - | Sentry DSN for error tracking |

*One of GOOGLE_API_KEY or GOOGLE_CLOUD_PROJECT is required. Claude models require Vertex AI (GOOGLE_CLOUD_PROJECT); GOOGLE_API_KEY is for Gemini direct mode only.

Model tool tokens (e.g. GITLAB_TOKEN_RO, GITLAB_TOKEN_CONTAINERS_RO) and data source credentials (e.g. KONFLUX_CLUSTER_URL, KUBECONFIG) are referenced by name in the config file’s data_sources and tokens sections. They are not listed in the table above because their names are user-defined.

Security and Design Constraints

The agent is designed to run in a shared OpenShift cluster without cluster-admin access, processing potentially untrusted merge requests. These constraints shaped the architecture:

Sandbox isolation. All arbitrary commands executed by the LLM run inside an ephemeral container, never on the host:

  • Podman (local): --network=none, --user 65532, no host mounts. Complete network isolation.
  • Kubernetes (production): Pods comply with OpenShift’s restricted-v2 Security Context Constraint: runAsNonRoot, seccompProfile: RuntimeDefault, allowPrivilegeEscalation: false, capabilities.drop: ["ALL"], automountServiceAccountToken: false (no K8s API access from sandbox), activeDeadlineSeconds (configurable, default 1800). Security context fields are hardcoded in the pod manifest for portability to vanilla Kubernetes with Pod Security Admission (restricted level). Resource requests/limits, metadata, and activeDeadlineSeconds are configurable via settings.sandbox in the config file. Network access is denied by a NetworkPolicy on the sandbox namespace that blocks all egress from all pods (podSelector: {}).

No cluster-admin required. The agent operates with namespace-scoped permissions only. The orchestrator’s ServiceAccount needs only:

  • pods: create, get, list, delete, patch – sandbox pod lifecycle and pool claims
  • pods/exec: create – command execution via kubectl exec
  • virtualmachineinstances.kubevirt.io: create, get, list, delete, patch – KubeVirt VMI lifecycle (kubevirt/kubevirtpool modes only)
  • secrets: create, get, delete – SSH key Secrets for VMI provisioning (kubevirt/kubevirtpool modes only)

These permissions are granted via a Role in the sandbox namespace, not the orchestrator’s own namespace. No CRDs, no custom runtimes, no cluster-scoped resources. Konflux data is fetched via bearer token from kubeconfig, not from inside the cluster.

Namespace separation. Sandbox pods are created in a dedicated namespace, separate from the orchestrator. This limits blast radius: even if a sandbox pod is compromised, it has no visibility into the orchestrator’s Secrets, Pods, or ServiceAccount tokens. The sandbox namespace is locked down with standard K8s resources:

  • RBAC: Role + RoleBinding scoped to the namespace, granting only the permissions above to the orchestrator’s ServiceAccount
  • NetworkPolicy: uses podSelector: {} to select all pods in the dedicated sandbox namespace, denying all egress (egress: []). The sandbox cannot reach the internet, the K8s API, or other pods.
  • activeDeadlineSeconds: sandbox pods self-terminate after the configured timeout (default 1800s / 30 minutes) even if the orchestrator crashes or is killed, preventing orphaned pods

Credential separation. Data source credentials (GitLab tokens, kubeconfig) live in the orchestrator process only, injected via K8s Secrets. The sandbox container has no credentials, no SA token (automountServiceAccountToken: false), and no network access. Data flows into the sandbox via stdin piping through write_file.

Command execution via kubectl exec. The K8s sandbox uses a hybrid approach: the Kubernetes Python client manages pod lifecycle (create, wait, delete), while kubectl exec handles command execution. This avoids the complexity and reliability issues of the websocket-based exec API.

KubeVirt VM isolation. KubeVirt sandbox VMs run as root inside the guest OS, but the VM itself is contained by the KubeVirt hypervisor (QEMU/KVM). The VM has no access to the Kubernetes API, no ServiceAccount token, and no credentials. SSH keys are ephemeral (generated per sandbox start for direct mode, or shared per pool for pool mode) and cleaned up with the VMI.

Sandbox Backends

| | Podman (local) | K8s (direct) | K8sPool (production) | KubeVirt (direct) | KubeVirtPool |
|---|---|---|---|---|---|
| Start | podman run -d --network=none | create_namespaced_pod | Claim standby pod from Deployment | Create VMI + SSH Secret | Claim standby VMI from VMIRS |
| Exec | podman exec | kubectl exec | kubectl exec | ssh | ssh |
| Auth | Local Podman socket | In-cluster SA or kubeconfig | In-cluster SA or kubeconfig | In-cluster SA or kubeconfig | In-cluster SA or kubeconfig |
| Network | None (--network=none) | None (deny-all NetworkPolicy) | None (deny-all NetworkPolicy) | Cluster pod network (SSH) | Cluster pod network (SSH) |
| User | 65532 (fixed) | Namespace UID range (SCC) | Namespace UID range (SCC) | root (inside VM) | root (inside VM) |
| Cleanup | podman rm -f | delete_namespaced_pod | delete_namespaced_pod | Delete VMI + Secret + temp keys | Delete VMI |

All five implement the Sandbox protocol: start(), exec(), write_file(), read_file(), cleanup(), linger().
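One way to picture this protocol is as a `typing.Protocol`. The six method names come from the list above; the parameters, return types, the `ExecResult` shape, and the in-memory test double are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Protocol, runtime_checkable


@dataclass
class ExecResult:
    """Hypothetical result type; the real one may differ."""
    exit_code: int
    stdout: str
    stderr: str


@runtime_checkable
class Sandbox(Protocol):
    def start(self) -> None: ...
    def exec(self, command: str, timeout: float = 60.0) -> ExecResult: ...
    def write_file(self, path: str, data: bytes) -> None: ...
    def read_file(self, path: str) -> bytes: ...
    def cleanup(self) -> None: ...
    def linger(self, seconds: int) -> None: ...


@dataclass
class InMemorySandbox:
    """Test double that satisfies the protocol without running anything."""
    files: Dict[str, bytes] = field(default_factory=dict)
    started: bool = False
    linger_seconds: Optional[int] = None

    def start(self) -> None:
        self.started = True

    def exec(self, command: str, timeout: float = 60.0) -> ExecResult:
        return ExecResult(exit_code=0, stdout="", stderr="")

    def write_file(self, path: str, data: bytes) -> None:
        self.files[path] = data

    def read_file(self, path: str) -> bytes:
        return self.files[path]

    def cleanup(self) -> None:
        self.files.clear()
        self.started = False

    def linger(self, seconds: int) -> None:
        self.linger_seconds = seconds
```

Because the protocol is structural, a backend does not need to inherit from `Sandbox`; providing the six methods is enough for callers to treat it as one.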

The pod pool backend (k8spool) eliminates pod startup latency by claiming pre-warmed pods from a Kubernetes Deployment. Claimed pods are detached from the ReplicaSet and the Deployment automatically creates replacements. After a successful workflow, pool pods linger for linger_seconds (default 300, configurable, 0 to disable) so that user replies can reuse the same pod without re-creating it or restoring from S3. The reaper runs once per workflow execution and deletes expired lingering pods. It also evicts excess lingering pods beyond max_lingering_pods (default 2), starting with those closest to their deadline. See the design doc (section 8.8) for details.
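The reaper's selection rule can be sketched as a pure function. The `LingeringPod` type and the function name are hypothetical; the sketch assumes each lingering pod carries an absolute deadline and that "closest to their deadline" means the earliest deadline is evicted first.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LingeringPod:
    name: str
    deadline: float  # epoch seconds at which the linger period ends


def select_pods_to_reap(
    pods: Iterable[LingeringPod],
    now: float,
    max_lingering_pods: int = 2,
) -> List[LingeringPod]:
    """Return expired pods plus any excess pods, earliest deadline first."""
    pods = list(pods)
    expired = [p for p in pods if p.deadline <= now]
    # Pods still within their linger window, ordered by how soon they expire.
    alive = sorted((p for p in pods if p.deadline > now), key=lambda p: p.deadline)
    excess = max(0, len(alive) - max_lingering_pods)
    return expired + alive[:excess]
```

With four lingering pods, one of them already expired, and the default cap of 2, the function returns the expired pod and the live pod with the earliest deadline.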

The KubeVirt backends (kubevirt, kubevirtpool) provide full VM isolation using KubeVirt VirtualMachineInstances. The VM boots from a containerDisk image, and SSH keys are injected via a Kubernetes Secret volume (the VM image’s inject-ssh-keys.service reads from /dev/disk/by-id/virtio-ssh-pubkeys). Command execution uses SSH instead of kubectl exec. The kubevirtpool backend claims pre-warmed VMIs from a VirtualMachineInstanceReplicaSet, with the same claim/reap/linger semantics as the pod pool.

The sandbox backend can be set per-workflow via the sandbox: field in the workflow config. The resolution order is: workflow config > CLI --sandbox > default.
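The backend resolution order is simple enough to express in a few lines. The function name and the idea of passing the default in as a parameter are assumptions; the source does not state which backend is the default.

```python
from typing import Optional


def resolve_sandbox_backend(
    workflow_sandbox: Optional[str],
    cli_sandbox: Optional[str],
    default: str,
) -> str:
    """Pick the backend: workflow config first, then CLI --sandbox, then default."""
    for candidate in (workflow_sandbox, cli_sandbox):
        if candidate:  # None and empty strings both count as "not set"
            return candidate
    return default
```

For example, a workflow that sets `sandbox: kubevirtpool` keeps that backend even when the agent is started with `--sandbox k8spool`.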

Data Sources

Data sources are registered as tool-calling functions. The model invokes them by name; the orchestrator executes them and returns results (or auto-spills large responses to the sandbox).

GitLab

| Tool | Description |
| --- | --- |
| gitlab_get_mr_details | MR metadata (title, author, state, SHA, labels) |
| gitlab_get_mr_unified_diff | Complete unified diff in patch format |
| gitlab_get_mr_diff | Per-file structured change data |
| gitlab_get_mr_commits | List of commits in a merge request |
| gitlab_get_mr_discussions | Discussion threads with redacted agent transcripts and trust-filtered comments |
| gitlab_get_commit_statuses | CI/CD pipeline statuses for a commit |
| gitlab_get_file_at_ref | Raw file content at a git ref |
| gitlab_get_repo_archive | Repository tar.gz (binary, auto-spilled) |
| gitlab_get_job_log | CI job trace output (ANSI codes stripped) |
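As an illustration of the ANSI stripping applied to job logs, the sketch below removes CSI escape sequences (colors, cursor movement, line erasure) with a regular expression. It is one common approach and not necessarily the one the agent uses; the function name is hypothetical.

```python
import re

# CSI sequence: ESC [ then parameter bytes, intermediate bytes, and a final byte.
_ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI escape sequences from log text."""
    return _ANSI_CSI.sub("", text)
```

Stripping these sequences before the log reaches the model avoids spending context tokens on terminal formatting.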

Konflux

Fetches Tekton PipelineRuns and TaskRuns from both the live K8s API and Kubearchive (for completed resources), with deduplication by UID.

| Tool | Description |
| --- | --- |
| konflux_list_pipelineruns | All PipelineRuns for a commit SHA |
| konflux_list_taskruns | All TaskRuns for a commit SHA |
| konflux_list_pods | All pods for a commit SHA |
| konflux_get_pod | Full pod resource (spec, status, conditions, container statuses) |
| konflux_get_pod_log | Pod logs; optional container param, fetches all containers when omitted |

Response metadata includes konflux_ui base URL for building reviewer-facing links.
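The deduplication by UID used when merging live and archived Konflux resources might look like the sketch below. It assumes resources are dictionaries in the usual Kubernetes shape (`metadata.uid`) and that the live copy wins when both sources return the same UID; the source does not say which copy is preferred.

```python
from typing import Dict, Iterable, List


def merge_by_uid(live: Iterable[dict], archived: Iterable[dict]) -> List[dict]:
    """Merge two resource lists, keeping one entry per metadata.uid."""
    merged: Dict[str, dict] = {}
    # Archived entries go in first so that live entries overwrite them.
    for resource in list(archived) + list(live):
        merged[resource["metadata"]["uid"]] = resource
    return list(merged.values())
```

A PipelineRun that is still present in the cluster and already archived then appears once in the result.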

Testing Farm

| Tool | Description |
| --- | --- |
| tf_get_results | JUnit XML results for a request ID |
| tf_get_test_log | Individual test log by URL (restricted to Testing Farm artifact URLs) |
| tf_get_request_status | Request state, queue/run times |

Response metadata includes artifacts_base URL for building artifact links.
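The URL restriction on tf_get_test_log could be enforced with a host allowlist check along these lines. The host name below is a placeholder, and both the HTTPS-only rule and the function name are assumptions; the actual check may be stricter or compare against artifacts_base instead.

```python
from typing import FrozenSet
from urllib.parse import urlsplit

# Placeholder host for illustration only; not the real Testing Farm artifact host.
ALLOWED_ARTIFACT_HOSTS: FrozenSet[str] = frozenset({"artifacts.example.test"})


def is_allowed_artifact_url(
    url: str,
    allowed_hosts: FrozenSet[str] = ALLOWED_ARTIFACT_HOSTS,
) -> bool:
    """Accept only HTTPS URLs whose host is exactly one of the allowed hosts."""
    parts = urlsplit(url)
    # hostname excludes userinfo and port, so "allowed@evil" tricks do not match.
    return parts.scheme == "https" and parts.hostname in allowed_hosts
```

Comparing the parsed hostname, not a string prefix, keeps a model-supplied URL from steering the orchestrator to an arbitrary server.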

Workflow System

Workflows are .md files whose content becomes the LLM system prompt verbatim; the file contains nothing but prompt text. Workflow metadata (action, model, max_iterations, enabled data sources, project allowlists) is defined in the config file.

Available workflows:

  • analyze-failures.md - Investigates CI/CD pipeline failures by fetching MR details, identifying failed pipelines via commit statuses, retrieving PipelineRuns/TaskRuns from Konflux, analyzing test results from Testing Farm, and producing a grouped root-cause report with reviewer-facing URLs.
  • code-review.md - Performs AI-powered code review by fetching MR details, the unified diff, and prior discussion threads in parallel, then producing structured feedback with severity ratings, code examples, and actionable suggestions. On follow-up reviews (after SHA updates), the agent sees its own previous findings, developer responses, and resolved threads – avoiding duplicate findings and respecting developer explanations. Uses elevated max_inline_size (200 KB) and context_limit (500K tokens) to keep the full diff in context.
  • renovate-babysit.md - Triages a single Renovate-authored MR, triggered by a successful pipeline. Classifies the MR as safe or risky based on diff scope, upstream dependency changes (from the MR description and web search), and local codebase impact (via gitlab_get_repo_archive + grep). Posts a structured note with a verdict, an upstream change summary, and, for risky MRs, suggested actions. Requires a Claude model with the web_search data source.

Token Budget Management

Both Gemini and Claude benefit from prompt caching that reduces the effective cost of full history replay. Cached token counts from both providers feed into the estimate_cost() calculation. See Agent Model Loop – Prompt caching for details on how caching works per provider.

The agent uses a dual-limit approach instead of a cumulative token budget:

  • Iteration limit (settings.max_iterations, default 30, overridable per-workflow) - Hard cap on tool-calling rounds. A wrap-up prompt is injected at 80% (SOFT_ITERATION_RATIO).
  • Context limit (CONTEXT_LIMIT, default 60,000 tokens, overridable per-workflow via context_limit) - Per-call input token ceiling. When exceeded, a wrap-up prompt forces the model to finalize.
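The dual-limit decision can be sketched as a single function evaluated before each model call. The function name, the returned labels, and the exact comparison operators are assumptions; the defaults mirror the values given above.

```python
def next_action(
    iteration: int,
    input_tokens: int,
    max_iterations: int = 30,
    soft_ratio: float = 0.8,
    context_limit: int = 60_000,
) -> str:
    """Return "continue", "wrap_up", or "stop" for the upcoming round."""
    if iteration >= max_iterations:
        return "stop"  # hard cap on tool-calling rounds
    soft_cap = int(max_iterations * soft_ratio)  # 24 with the defaults
    if input_tokens > context_limit or iteration >= soft_cap:
        return "wrap_up"  # inject the wrap-up prompt so the model finalizes
    return "continue"
```

Neither limit tracks cumulative spend: the iteration cap bounds the number of rounds, and the context limit bounds the size of any single call.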

Large outputs are automatically redirected to sandbox files to keep the LLM context small. The spill threshold defaults to 4 KB (MAX_INLINE_SIZE) but can be overridden per-workflow via max_inline_size in the config:

  • sandbox_exec - stdout/stderr exceeding the threshold saved to /tmp/_out/{N}.txt; model receives a preview (head + tail) with file path
  • Data sources - text exceeding the threshold saved to /tmp/_out/{name}_{N}.txt with preview; binary data saved to .bin
  • fetch_to_sandbox - always writes to the caller-specified path; returns metadata only
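A minimal sketch of the spill decision and the head + tail preview follows. The function name, return shape, and preview wording are assumptions; the byte sizes mirror the documented defaults. The caller is expected to write the returned bytes to the sandbox path.

```python
from typing import Optional, Tuple

MAX_INLINE_SIZE = 4096
OUTPUT_PREVIEW_BYTES = 4096
OUTPUT_TAIL_BYTES = 512


def spill_or_inline(
    data: bytes,
    spill_path: str,
    max_inline_size: int = MAX_INLINE_SIZE,
    preview_bytes: int = OUTPUT_PREVIEW_BYTES,
    tail_bytes: int = OUTPUT_TAIL_BYTES,
) -> Tuple[str, Optional[bytes]]:
    """Return (text for the model, bytes to save in the sandbox or None)."""
    if len(data) <= max_inline_size:
        return data.decode("utf-8", errors="replace"), None

    head = data[:preview_bytes]
    # Start the tail after the head so the two never overlap.
    tail = data[max(preview_bytes, len(data) - tail_bytes):]
    omitted = len(data) - len(head) - len(tail)
    preview = (
        head.decode("utf-8", errors="replace")
        + f"\n... [{omitted} bytes omitted; full output saved to {spill_path}] ...\n"
        + tail.decode("utf-8", errors="replace")
    )
    return preview, data
```

The model can then run jq, grep, or python3 against the saved file instead of carrying the whole output in context.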

All tuning constants are centralized in config.py:

| Constant | Default | Purpose |
| --- | --- | --- |
| DEFAULT_MAX_ITERATIONS | 30 | Hard iteration cap |
| SOFT_ITERATION_RATIO | 0.8 | Inject wrap-up at this fraction |
| CONTEXT_LIMIT | 60,000 | Per-call input token ceiling |
| OUTPUT_PREVIEW_BYTES | 4,096 | Preview size for spilled outputs |
| OUTPUT_TAIL_BYTES | 512 | Extra tail appended to previews |
| MAX_INLINE_SIZE | 4,096 | Max inline size for data source responses |
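Per-workflow overrides of these defaults could be resolved roughly as follows. The `WorkflowLimits` type, the `resolve_limits` function, and the flat override dictionary are hypothetical; the real config layout may differ. The example values for code-review assume that 200 KB means 200 × 1024 bytes.

```python
from dataclasses import dataclass
from typing import Mapping

DEFAULT_MAX_ITERATIONS = 30
CONTEXT_LIMIT = 60_000
MAX_INLINE_SIZE = 4_096


@dataclass(frozen=True)
class WorkflowLimits:
    max_iterations: int
    context_limit: int
    max_inline_size: int


def resolve_limits(overrides: Mapping[str, int]) -> WorkflowLimits:
    """Apply a workflow's overrides on top of the global defaults."""
    return WorkflowLimits(
        max_iterations=overrides.get("max_iterations", DEFAULT_MAX_ITERATIONS),
        context_limit=overrides.get("context_limit", CONTEXT_LIMIT),
        max_inline_size=overrides.get("max_inline_size", MAX_INLINE_SIZE),
    )


# Overrides in the spirit of the code-review workflow described earlier.
code_review = resolve_limits({"context_limit": 500_000, "max_inline_size": 200 * 1024})
```

A workflow that sets nothing gets the defaults from the table unchanged.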

Development

See the main README for development workflows.

make hummingbird-agent/setup  # Install dependencies
make check                    # Lint code (ruff)
make fmt                      # Format code
make test                     # Run unit tests
make coverage                 # Run tests with coverage

License

This project is licensed under the GNU General Public License v3.0 or later - see the LICENSE file for details.