Hummingbird Agent Design

Architectural design document for the hummingbird-agent. Covers the reasoning behind every major design choice so that future changes can be made safely, without accidentally violating invariants that hold the system together.

For operational usage (CLI, config reference, deployment), see Hummingbird Agent. For the model loop wire format, see Agent Model Loop.

1. Design Philosophy

Five principles shaped the agent’s architecture. Every component traces back to at least one of these.

Security by default. The agent processes untrusted merge requests. All LLM-driven commands run inside isolated sandbox containers with no network and no credentials. Orchestrator write tokens and model read tokens live in separate code paths that never cross. No cluster-admin is required.

Config over code. Investigation logic lives in markdown workflow files that become the LLM system prompt verbatim – changing the analysis strategy requires no code change and no redeployment. Operational settings (data sources, project allowlists, iteration limits) live in a YAML config file that is auditable and committable. Only secrets live in environment variables.

Bounded cost. Every output path has a size cap. Large tool outputs are auto-spilled to sandbox files with only a preview returned to the model. Each session has both an iteration limit and a per-call context token limit with two-tier escalation (soft warning, then hard stop with tool disabling). The relationship between these constants is documented and centralized.

Partial over nothing. When some data is unavailable (expired pod logs, unreachable Konflux cluster, Testing Farm 404), the agent continues with whatever data it has and notes the gap in its output. A partial report is more valuable than a crash.

Model-agnostic agent loop. The agent loop (agent.py) does not inspect the internal structure of the contents list. It appends raw_content from model responses and make_tool_responses() output without looking inside. Each model backend owns its wire format. This makes it possible to add new model backends (Claude, GPT) without touching the agent loop.

2. System Architecture

2.1. Component overview

flowchart TB
    subgraph input [Event Sources]
        SQS["SQS Queue<br/>(gitlab::pipeline, gitlab::note)"]
        CLI["CLI<br/>(--event / --event-file)"]
    end

    subgraph orchestrator [Orchestrator Process]
        Events["events.py<br/>SQS consumer"]
        Runner["runner.py<br/>Event routing"]
        Agent["agent.py<br/>Model loop"]
        Tools["tools.py<br/>Tool registry"]
        Actions["actions.py<br/>GitLab notes"]
        Sessions["sessions.py<br/>Persistence"]
        WfConfig["workflow_config.py<br/>YAML config"]
    end

    subgraph models [Model Backends]
        Gemini["models/gemini.py<br/>Gemini API / Vertex AI"]
        Anthropic["models/anthropic.py<br/>Anthropic via Vertex AI"]
    end

    subgraph sandbox [Sandbox Container]
        SB["PodmanSandbox / K8sSandbox<br/>jq, python3, yq"]
        SpillFiles["/tmp/data/_out/ spill files"]
        DataFiles["/tmp/data/ working files"]
    end

    subgraph dataSources [Data Source Modules]
        GL["gitlab.py"]
        KX["konflux.py"]
        TF["testing_farm.py"]
    end

    subgraph external [External APIs]
        GitLabAPI["GitLab API"]
        KonfluxAPI["K8s + Kubearchive"]
        TFAPI["Testing Farm API"]
    end

    subgraph storage [Storage]
        S3["S3 Sessions"]
        MRNote["GitLab MR Notes"]
    end

    SQS --> Events --> Runner
    CLI --> Runner
    Runner --> Agent
    Agent --> Tools
    Tools --> SB
    Tools --> dataSources
    dataSources --> external
    Agent --> Gemini
    Agent --> Anthropic
    Runner --> Actions --> MRNote
    Runner --> Sessions --> S3
    Runner --> WfConfig

2.2. Request lifecycle

A complete run proceeds through these stages:

  1. Event arrival. An SQS message (production) or CLI invocation (dev) provides a GitLab project path and MR IID.

  2. Config lookup. workflow_config.py maps the project to one or more workflows via the project_index. Each match yields a WorkflowConfig and ProjectEntry with data source declarations and per-project settings.

  3. Rate-limit check and SHA dedup. actions.scan_agent_threads() scans MR discussions in a single pass, parsing JSON session markers to determine the per-workflow thread count and whether the current SHA has already been reviewed. If the SHA was already reviewed, the workflow is skipped. If the thread count meets or exceeds max_runs_per_mr, the workflow is skipped. Slash commands and replies bypass this check.

  4. Placeholder note. actions.create_placeholder_note() posts a placeholder so the reviewer knows analysis is in progress. The note contains a JSON session marker (<!-- hummingbird-session: {"id":"UUID","wf":"name","sha":"abc"} -->) for rate limiting, SHA dedup, and session resumption.

  5. Sandbox start. sandbox.create_sandbox() starts a Podman container or K8s pod. /tmp/data/ is pre-created for the model’s use.

  6. Data source registration. data_sources.register_selected() resolves tokens and URLs from the config and registers tool functions on the ToolRegistry. Only data sources declared in the workflow config are registered.

  7. Agent loop. agent.run_agent_loop() runs the model loop: the workflow markdown becomes the system prompt, tool definitions are provided, and the model iterates calling tools and producing text until it emits a final response or hits a budget limit.

  8. Note update. The final text, wrapped with a session marker and reply prompt, replaces the placeholder note.

  9. Session save. Conversation history (contents), transcript, and sandbox archive are saved to S3 (production) or a local directory (dev).

  10. Sandbox cleanup. The container/pod is deleted. K8s pods also have a configurable activeDeadlineSeconds backstop (default 1800s) in case the orchestrator dies.

2.3. Architectural boundaries

The codebase is organized around four strict boundaries:

Runner (runner.py) is the orchestration layer. It owns event routing, the placeholder/update note lifecycle, session save/load, and sandbox lifecycle. It calls agent.run_agent_loop() but never reaches into the agent’s internals.

Agent (agent.py) is the model loop. It knows about the model interface, tool definitions, and the contents list, but nothing about GitLab, SQS, sessions, or actions. It returns (text, usage, transcript, contents) and is completely unaware of what happens with those values.

Tools (tools.py) bridge the agent and the sandbox/data-sources. The agent calls tool_registry.execute(tool_call) and gets a dict back. It never calls sandbox methods directly. This indirection is what enables auto-spill: the tool registry can transparently save large outputs to sandbox files and return previews.

Models (models/) own wire format conversion. Each model adapter’s generate() accepts internal types (contents, ToolDef) and returns a ModelResponse. make_user_content() and make_tool_responses() produce the model-specific dicts that go into contents (Gemini and Anthropic backends each implement the ModelAdapter protocol). The agent treats these as opaque values – it appends them but never inspects their internal structure.

3. Module Architecture

3.1. Dependency graph

flowchart TD
    main["__main__.py"]
    config_mod["config.py"]
    wf_config["workflow_config.py"]
    events_mod["events.py"]
    runner_mod["runner.py"]
    agent_mod["agent.py"]
    workflow_mod["workflow.py"]
    tools_mod["tools.py"]
    sandbox_mod["sandbox.py"]
    actions_mod["actions.py"]
    sessions_mod["sessions.py"]
    transcript_mod["transcript.py"]
    ds_init["data_sources/__init__.py"]
    ds_gitlab["data_sources/gitlab.py"]
    ds_konflux["data_sources/konflux.py"]
    ds_tf["data_sources/testing_farm.py"]
    http_mod["_http.py"]
    config_watch_mod["config_watch.py"]
    models_init["models/__init__.py"]
    models_types["models/_types.py"]
    models_gemini["models/gemini.py"]
    models_anthropic["models/anthropic.py"]

    main --> config_mod
    main --> config_watch_mod
    main --> wf_config
    main --> events_mod
    main --> runner_mod
    main --> sessions_mod
    main --> transcript_mod
    main --> actions_mod

    config_watch_mod --> config_mod
    config_watch_mod --> wf_config

    runner_mod --> actions_mod
    runner_mod --> agent_mod
    runner_mod --> config_mod
    runner_mod --> ds_init
    runner_mod --> sandbox_mod
    runner_mod --> sessions_mod
    runner_mod --> tools_mod
    runner_mod --> workflow_mod
    runner_mod --> wf_config
    runner_mod --> transcript_mod

    agent_mod --> config_mod
    agent_mod --> models_init
    agent_mod --> tools_mod

    tools_mod --> config_mod
    tools_mod --> models_init
    tools_mod --> sandbox_mod

    workflow_mod --> config_mod
    workflow_mod --> models_init

    ds_init --> tools_mod
    ds_init --> wf_config
    ds_init --> ds_gitlab
    ds_init --> ds_konflux
    ds_init --> ds_tf

    ds_konflux --> http_mod

    sessions_mod --> sandbox_mod

    models_init --> models_types
    models_init --> models_gemini
    models_init --> models_anthropic
    models_gemini --> http_mod
    models_gemini --> models_types
    models_anthropic --> http_mod
    models_anthropic --> models_types

3.2. Module responsibilities

Each module has a single, well-defined responsibility. The boundary rules below are invariants – violating them breaks the security model or the separation of concerns.

config.py – Runtime configuration and constants

  • Owns: Config dataclass, estimate_cost(), all tuning constants (DEFAULT_MAX_ITERATIONS, CONTEXT_LIMIT, OUTPUT_PREVIEW_BYTES, etc.)
  • Boundary: Pure data. No I/O except reading env vars in Config.load(). No imports from other hummingbird modules.
  • Invariant: All interdependent budget/spill constants must be defined here with their relationship documented in the comment block.

workflow_config.py – YAML config loader and token resolution

  • Owns: AgentConfig, WorkflowConfig (with trigger, description, ignore_users, ignore_branches fields), ProjectEntry, DataSourceConfig dataclasses; load(), validate_env(), resolve_tool_token(), resolve_cluster_url(), get_prompt().
  • Boundary: Reads YAML from disk and env vars. Never instantiates network clients or model objects.
  • Invariant: resolve_tool_token() resolves model tool tokens only. It must never return orchestrator tokens. The resolution chain is: project_entry.tokens[ds] -> workflow_cfg.data_sources[ds].token_env -> "" (empty).

events.py – SQS consumer

  • Owns: decode_sns_message() (SNS envelope decoding), poll_loop() (blocking SQS consumer with concurrency control).
  • Boundary: Knows about SQS/SNS wire formats. Calls a generic EventHandler callback. Does not know about GitLab, workflows, or the agent.
  • Invariant: Failed messages are not deleted from SQS (they go to the DLQ after visibility timeout expires). Successful messages are deleted after handler returns.
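
The delete-on-success rule can be sketched with the SQS client injected for testability (function name and structure are illustrative; the real consumer is events.poll_loop()):

```python
def poll_once(sqs, queue_url: str, handler) -> int:
    """Receive a batch, run the handler, and delete only what succeeded.
    `sqs` is any boto3-compatible SQS client. A failed handler leaves the
    message in the queue; it reappears after the visibility timeout and
    eventually lands in the DLQ."""
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    handled = 0
    for msg in resp.get("Messages", []):
        try:
            handler(msg["Body"])
        except Exception:
            continue  # no delete: SQS redelivers, then routes to the DLQ
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
        handled += 1
    return handled
```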

config_watch.py – Background config hot-reload

  • Owns: ConfigHolder (thread-safe config pair with atomic swap), start_watcher() (daemon thread that polls config file mtime), _watch_loop().
  • Boundary: Knows about config.Config.load() and workflow_config.load(). Does not know about events, the agent, or any runtime state.
  • Invariant: On reload failure (parse error, missing env var), the previous config is kept and a warning is logged. The watcher never crashes the serve loop.
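
The keep-old-on-failure invariant reduces to an atomic swap guarded by a try/except. A sketch of the assumed shape (the real class is config_watch.ConfigHolder):

```python
import threading

class ConfigHolder:
    """Thread-safe (Config, AgentConfig) pair with atomic swap."""
    def __init__(self, cfg, agent_cfg):
        self._lock = threading.Lock()
        self._pair = (cfg, agent_cfg)

    def get(self):
        with self._lock:
            return self._pair

    def try_reload(self, loader) -> bool:
        try:
            new_pair = loader()  # may raise: parse error, missing env var
        except Exception:
            return False         # keep the previous config; caller logs a warning
        with self._lock:
            self._pair = new_pair
        return True
```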

runner.py – Event routing and workflow execution

  • Owns: handle_event(), handle_pipeline(), handle_merge_request(), handle_note(), _execute_workflow(), _handle_reply(), run_workflow(), _acquire_sandbox(), _resolve_slash_workflow(), _format_help(), _is_ignored_user(), WorkflowRequest, WorkflowResult, SandboxOpts.
  • Boundary: Orchestrates everything: config lookup, rate limiting, placeholder notes, sandbox lifecycle, agent invocation, session save. This is the only module that touches both actions and agent.
  • Invariant: run_workflow() either lingers the sandbox (success) or cleans it up (exception).

agent.py – Model-agnostic tool-calling loop

  • Owns: run_agent_loop(), system prompt constants (BASE_SYSTEM_PROMPT, CONTINUATION_PROMPT, warning/nudge strings), budget logic.
  • Boundary: Knows about models (the ModelAdapter protocol: generate() and content construction) and tools (for execute()). Does not know about GitLab, SQS, sessions, or actions. Does not know which model backend is in use.
  • Invariant: The contents list is treated as opaque. Items are appended via response.raw_content and model.make_tool_responses(). The agent never inspects, modifies, or deletes items in the list (except for empty response retries, which pop() the last appended item before the model has seen any tool results for it).
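
The opaque-contents rule can be condensed to a few lines (a sketch with assumed shapes; method names follow the ModelAdapter protocol described above, and the empty-response retry path is omitted):

```python
def run_loop_sketch(model, registry, contents, max_iterations):
    """The agent appends whatever the adapter hands back and never
    looks inside the items."""
    for _ in range(max_iterations):
        response = model.generate(contents)
        contents.append(response.raw_content)  # opaque to the agent
        if not response.tool_calls:
            return response.text               # final answer
        results = [registry.execute(call) for call in response.tool_calls]
        contents.append(model.make_tool_responses(response.tool_calls, results))
    return None  # iteration budget exhausted
```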

tools.py – Tool registry

  • Owns: ToolRegistry class with sandbox_exec, fetch_to_sandbox, fetch_batch_to_sandbox built-in tools; data source registration and dispatch; auto-spill logic.
  • Boundary: Owns the sandbox reference and all tool execution. The agent never touches the sandbox directly.
  • Invariant: _spill_counter is shared across all spill paths (sandbox exec stdout, sandbox exec stderr, data source auto-spill) to prevent filename collisions in /tmp/data/_out/.
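
A sketch of the spill decision (constant value, file layout, and return keys are illustrative; the real logic and constants live in tools.py and config.py):

```python
OUTPUT_PREVIEW_BYTES = 2048  # illustrative; the real cap is in config.py

class Spiller:
    """Large output goes to a file under /tmp/data/_out/; the model
    receives only a preview plus the path."""
    def __init__(self, sandbox):
        self._sandbox = sandbox
        self._spill_counter = 0  # one counter for every spill path

    def maybe_spill(self, output: bytes) -> dict:
        if len(output) <= OUTPUT_PREVIEW_BYTES:
            return {"output": output.decode(errors="replace")}
        self._spill_counter += 1
        path = f"/tmp/data/_out/spill_{self._spill_counter:03d}.txt"
        self._sandbox.write_file(path, output)
        return {"spilled_to": path,
                "total_bytes": len(output),
                "preview": output[:OUTPUT_PREVIEW_BYTES].decode(errors="replace")}
```

Because the counter is an attribute of the one registry-owned instance, stdout, stderr, and data source spills can never race to the same filename.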

sandbox.py – Sandbox backends

  • Owns: Sandbox protocol, PodmanSandbox, K8sSandbox, K8sPoolSandbox, SandboxPool, create_sandbox(), ExecResult.
  • Boundary: Knows about container/pod lifecycle and command execution. Does not know about tools, models, or the agent.
  • Invariant: All backends must implement the same 8-method protocol (start, exec, write_file, stream_to_file, stream_exec, read_file_iter, cleanup, linger). exec() accepts optional stdin_data for piping raw bytes. All must pre-create /tmp/data/ in start(). All must use stdin piping for write_file() (never host volume mounts).

actions.py – GitLab note lifecycle

  • Owns: Orchestrator token resolution (resolve_orchestrator_token()), note CRUD, JSON marker parsing (marker_tag, parse_marker), scan_agent_threads (per-workflow rate limiting + SHA dedup), post_simple_reply, member access checks, session-for-reply lookup, AgentThreadInfo, SessionRef.
  • Boundary: Uses orchestrator tokens exclusively. Never touches model tool tokens, the tool registry, or the agent.
  • Invariant: Orchestrator tokens are resolved from ORCHESTRATOR_* env vars only, never from the YAML config. The ORCHESTRATOR_ prefix makes them structurally impossible to confuse with model tokens.

sessions.py – Session persistence

  • Owns: archive_sandbox(), save_local(), save_s3(), load_s3(), load_local(), restore_sandbox(), SessionData (including format_version for canonical vs legacy session JSON).
  • Boundary: Knows about sandbox (for archiving) and S3/filesystem (for storage). Does not know about the agent, tools, or models. Load/save detects canonical envelope (format_version: 1) vs legacy raw contents lists.
  • Invariant: The sandbox archive is never extracted on the orchestrator. It is created inside the sandbox (tar czf), streamed out via read_file_iter(), and restored inside a new sandbox via stream_exec("tar xzf -").
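
The transport-only invariant looks roughly like this (a sketch; paths and helper names are illustrative, the tar commands follow the description above):

```python
def archive_sandbox_sketch(sandbox) -> bytes:
    """The tarball is created inside the sandbox and streamed out;
    the orchestrator only carries the bytes, never extracts them."""
    sandbox.exec(["sh", "-c", "tar czf /tmp/session.tgz -C /tmp data"])
    return b"".join(sandbox.read_file_iter("/tmp/session.tgz"))

def restore_sandbox_sketch(sandbox, archive: bytes) -> None:
    # Extraction also happens inside the (new) sandbox, fed via stdin.
    sandbox.stream_exec(["sh", "-c", "tar xzf - -C /tmp"], iter([archive]))
```

This keeps a malicious archive (e.g. one crafted with path-traversal entries) from ever being unpacked on the orchestrator host.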

transcript.py – Transcript rendering

  • Owns: render_markdown(), per-tool rendering functions, truncation.
  • Boundary: Pure transformation from TranscriptEntry list to markdown. No I/O, no side effects.

data_sources/__init__.py – Data source registration

  • Owns: register_selected() – resolves tokens/URLs from config and calls each data source’s register() function.
  • Boundary: Bridges workflow_config (for token/URL resolution) and tools (for registration). Only registers data sources declared in the workflow config.

data_sources/gitlab.py, konflux.py, testing_farm.py

  • Own: register() function that creates tool definitions and closures over credentials, and registers them on the ToolRegistry.
  • Boundary: Each module talks to one external API. Credentials are captured at registration time via closure, never stored globally.

models/_types.py – Shared data classes

  • Owns: ModelError, ToolDef, ToolCall, Usage, ModelResponse, TranscriptEntry; ModelAdapter protocol (implemented by model backends).
  • Boundary: Pure data. No imports from other hummingbird modules.

_http.py – Shared HTTP infrastructure

  • Owns: new_session() (requests session with retry adapter and configurable 429 handling), VertexAuth (Google ADC credentials).
  • Boundary: Used by model backends (models/gemini.py, models/anthropic.py) and data sources (data_sources/konflux.py). No model-specific or data-source-specific logic.
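
The user-message merge might look like this (a sketch; the message shape {"role": ..., "content": [blocks]} follows the Anthropic Messages format, and the function name is illustrative):

```python
def merge_consecutive_user_messages(messages: list[dict]) -> list[dict]:
    """Anthropic rejects two adjacent messages with the same role, so
    consecutive user messages are merged by concatenating their content
    blocks. Works on shallow copies; the input list is not mutated."""
    merged: list[dict] = []
    for msg in messages:
        if merged and msg["role"] == "user" and merged[-1]["role"] == "user":
            merged[-1] = {"role": "user",
                          "content": merged[-1]["content"] + msg["content"]}
        else:
            merged.append(dict(msg))
    return merged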

models/gemini.py – Gemini model adapter

  • Owns: GeminiModel with generate(), make_user_content(), make_tool_responses(), to_canonical(), from_canonical().
  • Boundary: Translates between internal types and the Gemini REST API. Supports API key (direct) and Vertex AI (ADC) authentication modes.
  • Note: Parses cachedContentTokenCount from Gemini’s implicit server-side caching into Usage.cache_read_tokens.

models/anthropic.py – Anthropic model adapter (Vertex AI)

  • Owns: AnthropicVertexModel with generate(), make_user_content(), make_tool_responses(), to_canonical(), from_canonical().
  • Boundary: Translates between internal types and the Anthropic Messages API via Vertex AI rawPredict.
  • Note: make_tool_responses() uses _last_tool_ids instance state to pair tool results with tool-use IDs.
  • Note: generate() merges consecutive user messages before sending (Anthropic requires strict user/assistant alternation).
  • Note: generate() annotates messages with sliding-window cache breakpoints (cache_control) on shallow copies to avoid mutating contents. The ephemeral parameter skips cache writes for transient warning messages.
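
The user-message merge from the second note might look like this (a sketch; the message shape {"role": ..., "content": [blocks]} follows the Anthropic Messages format, and the function name is illustrative):

```python
def merge_consecutive_user_messages(messages: list[dict]) -> list[dict]:
    """Anthropic rejects two adjacent messages with the same role, so
    consecutive user messages are merged by concatenating their content
    blocks. Works on shallow copies; the input list is not mutated."""
    merged: list[dict] = []
    for msg in messages:
        if merged and msg["role"] == "user" and merged[-1]["role"] == "user":
            merged[-1] = {"role": "user",
                          "content": merged[-1]["content"] + msg["content"]}
        else:
            merged.append(dict(msg))
    return merged
```

Working on copies matters here for the same reason as the cache-breakpoint note: contents is the canonical history, and generate() must not mutate it.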

4. Security Model

The agent runs in a shared OpenShift cluster without cluster-admin access, processing merge requests from repositories where external contributors can submit code. The security model addresses two threat vectors: (1) the LLM executing arbitrary commands chosen by an attacker-controlled MR, and (2) credential leakage between the model, the sandbox, and orchestrator actions.

4.1. Sandbox isolation

All commands generated by the LLM run inside an ephemeral container, never on the orchestrator host. The sandbox has no credentials, no network, and no visibility into the orchestrator.

flowchart LR
    subgraph orchestrator [Orchestrator]
        Agent["Agent Loop"]
        Creds["Secrets<br/>(tokens, kubeconfig)"]
    end
    subgraph sandbox [Sandbox Container]
        Shell["sh -c commands"]
        Files["/tmp/data/<br/>(incl. _out/ spill dir)"]
    end
    Agent -->|"stdin pipe<br/>(write_file)"| sandbox
    Agent -->|"exec command"| sandbox
    sandbox -->|"stdout/stderr"| Agent
    sandbox -.-x|"NO network"| Internet["Internet / K8s API"]
    sandbox -.-x|"NO access"| Creds

Podman (local development):

  • --network=none – complete network isolation; curl, wget, pip install all fail
  • --user 65532 – fixed non-root UID; no privilege escalation
  • No host volume mounts – data enters only via stdin piping through write_file()

Kubernetes (production):

Every field in the pod manifest is set explicitly so the pod stays portable to vanilla Kubernetes under Pod Security Admission (restricted level), rather than relying solely on OpenShift's restricted-v2 SCC admission:

  • automountServiceAccountToken: false – no K8s API access from sandbox
  • runAsNonRoot: true – enforced at pod level
  • seccompProfile: RuntimeDefault – required by restricted level
  • allowPrivilegeEscalation: false – container level
  • capabilities.drop: ["ALL"] – container level
  • activeDeadlineSeconds: 1800 – pod self-terminates after 30 minutes even if the orchestrator crashes; prevents orphaned pods
  • restartPolicy: Never – pod does not restart on failure
  • NetworkPolicy on the sandbox namespace blocks all egress from all pods in the namespace (podSelector: {}, egress: [])

The pod does not set runAsUser explicitly. On OpenShift, the SCC assigns a UID from the namespace range. On vanilla K8s, the image’s USER directive is used.
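
The security-relevant fields above translate to roughly this manifest fragment (an illustrative subset as a Python dict, not the full manifest built by K8sSandbox):

```python
# Sketch of the explicitly-pinned sandbox pod fields.
SANDBOX_POD_SECURITY = {
    "spec": {
        "automountServiceAccountToken": False,  # no K8s API from sandbox
        "activeDeadlineSeconds": 1800,          # self-terminate after 30 min
        "restartPolicy": "Never",
        "securityContext": {                    # pod level
            "runAsNonRoot": True,
            "seccompProfile": {"type": "RuntimeDefault"},
            # runAsUser deliberately unset: SCC-assigned on OpenShift,
            # image USER directive on vanilla K8s
        },
        "containers": [{
            "name": "sandbox",
            "securityContext": {                # container level
                "allowPrivilegeEscalation": False,
                "capabilities": {"drop": ["ALL"]},
            },
        }],
    },
}
```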

4.2. Credential separation (three-tier token model)

Tokens are split into three tiers with strict code-path separation:

flowchart TB
    subgraph tier1 [Tier 1: Orchestrator Tokens]
        OT["ORCHESTRATOR_GITLAB_TOKEN_*<br/>Write scope (api)<br/>Env vars only, never in YAML"]
    end
    subgraph tier2 [Tier 2: Model Tool Tokens]
        MT["GITLAB_TOKEN_RO, etc.<br/>Read scope (read_api)<br/>Env var NAMES in YAML"]
    end
    subgraph tier3 [Tier 3: Sandbox]
        SB["Zero credentials<br/>Zero network<br/>Zero SA token"]
    end

    OT -->|"used by"| Actions["actions.py<br/>(notes, rate limits)"]
    MT -->|"used by"| DataSources["data_sources/<br/>(GitLab, Konflux, TF)"]
    SB -->|"used by"| SandboxExec["sandbox_exec<br/>(sh -c commands)"]

    Actions -.-x|"NEVER"| DataSources
    Actions -.-x|"NEVER"| SandboxExec
    DataSources -.-x|"NEVER"| Actions
    DataSources -.-x|"NEVER"| SandboxExec

Tier 1 – Orchestrator tokens. Write-capable GitLab tokens (api scope) used by actions.py for posting/editing notes, rate-limit counting, and member access checks. Resolved by convention from env vars: ORCHESTRATOR_GITLAB_TOKEN_<MANGLED_PROJECT> (per-project) or ORCHESTRATOR_GITLAB_TOKEN (global fallback). The ORCHESTRATOR_ prefix is a structural safeguard – these env var names can never appear in the YAML config’s data_sources or tokens sections because those sections reference model tool token names, which never start with ORCHESTRATOR_.

Tier 2 – Model tool tokens. Read-only tokens (read_api scope for GitLab) used by data source modules during tool execution. Declared in the YAML config as env var names (not values):

data_sources:
  gitlab:
    token_env: GITLAB_TOKEN_RO          # name of the env var

Per-project overrides are possible:

projects:
  redhat/hummingbird/containers:
    tokens:
      gitlab: GITLAB_TOKEN_CONTAINERS_RO  # overrides token_env for this project

The resolution chain in resolve_tool_token() is: project_entry.tokens[ds] -> workflow_cfg.data_sources[ds].token_env -> ""

Storing env var names (not values) in YAML means the config file is safe to commit and audit. Actual secret values live in environment variables, injected via K8s Secrets at deployment time.

Tier 3 – Sandbox. The sandbox container has zero credentials, zero network access, and automountServiceAccountToken: false (no K8s API access). Data enters the sandbox only via write_file() (stdin piping). The model cannot instruct the sandbox to reach external APIs – it must use the orchestrator’s data source tools.

4.3. Namespace separation

Sandbox pods run in a dedicated namespace (hummingbird--agent-sandbox), separate from the orchestrator namespace (hummingbird--internal). This limits blast radius: even if a sandbox pod is compromised, it has no visibility into the orchestrator’s Secrets, Pods, or ServiceAccount tokens.

RBAC setup:

The orchestrator’s ServiceAccount gets a Role in the sandbox namespace (not its own namespace) granting only:

  • pods: create, get, list, delete, patch – sandbox pod lifecycle and pool claims
  • pods/exec: create – command execution via kubectl exec

No CRDs, no custom runtimes, no cluster-scoped resources. The orchestrator needs only namespace-scoped permissions, so it works with standard OpenShift RBAC without requesting cluster-admin.

Konflux data is fetched via bearer token from kubeconfig credentials, not from inside the cluster. The orchestrator’s ServiceAccount does not need access to Konflux namespaces.

NetworkPolicy:

A deny-all-egress NetworkPolicy in the sandbox namespace uses podSelector: {} to match all pods and sets egress: []. The sandbox cannot reach the internet, the K8s API, or other pods in the cluster.

4.4. Security invariants

These must hold for the security model to be effective. Any change that violates one of these is a security regression:

  1. Orchestrator tokens must NEVER flow into ToolRegistry, data_sources, or model contents. They are resolved in actions.py only.

  2. Model tool tokens must NEVER flow into actions.py. They are resolved in workflow_config.py and consumed in data_sources/.

  3. The sandbox must NEVER have network access or credentials. No host mounts, no SA token, no env vars with secrets.

  4. The sandbox archive is NEVER extracted on the orchestrator. It is created inside one sandbox and restored inside another. The orchestrator only transports the bytes.

  5. No cluster-admin required. Only namespace-scoped resources (Role, RoleBinding, Pod, NetworkPolicy) are used.

  6. Agent-generated notes are skipped on re-processing. Notes containing SESSION_MARKER_PREFIX are filtered out in handle_note() to prevent infinite loops.

  7. All auto-triggers require Developer access. handle_pipeline(), handle_merge_request(), and handle_note() each check check_member_access() on the event’s user before executing a workflow. Events from non-developers are silently skipped (or replied to with an access-denied message for slash commands). This ensures that the target MR’s work products – its diff, description, CI logs, and commit messages – originate from a trusted author. Fork MRs from external contributors are blocked because the MR author lacks Developer+ on the target project.

    Scope limitation: this gate only covers the target MR. Once the agent is running, the model controls tool arguments and can direct tools at content beyond the target MR – other MRs, other refs, other job IDs, even other projects reachable by the read-only token. The current mitigations are: (a) workflow prompts instruct the model to use the event’s {project, iid, sha}, (b) the model would need to be manipulated via prompt injection from already-trusted content to deviate, (c) the sandbox prevents the model from acting on manipulated reasoning beyond producing text output, and (d) the token is read-only with minimal scope. However, if a data source tool is added that returns user-authored prose from arbitrary resources (e.g. issue bodies, wiki pages), it should apply per-author trust filtering (invariant #9).

  8. Data source tools that accept URLs must validate them against an allowlist. tf_get_test_log restricts URLs to the Testing Farm artifacts prefix to prevent the LLM from directing the orchestrator to fetch arbitrary URLs (SSRF).

  9. Third-party commentary entering the model prompt must be trust-filtered. Data sources that feed text from users other than the event trigger into contents (discussion comments, issue bodies) must gate on project membership at Developer+ level per author. Content from untrusted authors must be redacted to a fixed placeholder, never sanitized or escaped – there is no reliable way to escape adversarial text for an LLM prompt. Discussions where all notes are untrusted must be dropped entirely. See §9.1 for the reference implementation in gitlab_get_mr_discussions.

    This invariant covers commentary (what other people said about the MR), not work products (the diff, CI logs, commit messages). Work products are inherently the input to the agent’s analysis and cannot be content-filtered without defeating the agent’s purpose. For the target MR, their trust comes from invariant #7 (the event trigger is Developer+). For content the model fetches beyond the target MR, trust depends on the tool: discussion tools must filter per-author (#9), while code/log/metadata tools rely on the mitigations described in #7.

5. Configuration System

5.1. Two sources, strict separation

Configuration comes from exactly two sources with no overlap:

  • YAML config file (CONFIG_PATH): operational settings, workflow definitions, project allowlists, data source declarations, and token env var names. This file is safe to commit, review, and audit.
  • Environment variables: secrets only (API keys, GitLab tokens, kubeconfig paths). These are injected at deployment time via K8s Secrets.

This split exists because YAML provides structure, validation, and audit trails, while secrets must stay out of version control.

5.2. Config loading

Two config objects are built at startup:

Config (from config.py): loaded by Config.load(settings), where settings is the settings: section from the YAML file. Auth/bootstrap fields come from env vars (GOOGLE_API_KEY, GOOGLE_CLOUD_PROJECT, etc.). Operational fields come from the settings dict with sensible defaults.

AgentConfig (from workflow_config.py): loaded by workflow_config.load(path). Contains all workflow definitions, project entries, and a pre-built project_index.

The two are kept separate because Config is needed everywhere (model construction, sandbox creation, session storage), while AgentConfig is only needed for event routing and data source registration.

5.3. YAML config structure

settings:
  gitlab_url: https://gitlab.com         # operational, not a secret
  sandbox:
    image: quay.io/.../image:tag
    namespace: my-namespace              # K8s only
  model: gemini-3.1-pro-preview
  # model: claude-sonnet-4-5@20250929   # Anthropic via Vertex AI (alternative)
  max_iterations: 30
  max_runs_per_mr: 5
  internal_notes: true
  max_concurrent_agents: 4
  sqs_queue_url: ""                      # empty = no SQS (local dev)
  s3_session_bucket: ""                  # empty = no S3 (local dev)
  trigger_prefix: /hummingbird           # slash command prefix
  session_marker_prefix: hummingbird-session  # HTML comment marker ID
  auto_trigger: true                     # auto-run on failed pipelines

workflows:
  analyze-failures:
    prompt: workflows/analyze-failures.md  # relative to config file dir
    action: post_gitlab_note
    model: gemini-3.1-pro-preview                  # per-workflow override
    max_iterations: 50                     # per-workflow override

    data_sources:
      gitlab:
        token_env: GITLAB_TOKEN_RO         # env var name, not the value
      konflux:
        cluster_url: https://example.com:6443/ns/my-tenant
        kubeconfig_env: KUBECONFIG
        kubearchive_url: https://kubearchive-api-server-product-kubearchive.apps.example.com
      testing_farm: {}

    projects:
      redhat/hummingbird/containers:
        tokens:                            # per-project token overrides
          gitlab: GITLAB_TOKEN_CONTAINERS_RO

Design choices in this structure:

Workflow-first organization. Each workflow owns its project list, not the other way around. This scopes data source permissions per workflow-project pair. A future code-review workflow can have different GitLab tokens (with different scopes) than the analyze-failures workflow, with no ambiguity.

Token env var names in YAML (not values). The YAML file is committed to the repo. It contains token_env: GITLAB_TOKEN_RO (a name), not the actual token. The actual secret value is resolved at runtime via os.environ.get(env_var). This allows the config to be reviewed and audited without exposing secrets.

Inline cluster_url only. The Konflux cluster URL is operational configuration (it identifies which cluster to talk to), not a secret. Putting it inline in YAML makes it visible and auditable. The URL must include the namespace path (e.g. /ns/my-tenant).

5.4. Project index

At load time, _build_project_index() constructs a reverse lookup:

project_index: dict[str, list[tuple[str, WorkflowConfig, ProjectEntry]]]
# e.g. {"redhat/hummingbird/containers": [("analyze-failures", wf_cfg, proj_entry)]}

This provides O(1) lookup when a pipeline event arrives with a project path. A single project can appear in multiple workflows (e.g. both analyze-failures and a future code-review), and each matching workflow will be triggered independently.
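
The index construction is a straightforward inversion (a sketch with dicts standing in for the WorkflowConfig/ProjectEntry dataclasses; the real code is _build_project_index()):

```python
def build_project_index(workflows: dict) -> dict:
    """Map each project path to every (workflow_name, workflow_cfg,
    project_entry) that declares it, so event routing is a single
    dict lookup."""
    index: dict[str, list] = {}
    for wf_name, wf_cfg in workflows.items():
        for path, entry in wf_cfg["projects"].items():
            index.setdefault(path, []).append((wf_name, wf_cfg, entry))
    return index
```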

5.5. Token resolution chains

Model tool tokens (for data source API calls):

resolve_tool_token(project_entry, ds_name, workflow_cfg):
  1. project_entry.tokens[ds_name]           -> per-project override
  2. workflow_cfg.data_sources[ds_name].token_env  -> workflow default
  3. "" (empty string)                        -> no token
  Each step resolves the env var NAME, then reads os.environ[name].
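The chain above can be sketched as follows, with simplified dict-based config objects standing in for the real ProjectEntry and WorkflowConfig types:

```python
import os

def resolve_tool_token(project_entry: dict, ds_name: str, workflow_cfg: dict) -> str:
    """Walk the override chain; each step yields an env var NAME to read."""
    # 1. per-project override
    env_name = project_entry.get("tokens", {}).get(ds_name)
    # 2. workflow-level default
    if not env_name:
        env_name = workflow_cfg.get("data_sources", {}).get(ds_name, {}).get("token_env")
    # 3. no token configured
    if not env_name:
        return ""
    return os.environ.get(env_name, "")
```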

Orchestrator tokens (for GitLab note operations):

resolve_orchestrator_token(project_path):
  1. ORCHESTRATOR_GITLAB_TOKEN_<MANGLED_PROJECT>  -> per-project
  2. ORCHESTRATOR_GITLAB_TOKEN                     -> global fallback
  Mangling: "/" -> "_", "-" -> "_", uppercase.
  e.g. "redhat/hummingbird/containers" -> ORCHESTRATOR_GITLAB_TOKEN_REDHAT_HUMMINGBIRD_CONTAINERS
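The mangling rule is mechanical; a sketch (helper names here are hypothetical):

```python
def mangle_project_path(project_path: str) -> str:
    """Project path -> env var suffix: '/' and '-' become '_', then uppercase."""
    return project_path.replace("/", "_").replace("-", "_").upper()

def orchestrator_token_names(project_path: str) -> list[str]:
    """Candidate env var names, most specific first."""
    return [
        f"ORCHESTRATOR_GITLAB_TOKEN_{mangle_project_path(project_path)}",
        "ORCHESTRATOR_GITLAB_TOKEN",
    ]
```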

Cluster URL (for Konflux):

resolve_cluster_url(ds_cfg):
  -> ds_cfg.cluster_url (inline value in YAML)

5.6. Environment validation

validate_env(agent_cfg) checks at startup that all referenced env vars exist. It walks every workflow’s data sources and project token overrides, collecting missing vars into a single error message. This catches configuration errors early instead of failing mid-run when a specific data source is first used.
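A reduced sketch of the fail-fast pattern, assuming the caller has already walked the config and collected every referenced env var name:

```python
import os

def validate_env(referenced_env_vars: list[str]) -> None:
    """Fail fast at startup: collect ALL missing env vars into one error
    instead of failing one at a time mid-run."""
    missing = sorted({name for name in referenced_env_vars if name not in os.environ})
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
```

Reporting every missing variable at once matters operationally: a deployment with three missing secrets gets fixed in one pass, not three.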

6. Agent Loop Design

For the wire-format walkthrough (what bytes go to the model API, what comes back), see Agent Model Loop. This section covers the design rationale behind the loop.

6.1. Full history replay

The Gemini and Anthropic APIs are stateless. Every call sends the complete contents list from the beginning of the conversation. This means every large tool output sitting in history inflates every subsequent API call.

This property is fundamental to why auto-spill exists (section 7). Without auto-spill, a single cat of a 32KB file early in the conversation adds ~8K tokens to every remaining API call. Over a 15-iteration run, that is ~120K wasted tokens.

The alternative – conversation compaction (replacing old tool results with summaries) – was considered and deferred. Each provider has strict requirements about content structure (e.g. Gemini model turns must match the preceding tool turns; Anthropic enforces user/assistant alternation), and modifying history risks confusing the model or violating API constraints. Auto-spill solves 90% of the problem with none of the risk.

6.2. Budget model: iterations + context limit

The agent uses a dual-limit approach rather than a cumulative token budget:

Iteration limit (max_iterations, default 30 per workflow config). Hard cap on the number of model round-trips. This is the primary cost control lever. With auto-spill keeping per-call context bounded, iteration count is roughly proportional to cost.

Per-call context limit (CONTEXT_LIMIT, default 60,000 tokens). Checked after each API call using response.usage.input_tokens. This is a safety net for cases where auto-spill is not sufficient (e.g., many small tool results that individually stay under the spill threshold but cumulatively fill the context).

Why not a cumulative token budget? Because with full history replay, each API call re-sends everything. “Cumulative billed tokens” double-counts: call 1 sends 5K, call 2 sends 10K (including the 5K again), so billed total is 15K but actual new content is only 10K. Iteration count is a simpler and more predictable proxy for cost.

6.3. Two-tier budget escalation

Both limits use the same escalation pattern:

flowchart LR
    Normal["Normal<br/>tools available"]
    Soft["Soft Warning (80%)<br/>ITERATION_WARNING or<br/>CONTEXT_WARNING<br/>tools still available"]
    Hard["Hard Stop (100%)<br/>FINAL_TURN_WARNING<br/>tool_defs = [] (no tools)"]

    Normal -->|"80% reached"| Soft
    Soft -->|"100% reached"| Hard

Soft warning (80%). An ephemeral user message (ITERATION_WARNING) is injected into contents for that turn only, then popped before the response is persisted. Tools remain available so the model can finish in-progress work.

Hard stop (100%). FINAL_TURN_WARNING is injected as an ephemeral user message AND tool_defs is set to [] (empty list) to physically prevent further tool calls. The model must produce text. This is more reliable than disabling tools at the API layer (toolConfig.functionCallingConfig.mode: NONE on Gemini, tool_choice: {"type": "none"} on Anthropic), which models sometimes ignore (Gemini may return UNEXPECTED_TOOL_CALL).

6.3.1 Ephemeral messages and Anthropic alternation

All per-turn warnings use ephemeral user messages: they are appended to contents before the API call and popped immediately after. This keeps the system prompt stable across all iterations and prevents warnings from polluting the conversation history saved in sessions.

Anthropic requires strict user/assistant alternation. The Anthropic adapter’s generate() merges consecutive user messages on a copy of contents before sending the request, so ephemeral warnings (and other adjacent user turns) do not break the API contract.
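A sketch of the merge step, assuming Anthropic-style messages with list-valued content blocks (simplified; the real adapter also operates on a copy of contents and handles tool_result blocks):

```python
def merge_consecutive_user_turns(messages: list[dict]) -> list[dict]:
    """Return a copy where adjacent user messages are merged into one,
    satisfying Anthropic's strict user/assistant alternation."""
    merged: list[dict] = []
    for msg in messages:
        if merged and msg["role"] == "user" and merged[-1]["role"] == "user":
            prev = dict(merged[-1])
            prev["content"] = list(prev["content"]) + list(msg["content"])
            merged[-1] = prev        # copy-on-write: originals untouched
        else:
            merged.append(msg)
    return merged
```

Because the merge happens on a copy at send time, the ephemeral warning can still be popped from the original contents list afterwards without leaving any trace.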

6.4. Empty response handling

Models occasionally return empty responses (no text, no tool calls). The agent retries up to MAX_EMPTY_RETRIES (2) times:

  1. Pop the empty response from contents. It adds nothing and may confuse the model on the next call.
  2. Inject EMPTY_RESPONSE_NUDGE as an ephemeral user message on the next turn: “Your previous response was empty. Please continue…”
  3. Continue the loop (consuming an iteration).

The nudge is delivered as an ephemeral user message (injected before the API call and popped after). This avoids mutating the system prompt and keeps the conversation history clean for session persistence.

The empty_retries counter resets to 0 after any successful iteration (one where tool calls were executed). This means the model gets fresh retries if it produces empty responses at different points in the conversation.

MALFORMED_FUNCTION_CALL handling. Models sometimes return a finishReason of MALFORMED_FUNCTION_CALL (Gemini) with no usable tool calls. This is treated as a special case of empty response: the retry mechanism kicks in, but the nudge is replaced with MALFORMED_CALL_NUDGE which tells the model to retry with simpler arguments and avoid large text payloads in tool call arguments.

6.5. Error recovery

Model API errors. _generate_with_retry() catches ModelError and retries up to MODEL_RETRY_COUNT (4) times if retryable is True (HTTP 5xx and 429). Non-retryable errors (4xx, auth failures) fail immediately. Retries use exponential backoff: MODEL_RETRY_BASE_DELAY * 2^attempt, capped at MODEL_RETRY_MAX_DELAY (60s), giving delays of 5s, 10s, 20s, 40s. Transport-level 429 retry is disabled on the model’s HTTP session (retry_429=False) so that rate-limit responses are handled at the model retry layer with proper backoff instead of being silently retried by urllib3.
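A sketch of the retry loop under the constants above (the injectable sleep function exists purely to make the sketch testable; the real code presumably calls time.sleep directly):

```python
import time

MODEL_RETRY_COUNT = 4
MODEL_RETRY_BASE_DELAY = 5
MODEL_RETRY_MAX_DELAY = 60

class ModelError(Exception):
    def __init__(self, msg, retryable=False):
        super().__init__(msg)
        self.retryable = retryable

def generate_with_retry(call, sleep=time.sleep):
    """Retry retryable ModelErrors with exponential backoff: 5s, 10s, 20s, 40s."""
    for attempt in range(MODEL_RETRY_COUNT + 1):
        try:
            return call()
        except ModelError as err:
            if not err.retryable or attempt == MODEL_RETRY_COUNT:
                raise
            delay = min(MODEL_RETRY_BASE_DELAY * 2 ** attempt, MODEL_RETRY_MAX_DELAY)
            sleep(delay)
```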

Transport errors. Each model adapter’s generate() catches requests.Timeout and requests.ConnectionError from the HTTP call and wraps them as retryable ModelError. This makes timeouts (read and connect) and connection failures subject to the same retry logic as HTTP 5xx/429. Timeout is caught before ConnectionError because ConnectTimeout inherits from both. Note that the urllib3 retry adapter does not retry POST requests (not idempotent), so transport errors from model calls always propagate to our code.

Unexpected loop errors. run_agent_loop wraps the _generate_with_retry call in a try/except Exception that logs and breaks instead of propagating. This ensures the function always returns partial results (accumulated contents, transcript, sandbox state) even when an unexpected exception occurs (e.g. JSONDecodeError from a truncated API response). The caller saves the session normally – conversation history and sandbox archive are preserved for resumption. This follows the “partial over nothing” principle: the iterations of work already accumulated are more valuable than a crash.

No failure-path session save. _execute_workflow does not save a session when run_workflow raises. With the agent loop catching unexpected errors, the failure path only fires for infrastructure errors (sandbox start, config) where there is no useful state. Not saving avoids overwriting a previous good session when a reply attempt fails.

Tool execution errors. _execute_tool_calls() wraps each tool_registry.execute() in a try/except. Unhandled exceptions are caught and returned to the model as {"error": "Tool X failed: ..."}. This prevents a single broken tool from crashing the entire session – the model sees the error and can adapt.

Sandbox exec timeout. If a command exceeds EXEC_TIMEOUT (120s), the TimeoutExpired exception is caught in the tool registry and returned as a structured error dict. The model can retry with a different command or proceed without the result.

6.6. Session resumption

When resuming from a previous session:

  1. initial_contents (the full contents list from the previous run) is prepended, followed by the user’s reply as a new user turn.
  2. CONTINUATION_PROMPT is appended to the system prompt.
  3. The sandbox is restored from the archived tarball.
  4. A fresh iteration budget starts from 0.

Saved sessions use a versioned JSON envelope (format_version: 1) whose contents are in a canonical (OpenAI-style) message format, independent of whether the run used Gemini or Anthropic. On resume, run_workflow converts initial_contents into the active model’s native wire format: for format_version >= 1, via model.from_canonical(); for legacy sessions without a version, GeminiModel.to_canonical() migrates Gemini-native history to canonical form first, then model.from_canonical() loads it into the current backend. After a successful run, model.to_canonical() converts the native contents back to canonical form before persistence.

CONTINUATION_PROMPT is critical. Without it, the workflow prompt (e.g., analyze-failures.md) tells the model to follow a rigid multi-phase workflow: Data.1, Data.2, Analysis.1… The model would try to re-run the entire analysis. The continuation prompt overrides this: “Do NOT re-run the full workflow. Respond directly to the user’s question.”

The practical constraint on resumed sessions is the context window, not iterations. The restored history from an 8-iteration cold start uses ~50-100K input tokens. Each new tool call adds more. The existing CONTEXT_LIMIT check still applies and will force wrap-up if the context grows too large.

Canonical session format (rationale)

Storing sessions in canonical form decouples persisted history from any one provider’s JSON shape. The same saved thread can be resumed on a different model adapter (including cross-provider migration) because conversion happens at load and save boundaries only; the agent loop continues to treat native contents as opaque between those steps.

6.7. Prompt caching

Since the agent replays the full conversation history on every API call (see 6.1), prompt caching reduces the cost of re-processing unchanged prefixes. The two model backends handle caching differently.

Gemini. Vertex AI caches prefixes implicitly on the server side. The adapter parses cachedContentTokenCount from usageMetadata and reports it as cache_read_tokens. No request-side changes are needed.

Claude. Vertex AI does not support Anthropic’s automatic caching (which requires opt-in at the API level and is not yet available on Vertex). The adapter uses explicit cache_control: {"type": "ephemeral"} annotations on message content blocks. Up to 4 breakpoint slots are available per request; the agent uses 2.

Why 2 breakpoints on messages, not 4 on system/tools/messages. The Anthropic prefix hash is cumulative: it covers everything from the start of the request (tools, system prompt, messages) up to the annotated block. A single breakpoint on a message already caches the entire prefix including tools and system. Separate breakpoints on earlier components would be redundant and waste slots. Using only 2 of the 4 slots leaves room for future use.

Why a sliding window. The conversation grows by 2 messages per turn (assistant response + user tool result). Two breakpoints slide forward in lockstep:

  • B2 on the last message writes the full prefix to cache.
  • B1 on the previous B2 position reads the prior prefix from cache.

Each call after the first gets a cache hit for the entire prefix minus the newest turn. _prev_cache_index on the model instance tracks where B2 was placed so that B1 can be positioned on the next call.

Why shallow copies. The contents list is owned by the agent loop and persisted in sessions. _annotate_cache_breakpoints() creates a shallow copy of the messages list and replaces only the B1/B2 entries with copies that have cache_control injected. The originals are never mutated, so no cleanup is needed after the API call and cache_control never leaks into saved sessions.
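A sketch of the copy-on-write annotation, with prev_index standing in for _prev_cache_index (simplified; the block structure of real messages may differ):

```python
def annotate_cache_breakpoints(messages, prev_index):
    """Shallow-copy the messages list; inject cache_control on B2 (last
    message) and B1 (the previous B2 position) without mutating originals."""
    out = list(messages)                              # shallow copy of the list only
    def with_cache(msg):
        copy = dict(msg)
        content = [dict(block) for block in copy["content"]]
        content[-1]["cache_control"] = {"type": "ephemeral"}
        copy["content"] = content
        return copy
    out[-1] = with_cache(out[-1])                     # B2: cache write for the full prefix
    if prev_index is not None and prev_index < len(out) - 1:
        out[prev_index] = with_cache(out[prev_index]) # B1: cache read of the prior prefix
    return out
```

Only the two annotated entries are replaced with copies; every other element of the returned list is the same object as in the original, so session persistence never sees cache_control.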

Ephemeral message interaction. When the agent loop injects a transient message (see 6.3.1), generate() receives ephemeral=True. B2 is not placed (no cache write) so the transient content never enters the cache. _prev_cache_index is not updated, so the next non-ephemeral call’s B1 still points to the last valid write and produces a cache hit. B1 is still placed to provide a cache read for the stable prefix on the ephemeral call itself.

Cost estimation. MODEL_PRICING in config.py stores a 4-tuple per model prefix: (input, output, cache_read, cache_write) per million tokens. estimate_cost() computes: uncached * input + cache_read_tokens * cache_read_rate + cache_creation_tokens * cache_write_rate + output * output_rate. Gemini has cache_write = 0 (implicit caching has no write surcharge); Claude has non-zero cache_write (1.25x the base input rate for explicit breakpoints).
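A sketch of the cost arithmetic with illustrative (not real) per-million-token rates in the 4-tuple shape described above:

```python
# Hypothetical rates: (input, output, cache_read, cache_write) per million tokens.
MODEL_PRICING = {
    "example-model": (3.00, 15.00, 0.30, 3.75),   # cache_write = 1.25x input
}

def estimate_cost(model: str, uncached_in: int, cache_read: int,
                  cache_write: int, output: int) -> float:
    inp, out, read, write = MODEL_PRICING[model]
    return (uncached_in * inp + cache_read * read +
            cache_write * write + output * out) / 1_000_000
```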

7. Tool System and Auto-Spill

7.1. Tool registry design

ToolRegistry is the single point of dispatch for all tool calls. The agent calls registry.execute(tool_call) and gets a dict back. It never calls sandbox methods or data source functions directly.

Three built-in tools are always available:

  • sandbox_exec – runs sh -c <command> in the sandbox container. Returns {exit_code, stdout, stderr}, with auto-spill for large outputs.
  • fetch_to_sandbox – calls a data source function and writes the result to a specified path in the sandbox. Returns metadata only (path, byte count, line count). Used when the model wants explicit path control.
  • fetch_batch_to_sandbox – calls multiple fetch_to_sandbox in one tool call. Saves iterations vs. sequential calls (e.g., fetching both pipelineruns and taskruns in one round-trip).

Data source tools are registered dynamically per workflow config. Each data source module provides a register() function that creates ToolDef objects and closures over credentials, then calls registry.register_data_source(tool_def, func, response_metadata).

7.2. Auto-spill architecture

Auto-spill is the key mechanism for keeping the LLM context bounded. Without it, large outputs accumulate in the contents list and inflate every subsequent API call (because both APIs replay the full history).

flowchart TD
    Output["Tool produces output"]
    SizeCheck{"size > OUTPUT_PREVIEW_BYTES<br/>(4 KB)?"}
    Inline["Return full output inline"]
    Spill["Write full output to<br/>/tmp/data/_out/N.txt"]
    Preview["Return preview<br/>(head + tail) +<br/>file path metadata"]

    Output --> SizeCheck
    SizeCheck -->|"<= 4 KB"| Inline
    SizeCheck -->|"> 4 KB"| Spill --> Preview

Three spill paths share the same _spill_counter to prevent filename collisions:

sandbox_exec spill. When stdout or stderr exceeds OUTPUT_PREVIEW_BYTES (4096 bytes), the full output is written to /tmp/data/_out/{counter}.txt via _spill_field(). The model receives:

{
  "exit_code": 0,
  "stdout": "<first 4KB>",
  "stdout_truncated": true,
  "stdout_file": "/tmp/data/_out/0.txt",
  "stdout_bytes": 32768,
  "stdout_lines": 1024,
  "stdout_tail": "<last 512 bytes>"
}

The preview (head + tail) gives the model enough context to decide whether to process the full file with jq/grep/head.
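A sketch of the spill decision for one field, with an injectable write_file callback standing in for the sandbox write (field names mirror the response shape above, but the helper itself is illustrative):

```python
OUTPUT_PREVIEW_BYTES = 4096
TAIL_BYTES = 512

def spill_field(name: str, data: bytes, counter: int, write_file) -> dict:
    """If `data` fits inline, return it whole; otherwise write the full
    payload to a spill file and return a head+tail preview plus metadata."""
    if len(data) <= OUTPUT_PREVIEW_BYTES:
        return {name: data.decode(errors="replace")}
    path = f"/tmp/data/_out/{counter}.txt"
    write_file(path, data)
    return {
        name: data[:OUTPUT_PREVIEW_BYTES].decode(errors="replace"),
        f"{name}_truncated": True,
        f"{name}_file": path,
        f"{name}_bytes": len(data),
        f"{name}_lines": data.count(b"\n"),
        f"{name}_tail": data[-TAIL_BYTES:].decode(errors="replace"),
    }
```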

Data source inline spill. When a direct data source call returns text larger than MAX_INLINE_SIZE (4096 bytes), _spill_data_source() writes it to /tmp/data/_out/{name}_{counter}.txt and returns:

{
  "saved_to": "/tmp/data/_out/konflux_list_pipelineruns_1.txt",
  "bytes": 85432,
  "lines": 2100,
  "preview": "<first 4KB>"
}

Streaming responses (e.g., gitlab_get_repo_archive which returns a StreamingResponse with suffix .tar.gz) are written via _spill_streaming(). Text streams (binary=False, the default) capture a UTF-8 preview from the head while piping chunks to the sandbox. Binary streams (binary=True) skip preview capture entirely and return only {saved_to, bytes}, avoiding meaningless decoded output for formats like tar.gz. The StreamingResponse.suffix field controls the file extension (.jsonl, .tar.gz, .log).

Non-streaming binary responses are written to .bin files with no preview via _spill_binary().

fetch_to_sandbox spill. Always writes to the caller-specified path (not /tmp/data/_out/). Returns metadata only (no preview). The model uses this when it wants a specific filename for later processing.

7.3. Why auto-spill instead of rejecting large outputs

The earlier design rejected large data source responses with “use fetch_to_sandbox instead.” Each rejection wasted a full round-trip: the model makes the call, gets rejected, then must repeat it with fetch_to_sandbox. With auto-spill, the data flows to a sandbox file transparently. Validation showed ~10% fewer iterations with auto-spill.

7.4. Why fetch_to_sandbox still exists alongside auto-spill

Auto-spill handles the common case, but fetch_to_sandbox provides:

  • Explicit path control. The model can choose meaningful filenames (/tmp/data/pipelineruns.json) instead of getting auto-generated names (/tmp/data/_out/konflux_list_pipelineruns_1.txt).
  • Batch fetching. fetch_batch_to_sandbox combines multiple fetches in one tool call, saving iterations.
  • No preview overhead. fetch_to_sandbox returns metadata only, which is useful when the model knows it will process the file with sandbox_exec anyway.

7.5. ToolDef notes

ToolDef has an optional notes field for domain knowledge that belongs in the system prompt but not in the tool’s JSON schema. Examples:

  • Konflux notes explain dual K8s/Kubearchive fetch, UID deduplication, and the two label selectors (BUILD vs. TEST).
  • Testing Farm notes explain XML result structure and usage patterns.

ToolRegistry.get_tool_notes() collects all non-None notes into a ## Data Source Notes section that is prepended to the workflow body in the system prompt:

BASE_SYSTEM_PROMPT + tool_notes + workflow_body

This keeps domain knowledge close to the tool definitions (in the data source module) rather than duplicated in every workflow .md file.

7.6. Response metadata

register_data_source() accepts optional response_metadata – a dict that is merged into every response from that tool (inline, auto-spill, and fetch_to_sandbox). Used for:

  • Konflux: {"konflux_ui": "https://..."} so the model can build reviewer-facing PipelineRun links.
  • Testing Farm: {"artifacts_base": "https://..."} so the model can build artifact links.

This avoids having the model ask “what is the Konflux UI URL?” – the information arrives with every tool response.

8. Sandbox Architecture

8.1. The Sandbox protocol

All backends implement an 8-method typing.Protocol:

class Sandbox(Protocol):
    def start(self) -> None: ...                                                 # create container/pod, pre-create /tmp/data/
    def exec(self, command, *, stdin_data: bytes | None) -> ExecResult: ...      # sh -c in sandbox
    def write_file(self, path, data: bytes) -> None: ...                         # stdin pipe: cat > path
    def stream_to_file(self, path, chunks: Iterable[bytes]) -> tuple[int, int]: ...  # Popen stdin pipe, returns (bytes, lines)
    def stream_exec(self, command, chunks: Iterable[bytes]) -> ExecResult: ...   # Popen with streaming stdin
    def read_file_iter(self, path) -> Iterator[bytes]: ...                       # Popen stdout pipe in chunks
    def cleanup(self) -> None: ...                                               # rm -f container / delete pod
    def linger(self, session_id) -> None: ...                                    # keep alive for reuse, or fall back to cleanup

ExecResult is (exit_code: int, stdout: str, stderr: str).

stream_to_file() and stream_exec() use subprocess.Popen with a stdin pipe, writing chunks incrementally. This avoids buffering large payloads in orchestrator memory (e.g., streaming a tar.gz archive into the sandbox). read_file_iter() reads from a Popen stdout pipe in 64 KB chunks.

write_file() uses stdin piping (cat > path), never host volume mounts. This is critical: it means data flows through the orchestrator process, not through a shared filesystem. The sandbox has no host mounts.
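A sketch of the stdin-pipe write for the podman backend (the command-builder split is for illustration; the real K8s variant swaps in kubectl exec):

```python
import subprocess

def build_write_cmd(container: str, path: str) -> list[str]:
    """Shell pipeline for stdin-based file writes: create the parent
    directory, then `cat` stdin into the target path."""
    return ["podman", "exec", "-i", container, "sh", "-c",
            f"mkdir -p $(dirname {path}) && cat > {path}"]

def write_file(container: str, path: str, data: bytes) -> None:
    """Pipe bytes through the orchestrator into the sandbox; no host mounts."""
    result = subprocess.run(build_write_cmd(container, path),
                            input=data, capture_output=True)
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode(errors="replace"))
```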

8.2. Backend selection

Three sandbox backends are available:

Backend          --sandbox   When to use
PodmanSandbox    podman      Local development (default for run)
K8sSandbox       k8s         Direct pod creation on a K8s cluster
K8sPoolSandbox   k8spool     Pre-warmed pool via Deployment (default for serve)

create_sandbox() factory handles podman and k8s. When --sandbox k8spool is selected, __main__.py creates a SandboxPool and passes it to run_workflow(), which instantiates K8sPoolSandbox directly.

K8s is never auto-detected from the environment. It requires explicit CLI flags. This prevents accidental use of a K8s sandbox when developing locally.

For K8s namespace resolution (k8s and k8spool):

  1. If --namespace is provided, use it.
  2. If --context is provided, extract the namespace from the kubeconfig context. If the context has no default namespace, raise an error.
  3. If neither is provided (in-cluster), use sandbox.namespace from the config file.

8.3. PodmanSandbox

podman run -d --name hb-sandbox-{uuid8} --network=none --user 65532 \
    --workdir /tmp {image} sleep infinity
  • Unique container name with UUID suffix prevents collisions
  • sleep infinity keeps the container alive for repeated exec calls
  • --network=none provides complete network isolation
  • --user 65532 is a fixed non-root UID (matches nonroot in distroless)
  • Cleanup: podman rm -f (force, in case exec is still running)

8.4. K8sSandbox: the hybrid approach

The K8s sandbox uses a hybrid of two tools:

  • kubernetes Python library for pod lifecycle: create_namespaced_pod, read_namespaced_pod (poll for Running), delete_namespaced_pod.
  • kubectl exec subprocess for command execution.

Why not use the kubernetes Python library for exec too? Three problems discovered during development:

  1. No stdin EOF signal in WebSocket v1-v4. The Kubernetes exec protocol uses WebSocket channels (stdin=0, stdout=1, stderr=2). Protocol versions 1-4 have no mechanism to signal “stdin is done.” Commands like cat > /file hang forever waiting for more input. Python client v5 support does not exist.

  2. BrokenPipeError on large stdin. Sending more than ~1MB through the WebSocket stream() API causes pipe errors, breaking write_file() for large data source responses.

  3. Unbounded memory from stream(). The stream() function accumulates all stdout/stderr data in memory with no streaming control. A command producing megabytes of output would consume unbounded memory.

kubectl exec as a subprocess avoids all three problems and provides the same interface as podman exec – stdin via subprocess.PIPE, stdout/stderr captured, exit code from return code. The implementation in exec() is nearly identical between PodmanSandbox and K8sSandbox.

8.5. Pod manifest design

The K8s pod manifest is built by _build_pod_manifest():

metadata:
  labels:
    app.kubernetes.io/name: hummingbird-agent-sandbox
spec:
  automountServiceAccountToken: false  # no K8s API from sandbox
  activeDeadlineSeconds: 1800          # 30min hard timeout, backstop
  restartPolicy: Never
  securityContext:
    runAsNonRoot: true
    seccompProfile: RuntimeDefault
  containers:
  - name: sandbox
    command: ["sleep", "infinity"]
    workingDir: /tmp
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    resources:
      requests: {cpu: 100m, memory: 256Mi, ephemeral-storage: 256Mi}
      limits: {cpu: "1", memory: 1Gi, ephemeral-storage: 2Gi}

runAsUser is deliberately omitted. On OpenShift, the restricted-v2 SCC assigns a UID from the namespace UID range. On vanilla K8s, the image’s USER directive is used. Setting an explicit UID would conflict with OpenShift’s SCC admission.

8.6. Auth modes

  • Local development: --context flag passes the kubeconfig context to config.new_client_from_config(context=...), creating a per-instance ApiClient.
  • Production (in-cluster): context=None triggers config.load_incluster_config(), using the pod’s ServiceAccount token.

The kubectl exec commands include --context when running locally but omit it when in-cluster (kubectl uses the default in-cluster config).

8.7. Pre-created directories

Both backends run mkdir -p /tmp/data in start() after the container/pod is up. This provides a working directory for the model without consuming an iteration. /tmp/data/_out (the spill directory) is created on demand by mkdir -p $(dirname ...) in write_file().

8.8. K8sPoolSandbox: Deployment-backed pool

On-demand pod creation adds 5-30 seconds of latency per workflow (image pull, scheduling, container start). For interactive use (slash commands, reply-based resumption), this delay is user-facing. The pool eliminates it.

Mechanism. A Kubernetes Deployment maintains a set of pods labelled hummingbird/role: standby. When a sandbox is needed, SandboxPool.claim() finds a Running standby pod, patches it to role: active, clears its ownerReferences (detaching it from the ReplicaSet), and annotates it with hummingbird/reap-by (an absolute UTC deadline). The Deployment controller sees the ReplicaSet is under the desired replica count and creates a replacement.

Pod lifecycle.

  1. Deployment creates pod -> role: standby (managed by ReplicaSet)
  2. claim() patches -> role: active, ownerReferences: [], reap-by: now + max_active_seconds
  3. Agent uses the pod via inherited K8sSandbox.exec()
  4. On success: linger() patches -> reap-by: now + linger_seconds, session-id: <id> (pod stays alive)
  5. On reply within linger window: try_reclaim() patches -> reap-by: now + max_active_seconds, clears session-id
  6. On failure or linger expiry: pod is deleted (by cleanup() or the reaper)

Pod lingering. After a successful workflow, pool sandboxes are kept alive for linger_seconds (default 300) instead of being deleted immediately. The pod is annotated with hummingbird/session-id to link it back to the session. If a user reply arrives within the linger window, try_reclaim() finds the pod by session-id, clears the annotation (marking it as in-use), and resets reap-by. This skips pod creation and S3 archive restoration. If no reply arrives, the reaper deletes the pod when its reap-by deadline passes.

Concurrency safety. session-id is only present while lingering. Its absence during active execution prevents concurrent replies from adopting the same pod. try_reclaim() is serialized by a threading lock; the first caller wins, others fall back to S3 restore with a fresh pod.

Reaping. reap_expired() is called once per workflow execution (after sandbox acquisition, before the model loop). It performs a single sorted pass over all active pods:

  1. Non-lingering pods (no session-id) past their reap-by deadline are deleted immediately.
  2. Lingering pods (with session-id) are sorted by reap-by ascending. The loop deletes pods that are either expired (now > reap-by) or exceed max_lingering_pods (default 2, configurable), starting with those closest to their deadline. The loop stops when both conditions are satisfied (now <= reap-by and remaining count <= limit).

This replaces the previous design where _reap_expired() ran inside claim() on every poll iteration. Moving it to a once-per-workflow call reduces API load and centralizes cleanup. The max_lingering_pods cap prevents unbounded pod accumulation from many short-lived workflows.
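The single sorted pass can be sketched with simplified pod dicts (annotation names shortened; the real code reads Kubernetes pod metadata):

```python
import time

def reap_expired(pods, max_lingering=2, now=None):
    """One sorted pass; returns names of pods to delete.

    Each pod is a dict with: name, reap_by (unix timestamp), and
    session_id (None unless the pod is lingering)."""
    now = time.time() if now is None else now
    # Non-lingering active pods past their deadline go immediately.
    doomed = [p["name"] for p in pods
              if p["session_id"] is None and now > p["reap_by"]]
    lingering = sorted((p for p in pods if p["session_id"] is not None),
                       key=lambda p: p["reap_by"])
    # Delete expired lingering pods, then trim over-cap pods closest to deadline.
    while lingering and (now > lingering[0]["reap_by"] or len(lingering) > max_lingering):
        doomed.append(lingering.pop(0)["name"])
    return doomed
```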

Indefinite wait. claim() polls until a pod is available or the shutdown event fires. It logs at DEBUG level on each poll, escalating to WARNING after ~2 minutes. This is normal behavior when the pool (Deployment replicas) is smaller than max_concurrent_agents – the SQS semaphore caps concurrency, so demand will not permanently exceed supply.

Config simplification. In pool mode, the pod image, resources, metadata, and security context are defined solely in the Deployment template. The agent config only needs sandbox.namespace and sandbox.active_deadline_seconds. This eliminates duplication between the agent configmap and the Deployment.

Class design. K8sPoolSandbox inherits from K8sSandbox to reuse exec(), write_file(), read_file_iter(), and cleanup(). It overrides __init__() (does not call super().__init__() since that resolves kubeconfig), start() (claims from pool instead of creating a pod), and linger() (patches annotations instead of deleting). It adds start_from_existing() for the reclaim path. All sandbox backends implement linger(session_id): non-pool backends fall back to cleanup().

9. Data Sources

Data sources are external APIs wrapped as tool-calling functions. The model invokes them by name; the orchestrator executes them and returns results (inline or auto-spilled). Each data source module follows the same pattern:

  1. Define TOOL_DEFS – a list of ToolDef objects with names, descriptions, parameter schemas, and optional notes.
  2. Define implementation functions that take a pre-configured client as the first argument.
  3. Provide a register() function that creates the client, wraps implementations with functools.partial, and calls registry.register_data_source() for each tool.

Credentials are captured at registration time via functools.partial closures. They are never stored as global state and never leak into tool definitions or model contents.
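A minimal sketch of the registration pattern (the registry and tool function are heavily simplified; only the closure-capture mechanics are the point):

```python
from functools import partial

class ToolRegistry:
    def __init__(self):
        self._tools = {}
    def register_data_source(self, name, func, response_metadata=None):
        self._tools[name] = (func, response_metadata or {})
    def execute(self, name, **kwargs):
        func, meta = self._tools[name]
        return {**func(**kwargs), **meta}   # metadata merged into every response

def get_mr_details(client, project: str, mr_iid: int) -> dict:
    # Illustrative implementation; `client` is bound at registration time.
    return {"project": project, "mr_iid": mr_iid,
            "authenticated": bool(client["token"])}

def register(registry: ToolRegistry, token: str) -> None:
    client = {"token": token}   # credential lives only in this closure
    registry.register_data_source(
        "gitlab_get_mr_details", partial(get_mr_details, client),
        response_metadata={"source": "gitlab"})
```

The token never appears in the ToolDef, the tool name, or the model-visible response; it exists only inside the partial-bound client.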

9.1. GitLab

Uses python-gitlab library with retry_transient_errors=True for automatic retry on transient HTTP errors.

Tools:

  • gitlab_get_mr_details – MR metadata (title, author, SHA, labels, URLs)
  • gitlab_get_commit_statuses – all CI/CD statuses for a commit SHA (paginated automatically). Covers both Konflux external statuses and native GitLab CI job statuses. Each status includes target_url (job page link containing the job ID) and allow_failure.
  • gitlab_get_mr_diff – changed files in the MR
  • gitlab_get_file_at_ref – raw file content at a git ref
  • gitlab_get_repo_archive – repository tar.gz via repository_archive(iterator=True). Returns a StreamingResponse with chunked binary data (suffix .tar.gz), so the archive streams to the sandbox without buffering in orchestrator memory.
  • gitlab_get_job_log – job trace (log output) for a GitLab CI job. Uses lazy=True on the job object and trace(iterator=True) for streaming. Each chunk is decoded with errors='replace' and ANSI escape codes are stripped per-chunk before re-encoding. Returns a StreamingResponse (suffix .log). Per-chunk ANSI stripping is safe because escape sequences are <20 bytes and chunks are 1024+ bytes.
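The per-chunk ANSI stripping can be sketched as a generator over byte chunks (the regex covers common CSI color/cursor sequences; the real stripper may be broader):

```python
import re

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def strip_ansi_chunks(chunks):
    """Decode each chunk with errors='replace', strip ANSI escape codes,
    and re-encode before streaming to the sandbox."""
    for chunk in chunks:
        text = chunk.decode(errors="replace")
        yield ANSI_ESCAPE.sub("", text).encode()
```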

No response_metadata is set because GitLab commit statuses already contain target_url fields that the model uses for linking.

gitlab_get_mr_discussions – trust filtering and redaction.

MR discussions are the first data source that feeds user-generated free text into the model prompt. Unlike diffs and CI logs (which are code or machine output), discussion comments can contain arbitrary prose written by anyone who can post on the MR – including external contributors on public projects. This creates a prompt injection vector: an attacker posts a comment containing instructions that manipulate the model’s review output.

The discussions tool addresses this with a three-layer defence:

  1. Author trust gate. Each note’s author is checked for Developer+ access (level >= 30) on the project via members_all.get(author_id). Results are cached per author_id within a single call to avoid repeated API lookups. System notes (merge events, label changes) bypass the author check. Agent-authored notes are trusted via their author’s Developer+ access (the bot account holds Developer access on configured projects). Note: agent note detection for redaction purposes (stripping transcripts and footers) uses session marker presence, but trust is always based on the author’s access level, not the marker. This prevents marker injection from granting trust to non-member notes.

  2. Redaction, not sanitization. Untrusted notes in a discussion that also contains trusted notes are replaced with a fixed placeholder ([redacted: non-member comment]). The placeholder preserves the conversation structure (the model sees that someone replied, but not what they said). Discussions where all notes are untrusted are dropped entirely. No attempt is made to sanitize or escape untrusted content – there is no reliable escaping mechanism for LLM prompts, so the only safe option is to withhold the content entirely.

  3. Agent note stripping. Agent-authored notes contain large <details><summary>Agent transcript</summary>...</details> blocks (15K-90K chars of raw tool calls), metadata footers, and session markers. These are stripped before the note enters the model prompt, leaving only the review text. This serves dual purposes: token efficiency and avoiding feeding the model its own raw tool call history (which could cause degenerate self-referential loops).
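
The trust gate and redaction rules above can be sketched as follows; `filter_discussion`, `get_access_level`, and the note shape are hypothetical stand-ins for the real tool's internals:

```python
# Sketch of layers 1 and 2, assuming hypothetical helper names.
# get_access_level(author_id) stands in for members_all.get(author_id).
DEVELOPER_ACCESS = 30
REDACTED = "[redacted: non-member comment]"

def filter_discussion(notes, get_access_level):
    """Return notes with untrusted bodies redacted, or None to drop."""
    cache = {}  # author_id -> trusted?  (one cache per tool call)

    def trusted(note):
        if note.get("system"):           # merge events, label changes
            return True
        author_id = note["author_id"]
        if author_id not in cache:
            cache[author_id] = get_access_level(author_id) >= DEVELOPER_ACCESS
        return cache[author_id]

    flags = [trusted(n) for n in notes]
    if not any(flags):
        return None                      # all notes untrusted: drop discussion
    return [n if ok else {**n, "body": REDACTED}
            for n, ok in zip(notes, flags)]
```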

Residual risks and scope limitations:

  • A compromised Developer+ account can inject adversarial content. This is accepted as equivalent to the existing risk of a compromised developer pushing malicious code (which the agent would also process).
  • The trust threshold is project-level (Developer role on the project), not MR-level. A developer on the project can influence any MR’s review.
  • Comments posted after the discussions are fetched but before the review is posted are not seen. This is a TOCTOU gap but has no security impact (the model simply misses late comments).

9.2. Konflux

Uses raw requests against K8s and Kubearchive APIs (not the kubernetes Python SDK). This avoids the heavy kubernetes client dependency for what is essentially bearer-token HTTP with label selectors.

Dual-fetch architecture:

flowchart LR
    subgraph fetch [Fetch Phase]
        KA["Kubearchive<br/>(historical)"]
        K8s["K8s API<br/>(live)"]
    end
    Combine["Combine items"]
    Dedup["Deduplicate<br/>by metadata.uid"]

    KA --> Combine
    K8s --> Combine
    Combine --> Dedup

For each resource type (PipelineRuns, TaskRuns), the client fetches from both Kubearchive (completed/historical resources) and the live K8s API, combines the results, and deduplicates by metadata.uid. This ensures no resources are missed regardless of whether they have been archived yet.
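
A minimal sketch of the combine-and-deduplicate step, with a hypothetical function name (the real client applies this per resource type):

```python
def combine_and_dedup(kubearchive_items, live_items):
    """Combine archived and live results, deduplicating by metadata.uid.

    First occurrence wins; order within each source is preserved."""
    seen, out = set(), []
    for item in list(kubearchive_items) + list(live_items):
        uid = item["metadata"]["uid"]
        if uid not in seen:
            seen.add(uid)
            out.append(item)
    return out
```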

Two label selectors. Konflux uses different labels for BUILD and TEST PipelineRuns:

  • BUILD: pipelinesascode.tekton.dev/sha=<commit_sha>
  • TEST: pac.test.appstudio.openshift.io/sha=<commit_sha>

Each get_pipelineruns()/get_taskruns() call queries both selectors and combines the results.
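
Sketched with a hypothetical `query` callable standing in for one labelled list call:

```python
# The BUILD/TEST selectors from above; query(selector) is a hypothetical
# stand-in for a single label-filtered list request.
BUILD_SELECTOR = "pipelinesascode.tekton.dev/sha={sha}"
TEST_SELECTOR = "pac.test.appstudio.openshift.io/sha={sha}"

def get_pipelineruns(sha, query):
    """Query both selectors and combine the results (sketch)."""
    items = []
    for selector in (BUILD_SELECTOR, TEST_SELECTOR):
        items.extend(query(selector.format(sha=sha)))
    return items
```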

Pod log fetching. get_pod_log() accepts an optional container parameter. When specified, it fetches logs for that single container. When omitted, it discovers all containers from the pod spec and concatenates their logs with === container_name === headers. Each individual log fetch tries Kubearchive first, then falls back to the live K8s API. Logs may be unavailable from both sources if the pod has expired.
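
The container discovery and fallback logic might look like this sketch, with the two fetchers injected as hypothetical callables (each returning None when logs are unavailable from that source):

```python
def get_pod_log(pod, fetch_archive_log, fetch_live_log, container=None):
    """Fetch one container's log, or all containers concatenated (sketch)."""
    if container:
        names = [container]
    else:
        names = [c["name"] for c in pod["spec"]["containers"]]
    parts = []
    for name in names:
        log = fetch_archive_log(name)      # Kubearchive first
        if log is None:
            log = fetch_live_log(name)     # fall back to live K8s API
        if log is not None:                # both failed: note the gap, move on
            parts.append(f"=== {name} ===\n{log}")
    return "\n".join(parts)
```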

Streaming pagination. K8s list endpoints can return pages of 80+ MB when a commit touches many components (e.g. 500 PipelineRuns per page). iter_paginated() uses session.get(stream=True) and ijson.parse() to stream-parse each page: items are yielded one at a time via ObjectBuilder, and the metadata.continue pagination token is captured from the same parse pass. resp.raw.decode_content = True is set so urllib3 transparently decompresses gzip/deflate Content-Encoding inline (Kubearchive returns gzip-compressed responses). Only one item is in memory at a time – O(single_item) regardless of page size. Truncated or malformed streams raise IncompleteJSONError, which is caught and treated like a network error (log a warning, stop iterating that source).
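
The pagination-token handling reduces to the following sketch; `fetch_page` is a hypothetical stand-in for one streamed HTTP request (the real client parses each page incrementally with ijson rather than materializing it as a dict):

```python
def iter_paginated(fetch_page):
    """Yield items one at a time across pages, following metadata.continue."""
    token = None
    while True:
        page = fetch_page(token)
        for item in page.get("items", []):
            yield item
        # The continuation token is captured from the same parse pass.
        token = page.get("metadata", {}).get("continue")
        if not token:
            return
```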

Credential resolution. KonfluxClient.__init__() parses the kubeconfig file to extract the API server URL and bearer token for the cluster. The cluster domain is extracted from the cluster_url config value. This approach avoids depending on kubectl or the kubernetes Python library for API authentication.

Response metadata: {"konflux_ui": "https://konflux-ui.apps.<domain>/ns/<namespace>"} is merged into every response so the model can build reviewer-facing links like [{name}]({konflux_ui}/pipelinerun/{name}).

9.3. Testing Farm

Uses requests.Session with a module-level retry adapter (429, 5xx) for resilience against transient errors.

Tools:

  • tf_get_results – JUnit XML results for a request ID
  • tf_get_test_log – individual test log by URL (from results.xml); restricted to the ARTIFACTS_BASE prefix to prevent SSRF
  • tf_get_request_status – request state, queue/run times

ToolDef notes on tf_get_results document the XML structure: //testsuites/@overall-result, //testcase/@result, //testcase/logs/log with @name and @href. This domain knowledge goes into the system prompt so the model knows how to parse the XML in sandbox_exec with Python's xml.etree.ElementTree.

Response metadata: {"artifacts_base": "https://artifacts.osci.redhat.com/testing-farm"} is merged into every response so the model can build artifact links like [{request_id}]({artifacts_base}/{request_id}/).

9.4. Registration flow

data_sources.register_selected() is the entry point called by runner.py. It iterates the workflow config’s data_sources dict and registers only the declared sources:

for ds_name, ds_cfg in wf_cfg.data_sources.items():
    if ds_name == "gitlab":
        token = resolve_tool_token(proj_entry, "gitlab", wf_cfg)
        gitlab.register(registry, gitlab_url, token)
    elif ds_name == "konflux":
        cluster_url = resolve_cluster_url(ds_cfg)
        kubeconfig_path = os.environ.get(ds_cfg.kubeconfig_env, "")
        konflux.register(registry, cluster_url, kubeconfig_path, ds_cfg.kubearchive_url)
    elif ds_name == "testing_farm":
        testing_farm.register(registry)

This selective registration means a workflow with data_sources: {gitlab: {...}} only exposes GitLab tools to the model. Konflux and Testing Farm tools do not appear in the tool definitions, preventing the model from attempting to use unconfigured data sources.

9.5. Data flow: streaming vs buffered

Data moves from external APIs through the orchestrator to the model (inline or auto-spilled to sandbox). The memory profile of each tool depends on whether the HTTP response is consumed incrementally or buffered entirely.

Streaming (preferred for large/unbounded responses). The HTTP response is consumed incrementally – via ijson stream parsing, iterator=True in python-gitlab, or lazy pagination. Only one chunk or item is in memory at a time. Used by:

  • iter_paginated() (Konflux) – session.get(stream=True) + ijson
  • get_commit_statuses() (GitLab) – statuses.list(iterator=True) via _iter_commit_statuses(), yielding JSONL bytes one status at a time
  • get_repo_archive() (GitLab) – repository_archive(iterator=True), returns chunked tar.gz binary via StreamingResponse(suffix=".tar.gz", binary=True) – no text preview is generated
  • get_job_log() (GitLab) – trace(iterator=True) with per-chunk ANSI stripping, returns StreamingResponse(suffix=".log")

All produce StreamingResponse objects that flow through stream_to_file() into the sandbox without accumulating in orchestrator memory. StreamingResponse.suffix controls the auto-spill filename extension (.jsonl, .tar.gz, .log). StreamingResponse.binary controls whether a head preview is captured (False for text, True to skip for opaque binary formats).
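
A sketch of the StreamingResponse shape and spill path described above; the field names follow the text, while `PREVIEW_BYTES` and `write_chunk` are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class StreamingResponse:          # field names as described in the text
    chunks: Iterator[bytes]
    suffix: str                   # auto-spill filename extension
    binary: bool = False          # True: skip the head preview

PREVIEW_BYTES = 2048              # hypothetical preview cap

def stream_to_file(resp, write_chunk):
    """Spill chunks to the sandbox; return a head preview for text streams."""
    head = b""
    for chunk in resp.chunks:
        write_chunk(chunk)        # e.g. append to the sandbox spill file
        if not resp.binary and len(head) < PREVIEW_BYTES:
            head += chunk
    if resp.binary:
        return None               # opaque binary: no preview
    return head[:PREVIEW_BYTES].decode("utf-8", "replace")
```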

Buffered (acceptable for small/bounded responses). The full response is loaded into memory as a string or dict. This is fine when the response size is bounded and small (typically < 1 MB). Used by:

  • get_mr_details() (GitLab) – single MR object, < 10 KB
  • get_file_at_ref() (GitLab) – single file, bounded by repo constraints
  • tf_get_results(), tf_get_test_log(), tf_get_request_status() (Testing Farm) – XML/text/JSON, typically < 1 MB

Buffered with risk (candidates for future streaming). Same buffered pattern but the response size is not bounded by design. Auto-spill mitigates the context growth problem (large outputs are written to sandbox files), but the orchestrator still spikes RSS during the fetch:

  • get_pod_log() (Konflux) – resp.text per container, concatenated. Multi-container pods accumulate all logs.
  • get_mr_diff() (GitLab) – full diff JSON, scales with MR size. GitLab truncates server-side but the result can still be large.

Invariant: paginated K8s list endpoints must always use streaming. Page sizes scale with the number of components in the commit – a single page can contain hundreds of PipelineRuns or TaskRuns (80+ MB JSON). Buffering these responses risks OOM under normal production workloads, especially with concurrent workflows.

10. Event Pipeline and Production Operations

10.1. Two-stage SQS pipeline (ingress router + FIFO worker)

Events flow through a two-stage pipeline to serialize per-session work:

flowchart LR
  STD["Standard SQS<br/>(ingress)"]
  R["Router"]
  FIFO["SQS FIFO<br/>(work queue)"]
  W["Worker<br/>(poll_loop)"]

  STD -->|ReceiveMessage| R
  R -->|"SendMessage<br/>MessageGroupId"| FIFO
  W -->|"continuation<br/>group=session_id"| FIFO
  FIFO -->|ReceiveMessage| W

Ingress router. events.router_loop() is a single-threaded loop that long-polls the standard (ingress) queue, peeks into each SNS envelope to assign a MessageGroupId, forwards the raw message body to the FIFO queue, then deletes from the standard queue.

  • Note events: MessageGroupId = discussion_id from the webhook body. This serializes replies to the same agent session – FIFO delivers at most one in-flight message per group.
  • Pipeline / MR events: MessageGroupId = SQS MessageId (unique per message). No serialization – each event is its own group.
  • MessageDeduplicationId: Always the SQS MessageId from the standard queue. Absorbs duplicate deliveries from standard SQS’s at-least-once semantics (5-minute dedup window).

If SendMessage to FIFO fails, the message is not deleted from the standard queue and retries via visibility timeout.
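
The group-assignment rules reduce to a sketch like this, assuming the discussion ID lives at object_attributes.discussion_id in the webhook body (field path hypothetical):

```python
import json

def message_group_id(event_type, body, sqs_message_id):
    """Pick the FIFO MessageGroupId per the routing rules above (sketch)."""
    if event_type == "note":
        # Serialize replies within one discussion thread.
        return json.loads(body)["object_attributes"]["discussion_id"]
    # Pipeline / MR events: unique group per message, no serialization.
    return sqs_message_id
```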

FIFO worker. events.poll_loop() long-polls the FIFO queue, decodes SNS envelopes, and dispatches to handle_event. The max_concurrent_agents semaphore caps parallel groups being processed. FIFO guarantees that within a group, only one message is in-flight.

Two-phase processing. When the FIFO worker picks up a webhook (Phase 1), the handler validates the event, resolves the session_id, then posts a continuation message back to the same FIFO with MessageGroupId = session_id. The Phase 1 message is deleted quickly (~1-5s). Phase 2 processes the continuation: loading session state, building WorkflowRequest, and calling _execute_workflow. Because Phase 2 continuations share a MessageGroupId per session, all work for a given session is serialized – even if it spans multiple GitLab discussion threads.

Uniform message format. Both webhook messages (from SNS) and continuation messages use the same SNS-style envelope with gzip+base64 compression. encode_envelope() is the inverse of decode_sns_message(). The consumer doesn’t need to distinguish between webhook and continuation messages at the transport layer.
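
A round-trip sketch of the envelope format, assuming a minimal SNS-style shape (the attribute layout is simplified relative to real SNS notifications):

```python
import base64
import gzip
import json

def encode_envelope(body: dict, source: str, event_type: str) -> str:
    """Build an SNS-style envelope with a gzip+base64 body (sketch)."""
    compressed = base64.b64encode(gzip.compress(json.dumps(body).encode()))
    return json.dumps({
        "MessageAttributes": {
            "source": {"Value": source},
            "event_type": {"Value": event_type},
            "content_encoding": {"Value": "gzip+base64"},
        },
        "Message": compressed.decode(),
    })

def decode_sns_message(raw: str):
    """Inverse of encode_envelope(); also accepts uncompressed bodies."""
    env = json.loads(raw)
    attrs = {k: v["Value"] for k, v in env["MessageAttributes"].items()}
    msg = env["Message"]
    if attrs.get("content_encoding") == "gzip+base64":
        msg = gzip.decompress(base64.b64decode(msg)).decode()
    return attrs["source"], attrs["event_type"], json.loads(msg)
```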

Graceful shutdown. SIGTERM/SIGINT set a shutdown_event. Both loops exit. Messages in SQS become visible again after the visibility timeout for other consumers.

Why two queues. The standard queue receives events from SNS (via subscription). The FIFO queue serializes per-session work. The gitlab-event-forwarder and SNS subscription are unchanged – the routing logic lives entirely in the agent codebase.

10.2. SNS envelope decoding

Messages arrive as SNS notifications with:

  • MessageAttributes.source – event source (e.g., "gitlab")
  • MessageAttributes.event_type – event type (e.g., "pipeline", "merge_request", "note")
  • MessageAttributes.content_encoding – "gzip+base64" for gitlab-event-forwarder, absent for kubernetes-event-forwarder
  • Message – the actual event body (JSON string, or gzip+base64 encoded)

decode_sns_message() handles both encoding formats transparently.

10.3. Event routing

runner.handle_event() dispatches on (source, event_type):

flowchart TD
    Event["Event arrives"]
    Check{"source/type?"}
    Cont["handle_continuation()"]
    Pipeline["handle_pipeline()"]
    MRHandler["handle_merge_request()"]
    Note["handle_note()"]
    Ignore["Ignore"]

    Event --> Check
    Check -->|"agent/continuation"| Cont
    Check -->|"gitlab/pipeline"| Pipeline
    Check -->|"gitlab/merge_request"| MRHandler
    Check -->|"gitlab/note"| Note
    Check -->|"other"| Ignore

    Pipeline --> FilterStatus{"status failed?<br/>source = merge_request_event?"}
    FilterStatus -->|"yes"| PipeLookup["filter: trigger=pipeline<br/>+ ignore_users"]
    FilterStatus -->|"no"| Skip1["Skip"]
    PipeLookup --> Enqueue1["_enqueue_continuation()"]

    MRHandler --> FilterMRAction{"action in open/reopen/update?<br/>draft = false?"}
    FilterMRAction -->|"yes"| MRLookup["filter: trigger=merge_request<br/>+ ignore_users + ignore_branches"]
    FilterMRAction -->|"no"| Skip1b["Skip"]
    MRLookup --> Enqueue2["_enqueue_continuation()"]

    Note --> FilterMRNote{"MR note?<br/>action = create?"}
    FilterMRNote -->|"yes"| NoteType
    FilterMRNote -->|"no"| Skip3["Skip"]

    NoteType{"Note type?"}
    NoteType -->|"author_id == bot_id"| Skip4["Skip (own note)"]
    NoteType -->|"/hummingbird help"| Help["post_simple_reply (help)"]
    NoteType -->|"/hummingbird wf-name"| SlashCmd["_resolve_slash_workflow()<br/>→ _enqueue_continuation()"]
    NoteType -->|"DiscussionNote reply"| FindSession["find_session_for_reply()"]
    NoteType -->|"other"| Skip5["Skip"]

    FindSession --> Found{"session found?"}
    Found -->|"yes"| Enqueue3["_enqueue_continuation()<br/>(resume_session)"]
    Found -->|"no"| Skip6["Skip"]

    Cont --> ContType{"work_type?"}
    ContType -->|"new_session"| ExecNew["_execute_workflow()"]
    ContType -->|"resume_session"| LoadS3["load session from S3<br/>→ _execute_workflow()"]

Phase 1 handlers (handle_pipeline, handle_merge_request, handle_note) perform validation, auth checks, and session resolution, then post a WorkOrder continuation message via _enqueue_continuation(). They do not call _execute_workflow directly.

Phase 2 (handle_continuation) receives the WorkOrder, resolves the workflow config, builds a WorkflowRequest, and calls _execute_workflow. For resume_session orders, it loads the previous session from S3. _execute_workflow handles SHA dedup and per-workflow rate limiting via scan_agent_threads.

Slash commands dispatch to a single named workflow via _resolve_slash_workflow (exact match, then prefix match). Help and error replies are posted via post_simple_reply (no session marker, no rate limit impact).

10.4. Pipeline trigger design

Why gitlab::pipeline instead of kubernetes::PipelineRun: A GitLab pipeline event fires once when the pipeline completes. Since Konflux external stages are attached to the pipeline, the event naturally waits for all builds and tests to finish before triggering. This means the agent sees the full picture in one event, without needing to deduplicate or wait for stragglers.

Trigger filters:

  • status in {failed} – only failed pipelines trigger analysis
  • source == "merge_request_event" – only MR pipelines, not branch/tag

10.4a. Merge request trigger design

The handle_merge_request handler fires on MR open, reopen, and update events (action in {"open", "reopen", "update"}). Draft MRs are skipped (object_attributes.draft == True); marking a draft as ready triggers a review since the event arrives with draft: false.

Event classification is deliberately simple: the handler does not inspect oldrev or changes.draft fields. Instead, SHA-based deduplication in _execute_workflow (via JSON session markers) ensures each code revision is reviewed at most once. Metadata-only updates (title, label changes) on an already-reviewed SHA are silently skipped.

10.5. Note trigger design

Two sub-flows:

Slash command (/hummingbird <workflow-name>): Triggers a specific named workflow, bypassing rate limiting. _resolve_slash_workflow performs exact-match lookup first, then falls back to unique prefix matching. If the subcommand is missing or is “help”, _format_help returns a list of available workflows with descriptions. Ambiguous or unknown subcommands produce an error message via post_simple_reply. The note author must have Developer+ access (level >= 30) on the project, checked via check_member_access(). If denied, the agent replies in the same discussion thread with a short access-denied message.

Reply to agent note: When a user replies to an existing agent note (which contains a JSON session marker):

  1. find_session_for_reply() walks the discussion thread, filtering by bot author ID (get_bot_user_id()), and returns the first matching session marker. All bot markers in a thread share the same session ID.
  2. handle_note posts a resume_session continuation to the FIFO (grouped by session_id).
  3. handle_continuation loads the session from S3 (conversation history and sandbox archive), builds a WorkflowRequest, and calls _execute_workflow() to resume with the user’s reply.
  4. If the session is not found (expired/deleted), the agent replies with a message explaining that the session has expired and suggesting that the user start a new run.

Reply authors are also subject to the Developer+ access check. If denied, the agent replies in the discussion thread with the same access-denied message.

Non-agent-directed notes: Regular comments that are neither slash commands nor replies to agent threads are silently ignored (debug-level log). No access check is performed for these.

Self-note filtering: Notes where author_id matches the bot’s own user ID (resolved via get_bot_user_id()) are skipped immediately. This prevents infinite loops where the agent’s own output triggers another agent run. The author-ID check replaces the previous substring check (marker_prefix_str in note_body), which was vulnerable to denial-of-service: a user including the marker prefix in their reply would cause the bot to silently ignore it.

Threading logic for replies: If the project requires internal notes (internal_notes: true) but the original discussion was public, the reply is posted as a new top-level internal note instead of replying in the public thread. This prevents leaking internal analysis into public threads.

10.6. Note lifecycle

1. create_placeholder_note()   -> "Running the <workflow> workflow..."
                                   + JSON session marker (id, wf, sha)
2. run_workflow()              -> agent loop
3. update_note()               -> replace placeholder with result
                                   + reply prompt + JSON session marker

Reply notes (from session resumption) omit the reply prompt since the
user is already engaged in the conversation.

On failure, the placeholder is updated to “Hummingbird analysis failed.” The JSON session marker is always present so the note can be identified as agent-generated (for rate limiting, SHA dedup, and self-note filtering). The marker embeds the workflow name and commit SHA, enabling per-workflow rate limiting and SHA-based deduplication.

10.7. Rate limiting and SHA deduplication

scan_agent_threads() performs a single pass over MR discussions, parsing JSON session markers (<!-- hummingbird-session: {"id":...,"wf":...,"sha":...} -->) to determine:

  1. Per-workflow thread count – how many threads the given workflow has created on this MR. If the count meets or exceeds max_runs_per_mr, the workflow is skipped.
  2. SHA dedup – whether the current commit SHA has already been reviewed by this workflow. If so, the workflow is skipped (prevents re-reviewing the same code on metadata-only MR updates).

Slash commands and replies bypass both checks entirely.

The rate limit is per-workflow per-MR: different workflows maintain independent thread counts on the same MR. JSON markers are backward compatible – old plain-UUID markers (<!-- hummingbird-session: UUID -->) are parsed as {"id": "UUID"} with no workflow or SHA information, so they are not counted toward any specific workflow’s limit.
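
Marker parsing and the two checks can be sketched as follows; `parse_marker` and `should_skip` are hypothetical names:

```python
import json
import re

MARKER_RE = re.compile(r"<!-- hummingbird-session: (.*?) -->")

def parse_marker(note_body):
    """Extract the session marker from a bot note body, if present."""
    m = MARKER_RE.search(note_body)
    if not m:
        return None
    raw = m.group(1)
    try:
        return json.loads(raw)          # new JSON format
    except json.JSONDecodeError:
        return {"id": raw}              # old plain-UUID format

def should_skip(bot_note_bodies, workflow, sha, max_runs_per_mr):
    """True if the rate limit or SHA dedup says to skip this run."""
    markers = [p for p in map(parse_marker, bot_note_bodies) if p]
    wf_markers = [p for p in markers if p.get("wf") == workflow]
    hit_rate_limit = len(wf_markers) >= max_runs_per_mr
    sha_seen = any(p.get("sha") == sha for p in wf_markers)
    return hit_rate_limit or sha_seen
```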

11. Session System

Sessions enable conversation continuity: a user can reply to an agent note and the agent picks up where it left off, with full context and sandbox files restored.

11.1. What gets saved

Three artifacts are saved after each run:

  • context.json – the full contents list from the agent loop. This is the complete conversation history in Gemini wire format (user turns, model turns with tool calls, tool response turns). It is the minimum state needed to resume the conversation.
  • transcript.md – a human-readable markdown rendering of the run for debugging and auditing. Not used for resumption.
  • sandbox.tar.gz – an archive of /tmp/data/ (the sandbox working directory, which includes the _out/ spill subdirectory). Contains PipelineRun JSONs, test logs, jq output, and any other files the model created during the run. Restored into the new sandbox on resumption so the model can reference its previous work.

11.2. Storage backends

S3 (production):

s3://{bucket}/sessions/{session_id}/context.json
s3://{bucket}/sessions/{session_id}/transcript.md
s3://{bucket}/sessions/{session_id}/sandbox.tar.gz

Local directory (development, --save-session):

{directory}/context.json
{directory}/transcript.md
{directory}/sandbox.tar.gz

Both backends have the same interface. S3 save is best-effort: wrapped in try/except so a transient S3 error does not prevent the MR note from being posted. The note always gets delivered first.

11.3. Sandbox archive transport

The sandbox archive requires special handling because the orchestrator cannot directly access the sandbox filesystem (no volume mounts):

Archive (inside sandbox):
  sb.exec("tar czf /tmp/_archive.tar.gz -C /tmp data")

Stream out (sandbox -> host temp file):
  for chunk in sb.read_file_iter("/tmp/_archive.tar.gz"):
      tmp.write(chunk)                     # 64 KB chunks, no base64

Restore (host temp file -> new sandbox):
  sb.stream_exec("tar xzf - -C /tmp", file_chunks())

Both directions use streaming binary I/O via Popen stdin/stdout pipes. No base64 encoding is needed: read_file_iter() yields raw bytes from the sandbox via a Popen stdout pipe, and stream_exec() feeds raw bytes into a Popen stdin pipe. The archive is never extracted on the orchestrator – it exists only as opaque bytes being transported between sandboxes. This is a security invariant: the orchestrator never parses or inspects the archive contents.

11.4. Resumption flow

flowchart TD
    Reply["User replies to agent note"]
    FindSession["find_session_for_reply()<br/>walks discussion thread"]
    LoadS3["load_s3(session_id)"]
    NotFound{"session found?"}
    Expired["Post session-expired reply"]
    NewSandbox["Start new sandbox"]
    Restore["restore_sandbox(archive)"]
    BuildContents["contents = old_contents + user_reply"]
    RunLoop["run_agent_loop(<br/>initial_contents, CONTINUATION_PROMPT)"]

    Reply --> FindSession --> LoadS3
    LoadS3 --> NotFound
    NotFound -->|"no"| Expired
    NotFound -->|"yes"| NewSandbox
    NewSandbox --> Restore --> BuildContents --> RunLoop

Key aspects:

  • Fresh iteration budget. The resumed session starts iteration 0 with the full max_iterations budget, regardless of how many iterations the previous session used.
  • Context window is the real limit. A typical 8-iteration cold start uses ~50-100K input tokens. The restored history is sent in full on every API call. The CONTEXT_LIMIT check applies and will force wrap-up if needed.
  • Session ID reuse. The resumed session keeps the original session ID. S3 state is overwritten in place with the updated conversation history. Since GitLab discussions are linear (not branching), there is no need for a tree of sessions – the conversation is a single sequential thread.
  • Graceful degradation. If the S3 session is not found (expired, deleted), the agent posts a reply explaining that the session has expired and suggesting that the user start a new run. It does not fall back to a cold start.

11.5. Session markers

Every agent note contains a hidden HTML comment with a JSON payload:

<!-- hummingbird-session: {"id":"UUID","wf":"code-review","sha":"abc123"} -->

The JSON payload contains:

  • id – session UUID (always present)
  • wf – workflow name (present for new-format markers)
  • sha – commit SHA at the time of review (present for auto-triggered runs)

This marker serves four purposes:

  1. Resumption: find_session_for_reply() searches discussion threads for this marker to find the session ID. Only bot-authored notes are considered; the first matching marker wins.
  2. Per-workflow rate limiting: scan_agent_threads() counts discussion threads for a specific workflow using the wf field. Only bot-authored notes are scanned.
  3. SHA deduplication: scan_agent_threads() checks whether the current SHA has already been reviewed by the workflow using the sha field.
  4. Self-filtering: handle_note() compares the webhook event’s author_id against the bot’s own user ID (via get_bot_user_id()) to prevent infinite loops.

All marker-scanning functions resolve the bot user ID via gl.auth() (GET /user) on the orchestrator token. This ensures markers in non-bot notes (user replies, external contributors) are never parsed, preventing session hijacking via injected markers.

Old plain-UUID markers (<!-- hummingbird-session: UUID -->) are parsed as {"id": "UUID"} for backward compatibility. They remain usable for resumption, but they carry no wf/sha information, so they do not count toward per-workflow rate limiting or SHA dedup.

12. Workflow System

12.1. Separation of prompt and metadata

A workflow has two parts that live in different places:

  • Prompt (.md file): the LLM system prompt, verbatim. This is the investigation strategy, tool usage guidance, data patterns, and output format. Pure text, no code.
  • Metadata (YAML config): operational settings – action, model, max_iterations, data_sources, projects. This controls what the orchestrator does with the workflow, not what the model does.

This separation means changing the analysis strategy (e.g., adding a new investigation step) requires editing a markdown file. Changing operational parameters (e.g., which projects use this workflow, which model to use) requires editing the YAML config. Neither requires a code change.

12.2. System prompt layering

The system prompt sent to the model is built from three layers:

BASE_SYSTEM_PROMPT        # agent.py: sandbox rules, tool usage tips
+ tool_notes              # from ToolRegistry: per-data-source domain knowledge
+ workflow_body           # full content of e.g. workflows/analyze-failures.md

On session resumption, a fourth layer is appended:

+ CONTINUATION_PROMPT     # agent.py: "do not re-run the full workflow"

The system prompt is rebuilt every iteration (to allow warning suffixes to be appended), but the base content is stable. Warning suffixes are appended at the end so they override earlier instructions.
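
The layering can be sketched as a simple concatenation, with hypothetical parameter names:

```python
def build_system_prompt(base, tool_notes, workflow_body,
                        resumed=False, continuation_prompt="",
                        warning_suffix=""):
    """Rebuild the layered system prompt for one iteration (sketch).

    The warning suffix goes last so it overrides earlier instructions."""
    parts = [base, tool_notes, workflow_body]
    if resumed:
        parts.append(continuation_prompt)
    if warning_suffix:
        parts.append(warning_suffix)
    return "\n\n".join(p for p in parts if p)
```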

12.3. Design choice: strategy, not procedure

The workflow .md file describes strategy and guidance, not a rigid script. The model decides when and how to use each tool based on what it sees.

This matters because merge request failures are diverse. A fixed procedure would either miss edge cases (e.g., build failures before tests ran) or waste iterations on steps that don’t apply. By giving the model a strategy (“identify failing tests, fetch details, analyze root causes, group by similarity”), it can adapt to whatever it encounters.

The workflow does structure the investigation into phases (Data Collection, Analysis, Output) for clarity, but these are guidelines, not enforced checkpoints.

12.4. Workflow file anatomy (analyze-failures.md)

The primary workflow is structured as:

  1. Scope and approach – what this workflow analyzes and what it ignores
  2. Input – event JSON format (project, iid)
  3. Data Collection Phase:
    • Data.1: Fetch MR details (direct call, small response)
    • Data.2: Fetch commit statuses, identify failures
    • Data.3: Batch fetch PipelineRuns + TaskRuns (fetch_batch_to_sandbox)
    • Data.4: Process with jq, extract Testing Farm data in bulk
  4. Analysis Phase:
    • Analysis.1a: Investigate each failed PipelineRun individually
    • Analysis.1b: Summarize and group by root cause
  5. Output – markdown template with root causes, collapsible details, clickable links (PipelineRun, Testing Farm, test logs)
  6. Error Handling – partial report philosophy

Design considerations embedded in the workflow:

  • Efficient bulk fetching: PipelineRuns and TaskRuns are fetched in one fetch_batch_to_sandbox call, not individually.
  • Testing Farm data extracted in Data.4, analyzed in Analysis.1: All TF results.xml are fetched in bulk before analysis starts, enabling cross-failure pattern detection.
  • Log fetching is selective: Only 1-2 representative logs per failure pattern, not all logs. This keeps iteration count bounded.
  • Reviewer-facing URLs: The output template instructs the model to include clickable links using konflux_ui and artifacts_base from response metadata.

12.5. Prompt file resolution

get_prompt(workflow_cfg, config_dir) resolves the prompt: field relative to the config file’s directory:

prompt_path = config_dir / workflow_cfg.prompt
# e.g. /app/config.yaml with prompt: workflows/analyze-failures.md
# -> /app/workflows/analyze-failures.md

In the container image, workflows are baked in at /app/workflows/. A ConfigMap can override them at deploy time by mounting at the same path.

13. Deployment and Container

13.1. Container build strategy

The Containerfile uses an all-RPM builder+installroot+scratch pattern (no pip, no venv):

  1. Builder stage: Uses a Fedora-based builder image with dnf. Installs all dependencies as RPMs into a clean --installroot.
  2. Application code: Copies hummingbird_agent/ and workflows/ into the installroot at /app/.
  3. Final stage: FROM scratch, copies the entire installroot. No package manager, no shell beyond what RPMs provide.

RPM dependencies: python3-boto3, python3-google-auth+requests, python3-gitlab, python3-kubernetes, python3-pyyaml, python3-requests, python3-sentry-sdk, kubernetes1.35-client (for kubectl).

This approach was chosen over pip because:

  • All deps come from Fedora’s package repository – no PyPI supply chain risk
  • Smaller image (~22 MB saved vs. google-genai SDK alone)
  • Reproducible builds from known RPM versions
  • No compilation step (no gcc/python3-devel in the image)

13.2. Container runtime properties

CMD ["python3", "-m", "hummingbird_agent", "serve"]
WORKDIR /app
USER 65532

The default CMD runs serve mode for production. Local development uses run mode via explicit command override. USER 65532 matches the standard nonroot UID used by distroless images and the Podman sandbox.

13.3. K8s deployment manifests

Located in hummingbird-agent/kubernetes/:

deployment.yaml:

  • Single replica with rolling update (maxSurge: 1, maxUnavailable: 0)
  • terminationGracePeriodSeconds: 900 (15 minutes) to allow in-flight agent runs to complete on shutdown
  • ServiceAccount: hummingbird-agent
  • Secrets mounted from K8s Secret hummingbird-agent
  • Commented-out mounts for workflow ConfigMap and custom CA trust

rbac.yaml (applied in the sandbox namespace):

  • ServiceAccount hummingbird-agent
  • Role with pods: create, get, list, delete, patch and pods/exec: create
  • RoleBinding linking the SA to the Role
  • patch is required for pool mode (relabeling pods during claim)

sandbox-pool.yaml (applied in the sandbox namespace):

  • Deployment with replicas: 3 (tuned to balance latency vs cost)
  • Pods labelled hummingbird/role: standby for pool discovery
  • Same security context and resource limits as direct-creation pods
  • revisionHistoryLimit: 2, rolling update with maxSurge: 1, maxUnavailable: 0

networkpolicy.yaml:

  • podSelector: {} – applies to all pods in the namespace
  • egress: [] – deny all outbound traffic
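
The two bullets above translate into a very small manifest. A sketch (the metadata name is an assumption):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-egress
spec:
  podSelector: {}          # matches every pod in the namespace
  policyTypes: [Egress]
  egress: []               # no egress rules: all outbound traffic is denied
```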

secret.yaml:

  • Template with all required env vars (API keys, tokens, URLs)
  • Values must be populated per deployment

13.4. Workflow mounting

Workflows are baked into the image at /app/workflows/. To update workflows without rebuilding the image:

  1. Create a ConfigMap from the workflows directory
  2. Uncomment the volume mount in the deployment YAML
  3. The ConfigMap mount replaces the baked-in directory (all-or-nothing)

This is useful for rapid iteration in staging without waiting for a new image build.

13.5. Custom CA trust

For clusters with internal CA certificates (common in enterprise environments), the deployment supports OpenShift’s CA injection:

  1. Create a ConfigMap with the config.openshift.io/inject-trusted-cabundle label
  2. Mount it at /etc/pki/custom
  3. Set REQUESTS_CA_BUNDLE=/etc/pki/custom/ca-bundle.crt

OpenShift automatically injects the cluster’s CA bundle into the ConfigMap.
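
A sketch of the ConfigMap from step 1 (the name is an assumption; the label is the OpenShift convention):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-ca
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
data: {}   # ca-bundle.crt is populated by OpenShift after the label is seen
```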

14. Design Decision Registry

Each entry records a decision, the alternatives considered, why the chosen approach won, and what would break if the decision were reversed. This is the most important section for avoiding regressions.

kubectl exec over kubernetes Python exec API

  • Chosen: kubectl exec as subprocess for sandbox command execution.
  • Alternative: kubernetes Python client stream() API.
  • Why: Three showstopper bugs in the Python client: (1) the WebSocket exec protocol (v1-v4) has no stdin EOF signal, so cat > /file hangs forever; (2) BrokenPipeError on stdin larger than ~1MB; (3) stream() accumulates all stdout in memory with no control. kubectl avoids all three and provides the same subprocess interface as Podman.
  • If reversed: write_file() would hang or fail on large data. Pod log retrieval with large outputs would OOM. Sandbox reliability would drop significantly.
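
A minimal sketch of the chosen approach. The function names and flags shown here are illustrative, not the actual implementation; the point is that subprocess.run closes stdin on completion, which gives the pod-side `cat` the EOF the Python client's stream() cannot deliver.

```python
import subprocess

def build_exec_cmd(namespace: str, pod: str, argv: list[str],
                   stdin: bool = False) -> list[str]:
    """Assemble a kubectl exec command line (names are illustrative)."""
    cmd = ["kubectl", "exec", "-n", namespace]
    if stdin:
        cmd.append("-i")          # stream stdin; closed on process exit
    cmd += [pod, "--"] + argv
    return cmd

def write_file(namespace: str, pod: str, path: str, data: bytes) -> None:
    """Pipe data into the pod. subprocess.run closes stdin when `input`
    is exhausted, so `cat > path` terminates cleanly."""
    subprocess.run(
        build_exec_cmd(namespace, pod, ["sh", "-c", f"cat > {path}"],
                       stdin=True),
        input=data, check=True)
```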

YAML config over env vars for operational settings

  • Chosen: Single YAML file for workflows, projects, limits, data source declarations.
  • Alternative: Everything in env vars (original design).
  • Why: Env vars cannot express structured data (workflow-project mappings, per-project token overrides, data source config with multiple fields). YAML provides structure, validation, and audit trails.
  • If reversed: Token scoping would be lost (no per-workflow-project token overrides). Project allowlists would be impossible. The config would be unauditable.

Token env var names in YAML, not values

  • Chosen: YAML contains token_env: GITLAB_TOKEN_RO (the env var name), not the actual token value.
  • Alternative: Inline secrets in YAML, or env vars for everything.
  • Why: The YAML file can be committed, reviewed, and audited. Actual secrets stay in env vars (injected via K8s Secrets). Inline secrets would make the config file a secret itself.
  • If reversed: The config file would become a secret, breaking audit trails and code review workflows.
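
The indirection is a single lookup at runtime. A sketch (the helper name is an assumption):

```python
import os

def resolve_token(token_env: str) -> str:
    """YAML names the variable (e.g. GITLAB_TOKEN_RO); the secret itself
    only ever exists in the process environment, injected via K8s Secrets."""
    value = os.environ.get(token_env)
    if not value:
        raise RuntimeError(f"token env var {token_env!r} is not set")
    return value
```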

Orchestrator token prefix convention (ORCHESTRATOR_*)

  • Chosen: Orchestrator tokens use ORCHESTRATOR_GITLAB_TOKEN_* env var names by convention.
  • Alternative: Same env var namespace as model tokens, distinguished by context.
  • Why: Structural separation makes it impossible to accidentally pass an orchestrator token to a model tool (or vice versa). The ORCHESTRATOR_ prefix is never used in YAML token_env fields.
  • If reversed: A misconfiguration could leak write-capable tokens to the model, which could then expose them via tool calls.
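
The convention is cheap to enforce mechanically at config-load time. A sketch of such a guard (hypothetical, not the actual validation code):

```python
def check_model_token_env(token_env: str) -> str:
    """Reject any attempt to route an orchestrator credential to a model tool."""
    if token_env.startswith("ORCHESTRATOR_"):
        raise ValueError(
            f"{token_env} is an orchestrator token and must never "
            "appear in a YAML token_env field")
    return token_env
```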

Auto-spill over conversation compaction

  • Chosen: Large outputs are saved to sandbox files with previews returned to the model.
  • Alternative: Replace old tool results in contents with compact summaries (conversation compaction).
  • Why: Compaction requires modifying the contents list, which risks violating Gemini API constraints (model turns must match preceding tool turns). Auto-spill achieves ~90% of the token reduction with zero risk of breaking the conversation structure.
  • If reversed: Token usage would increase ~15-30%. Per-call context would grow unbounded. Sessions would hit the context limit much sooner. Conversation compaction could be added on top of auto-spill in the future, but is not needed with current workloads.
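
The spill path can be sketched in a few lines. The threshold, preview length, and `save_to_sandbox` callback are all illustrative assumptions:

```python
SPILL_THRESHOLD = 10_000   # bytes; illustrative value
PREVIEW_LEN = 1_000

def maybe_spill(output: str, save_to_sandbox) -> str:
    """Return small outputs verbatim; spill large ones to a sandbox file
    and hand the model only a preview plus the file path."""
    if len(output) <= SPILL_THRESHOLD:
        return output
    path = save_to_sandbox(output)   # assumed callback: data -> sandbox path
    return (f"[output spilled to {path}; first {PREVIEW_LEN} chars follow]\n"
            + output[:PREVIEW_LEN])
```

The conversation history only ever sees the return value, so large items never enter `contents` in the first place.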

Iteration count over cumulative token budget

  • Chosen: max_iterations as the primary cost control lever.
  • Alternative: Cumulative token budget (stop when total billed tokens exceed a threshold).
  • Why: With full history replay, each API call re-sends everything. Cumulative billed tokens double-count: call 1 = 5K, call 2 = 10K (including 5K again), total = 15K billed but only 10K new content. With auto-spill keeping per-call size bounded, iteration count is a much simpler and more predictable proxy for actual cost.
  • If reversed: The budget model would be confusing and inaccurate. Cost estimates would be wrong. The cumulative metric is still logged for observability, but it does not drive termination.
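
The double-counting above can be worked through directly. With full history replay, call k re-sends everything from calls 1..k-1:

```python
def cumulative_billed(new_per_call: list[int]) -> int:
    """Total billed tokens when every call replays the full history."""
    billed = 0
    history = 0
    for new in new_per_call:
        history += new       # conversation grows by the new content
        billed += history    # but the whole conversation is billed again
    return billed
```

Two calls adding 5K tokens each bill 15K total (5K + 10K), even though only 10K of content exists, which is why cumulative billed tokens make a poor budget metric.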

ToolDef notes in system prompt, not tool schema

  • Chosen: Domain knowledge (Konflux dual-fetch, TF XML structure) goes in ToolDef.notes, injected into the system prompt.
  • Alternative: Put everything in the tool schema description.
  • Why: Tool schemas have character limits and are sent in the tools field of every API call. Long descriptions waste tokens on tool defs. System prompt notes are sent once and can be arbitrarily detailed.
  • If reversed: Tool descriptions would be bloated. Domain knowledge would need to be duplicated in every workflow .md file.

Full history replay (no compaction)

  • Chosen: The contents list grows monotonically. Items are never removed or modified (except empty response retries).
  • Alternative: Compact old turns to reduce context size.
  • Why: The Gemini API requires strict turn-by-turn structure. Modifying or removing items risks invalid conversation structure. Auto-spill handles the growth problem at the source (preventing large items from entering history).
  • If reversed: Risk of Gemini API errors from malformed conversation structure. Risk of confusing the model (it references previous results that have been summarized away).

fetch_to_sandbox kept alongside auto-spill

  • Chosen: Both fetch_to_sandbox and auto-spill coexist.
  • Alternative: Remove fetch_to_sandbox since auto-spill handles large outputs automatically.
  • Why: fetch_to_sandbox provides explicit path control (model can choose meaningful filenames), batch fetching (one tool call for multiple fetches), and no-preview responses (useful when the model will process with jq anyway).
  • If reversed: The model would lose path control and batch fetching would require multiple auto-spilled calls. The workflow would need more iterations to achieve the same result.

K8s library for lifecycle + kubectl for exec (hybrid)

  • Chosen: Use the kubernetes Python library for pod create/read/delete, kubectl subprocess for exec.
  • Alternative: All-kubectl (subprocess for everything) or all-library.
  • Why: The library provides typed pod status polling and clean error handling for lifecycle. kubectl provides reliable exec (see “kubectl exec over kubernetes Python exec API” above). Using the library for exec would require solving the three WebSocket bugs.
  • If reversed: Either lifecycle management would be fragile (parsing kubectl JSON output for pod status) or exec would be unreliable.

Single-file YAML config over multiple config files

  • Chosen: Everything in one config.yaml with settings and workflows sections.
  • Alternative: Separate files per workflow, or settings.yaml + workflows.yaml.
  • Why: Single source of truth. All project-workflow mappings visible in one place. Settings-level defaults flow down to all projects. No file discovery logic needed.
  • If reversed: Token resolution would need cross-file lookups. Project index would need multi-file aggregation. Config validation would be more complex.
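
The shape described above, as an illustrative fragment. All key names here are assumptions about the schema, not the real config reference (see the operational docs for that):

```yaml
settings:
  max_iterations: 40
  sandbox:
    backend: k8spool
workflows:
  mr-review:
    action: review
    model: gemini
    data_sources:
      gitlab:
        token_env: GITLAB_TOKEN_RO   # env var name, never the secret itself
    projects:
      - path: group/project
```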

Workflow .md as pure prompt, metadata in YAML

  • Chosen: The workflow .md file is pure system prompt text. Metadata (action, model, max_iterations, data_sources, projects) lives in the YAML config.
  • Alternative: YAML frontmatter in the .md file (original design).
  • Why: Separation of concerns. The .md file is the model’s instructions – it should be readable and editable by anyone writing prompts. The YAML config is the orchestrator’s instructions – it controls routing, limits, and credentials. Mixing them in one file conflates two audiences.
  • If reversed: Prompt authors would need to understand YAML config structure. Token/project config would be scattered across .md files instead of centralized.

Raw HTTP for Gemini instead of google-genai SDK

  • Chosen: requests + google-auth for Gemini API calls.
  • Alternative: google-genai Python SDK.
  • Why: The SDK brings ~22 MB of transitive deps (pydantic, httpx, websockets). The Gemini REST API is simple camelCase JSON over HTTPS – one endpoint, one request format. Raw HTTP enables: smaller container, simpler tests (mock requests.post), plain dict conversation history (easy session serialization), no SDK breakage risk, and a models/ package structure that supports multiple backends.
  • If reversed: Container would be ~22 MB larger. contents would use SDK objects instead of plain dicts, complicating session serialization. Adding Claude support would require a separate approach.

Nudge via system prompt suffix, not user message

  • Chosen: Empty response nudge and budget warnings are appended to the system prompt as suffixes.
  • Alternative: Inject synthetic user messages into contents.
  • Why: User messages in contents must come from the actual user (the initial event, or a reply). Injecting synthetic messages pollutes the conversation history that is saved in sessions. System prompt suffixes are transient – they affect one API call without permanently modifying the conversation state.
  • If reversed: Saved sessions would contain synthetic user messages. Resumed conversations would be confusing. The model might respond to the synthetic messages instead of the user’s actual input.

Session ID reuse over new-ID-per-resume

  • Chosen: Resumed sessions keep the original session ID. S3 state is overwritten in place.
  • Alternative: Each resume generates a new UUID, saving alongside the original (append-only tree of sessions).
  • Why: GitLab discussions are linear, not branching. There is no scenario where two different sessions from the same thread are both valid. A new UUID per resume creates orphaned S3 snapshots that are never referenced again, since find_session_for_reply() always picks the latest marker. Reusing the ID is simpler, uses less storage, and matches the linear conversation model.
  • If reversed: S3 would accumulate orphaned session snapshots. Each resume would need a new note marker, but the old markers would still be in the thread, creating confusion about which session is current.

Reply to unauthorized agent-directed notes instead of silent skip

  • Chosen: When an unauthorized user sends a slash command or replies to an agent thread, the agent replies in the same discussion with a short access-denied message. Non-agent-directed notes are silently ignored (debug log, no access check).
  • Alternative: Log a warning and skip silently for all unauthorized notes (the original behavior).
  • Why: Silent skip gives no feedback to someone who intentionally tried to engage the agent, which is confusing. On the other hand, checking access and logging a warning for every random comment on a public project is noisy and pointless. Moving the access check after intent detection cleanly separates the two cases. The reply uses the discussions API so it automatically inherits confidentiality from the parent note/thread.
  • If reversed: Users without sufficient access who try /hummingbird would get no feedback. The agent log would be noisy with warnings for every comment on public MRs.

Deployment-backed pod pool over Python-thread pool

  • Chosen: A Kubernetes Deployment maintains pre-warmed standby pods. The agent claims a pod by relabeling it; the Deployment replaces it.
  • Alternative: A Python-side thread pool that pre-creates pods and queues them for use.
  • Why: The Deployment provides self-healing (restart crashed pods), native scaling (kubectl scale), rolling updates (image changes), and monitoring via standard K8s tooling. A Python pool would need to reimplement all of these.
  • If reversed: Pod replenishment, crash recovery, and image updates would all need custom code. Scaling would require agent redeployment.

Pod isolation: never reuse sandbox pods

  • Chosen: Each workflow run gets its own pod. After cleanup, the pod is deleted. Claimed pods are detached from the ReplicaSet.
  • Alternative: Return used pods to the pool and reset them.
  • Why: Residual state from a previous run (files, environment, running processes) could leak between MR investigations, creating security and correctness risks. Deletion is simple and foolproof.
  • If reversed: Would need a reliable pod-reset mechanism and auditing that no state survives between runs.

Absolute reap-by deadline over relative claimed-at age

  • Chosen: Active pods are annotated with hummingbird/reap-by (an absolute UTC timestamp). The reaper deletes pods where now > reap-by.
  • Alternative: Annotate with claimed-at and compute age relative to max_active_seconds; or use activeDeadlineSeconds on the pod spec.
  • Why: An absolute deadline simplifies the reaper to a single comparison. It also supports varying deadlines: claim sets reap-by = now + max_active_seconds, while linger sets reap-by = now + linger_seconds. With a relative timestamp, the reaper would need to know which mode the pod is in. activeDeadlineSeconds applies from pod creation, not from claim – standby pods would expire before being used.
  • If reversed: The reaper would need mode-aware age calculations. Lingering pods would require a separate annotation or reaper path.

Indefinite claim wait over timeout

  • Chosen: SandboxPool.claim() polls indefinitely (governed by shutdown event), logging at DEBUG then WARNING.
  • Alternative: Timeout after N seconds and raise an error.
  • Why: The pool is typically smaller than max_concurrent_agents for cost reasons. Waiting for a replacement pod is normal operational behavior, not an error. A timeout would cause spurious failures during burst traffic. The SQS semaphore already bounds concurrency.
  • If reversed: Burst traffic would cause avoidable failures instead of brief delays.

Pool config in Deployment only, not in agent configmap

  • Chosen: In pool mode (k8spool), the pod image, resources, labels, and security context are defined solely in the Deployment template. The agent config only needs namespace and active_deadline_seconds.
  • Alternative: Keep image/resources/metadata in the agent configmap too (as in k8s mode).
  • Why: Eliminates config duplication. The Deployment template is the single source of truth. Changes to pod resources or image only require updating one place and re-rolling the Deployment.
  • If reversed: Config drift between the Deployment and the agent configmap would be a constant risk.

Pod lingering over immediate cleanup for reply latency

  • Chosen: After a successful workflow, pool sandbox pods linger for linger_seconds (default 300) with a session-id annotation. Replies within the window reclaim the pod via try_reclaim(), skipping pod creation and S3 archive restoration.
  • Alternative: Always delete the pod immediately and restore from S3 on every reply.
  • Why: Reply latency drops from seconds (pod claim + S3 restore) to near-zero. The S3 archive is still saved as a fallback if the pod is gone. The reap-by annotation ensures lingering pods are cleaned up if no reply arrives. The session-id annotation doubles as a concurrency guard: it is only present while lingering, preventing active pods from being reclaimed.
  • If reversed: Every reply would pay full pod + restore latency, even for immediate follow-ups. User experience for conversational interactions would degrade noticeably.

linger() on Sandbox protocol over isinstance checks in runner

  • Chosen: All sandbox backends implement linger(session_id). Non-pool backends fall back to cleanup(). The runner calls sb.linger() without type checks.
  • Alternative: isinstance(sb, K8sPoolSandbox) in run_workflow().
  • Why: Keeps the runner backend-agnostic. Adding a new backend requires only implementing the protocol, not touching the runner. The fallback behavior is co-located with each backend.
  • If reversed: The runner would need to know about every backend type and their linger capabilities.
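
The protocol shape, sketched with stand-in classes (the real backends manage pods; these just record state for illustration):

```python
from typing import Protocol

class Sandbox(Protocol):
    def cleanup(self) -> None: ...
    def linger(self, session_id: str) -> None: ...

class BaseSandbox:
    """Non-pool backends: linger falls back to cleanup."""
    def cleanup(self) -> None:
        self.cleaned = True
    def linger(self, session_id: str) -> None:
        self.cleanup()

class PoolSandbox(BaseSandbox):
    """Pool backend: keep the pod alive, annotated with the session ID."""
    def linger(self, session_id: str) -> None:
        self.lingering = session_id   # stand-in for annotating the pod
```

The runner only ever writes `sb.linger(session_id)`; which behavior fires is decided by the backend, not by an isinstance ladder.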

GCP SA key over Workload Identity Federation

  • Chosen: GCP service account key stored in Vault, rotated via cki-tools credential manager (prepare/switch/clean cycle).
  • Alternative: Workload Identity Federation (WIF) with projected SA token. Two sub-options: (a) automatic OIDC discovery if the cluster issuer is public, (b) manual JWKS upload for internal issuers.
  • Why: mpp-prod’s OIDC issuer is https://kubernetes.default.svc (internal, not publicly reachable), so GCP STS cannot discover it automatically. Manual JWKS upload works but requires re-upload after SRE-triggered signing key rotations. SA key integrates with the existing credential manager rotation infrastructure (same pattern as AWS keys and GitLab tokens), needs no OIDC reachability, and enables automated validate/update via google-auth in CI. The application code (VertexAuth / google.auth.default()) works identically with both approaches – switching to WIF later requires only infrastructure changes.
  • If reversed: Replace SA key with WIF projected token + credential config ConfigMap. The gcp_service_account_key token type in cki-tools would no longer be needed for this use case.

Sliding-window breakpoints over per-component breakpoints (Claude)

  • Chosen: Two breakpoints on messages (B1 at the previous write position, B2 at the latest message) that slide forward each turn.
  • Alternative: Separate breakpoints on system prompt, tools, and messages (using 3-4 of the 4 available slots).
  • Why: The Anthropic prefix hash is cumulative – it covers everything from the start of the request (tools, system, messages) up to the breakpoint. A single breakpoint on a message already caches the entire prefix. Separate breakpoints on earlier components would be redundant and waste slots. Two sliding breakpoints cover the full conversation history with cache reads on every turn after the first.
  • If reversed: Three breakpoint slots wasted on content already covered by the message breakpoint. Only one slot left for the sliding window, making it impossible to have both a read (B1) and write (B2) breakpoint on messages.

Developer+ trust filtering over unfiltered or sanitized discussion comments

  • Chosen: gitlab_get_mr_discussions checks each note author’s project access level (Developer+ / >= 30). Untrusted notes are replaced with a fixed placeholder in mixed-trust discussions, or the entire discussion is dropped if all notes are untrusted. Agent notes are always trusted (detected by session marker) but have transcripts and metadata stripped.
  • Alternatives considered:
    • (a) Include all comments unfiltered. Simplest, but any external contributor can craft comments that manipulate the model’s review output (prompt injection).
    • (b) Sanitize/escape untrusted content (strip markdown, quote as code blocks, prefix with “[external]”). There is no reliable escaping mechanism for LLM prompts – the model interprets natural language regardless of formatting. Escaping gives a false sense of security.
    • (c) Only include agent-authored notes (skip all human comments). Safe, but defeats the purpose: the model would never see developer responses to its own findings.
    • (d) Include only discussions where the agent participated. Better, but still misses developer-initiated review threads that provide relevant context.
  • Why: Developer+ is the same threshold used for pipeline trigger authorization (invariant #7) and slash command access. It matches the trust boundary already established: people who can push code and approve MRs are trusted to provide review context. The placeholder approach preserves discussion structure (the model sees that someone replied) without exposing the content. Full-drop for all-untrusted discussions avoids noise from discussions that contain zero useful context.
  • If reversed: (a) opens a prompt injection vector on any project that accepts external MRs. (b) provides no actual protection. (c) makes follow-up reviews unable to see developer explanations, causing repeated false positives. (d) misses developer-initiated context.

Bot-author filtering for session marker parsing over unfiltered note scanning

  • Chosen: find_session_for_reply(), scan_agent_threads(), and the handle_note() self-filter all resolve the bot user ID via get_bot_user_id() (which calls gl.auth() on the orchestrator token) and only consider notes where author.id matches. In find_session_for_reply(), the first matching marker wins.
  • Alternative: Parse markers from any note and keep the last match; substring self-filter in handle_note (previous behavior).
  • Why: The previous unfiltered approach allowed session hijacking: a user reply or agent review prose containing an example marker (<!-- hummingbird-session: ... -->) was parsed as a real session, causing “session expired” errors in production. For handle_note, the substring check caused the reverse problem: a user embedding the marker prefix in a reply silently suppressed the event (denial of service). The gl.auth() call is one GET /user per invocation – negligible cost. _get_client_with_bot_id() returns both the authenticated client and the bot user ID, avoiding redundant client construction.
  • If reversed: Any MR participant can inject a marker to hijack the session ID or inflate rate-limit counts. The agent’s own review prose containing example markers causes spurious “session expired” messages.

ijson streaming over buffered JSON for K8s list responses

  • Chosen: session.get(stream=True) + ijson.parse(resp.raw) for iter_paginated(). Items are yielded one at a time via ObjectBuilder; the metadata.continue pagination token is captured in the same parse pass. resp.raw.decode_content = True is required because Kubearchive returns gzip Content-Encoding; without it ijson sees compressed bytes.
  • Alternative: session.get().json() loads the full page into memory (the original implementation).
  • Why: A single K8s list page can contain 500 PipelineRuns (80+ MB JSON). Parsing this with .json() creates a +247 MB RSS spike (raw bytes + decoded string + parsed dict coexist). With 4 concurrent workflows, this exceeds any reasonable pod memory limit. ijson stream-parsing yields items one at a time, reducing the peak to +12 MB for the same data – a 95% reduction. The continue token appears in the metadata object (before or after items depending on the server); the event-driven parser captures it regardless of order.
  • If reversed: Large MRs (100+ components) would OOM the agent pod. Concurrent workflows would multiply the problem. The pod memory limit would need to scale with the largest possible page size, which is unbounded.