Hummingbird Agent Design
Architectural design document for the hummingbird-agent. Covers the reasoning behind every major design choice so that future changes can be made safely, without accidentally violating invariants that hold the system together.
For operational usage (CLI, config reference, deployment), see Hummingbird Agent. For the model loop wire format, see Agent Model Loop.
1. Design Philosophy
Five principles shaped the agent’s architecture. Every component traces back to at least one of these.
Security by default. The agent processes untrusted merge requests. All LLM-driven commands run inside isolated sandbox containers with no network and no credentials. Orchestrator write tokens and model read tokens live in separate code paths that never cross. No cluster-admin is required.
Config over code. Investigation logic lives in markdown workflow files that become the LLM system prompt verbatim – changing the analysis strategy requires no code change and no redeployment. Operational settings (data sources, project allowlists, iteration limits) live in a YAML config file that is auditable and committable. Secrets are the only thing in environment variables.
Bounded cost. Every output path has a size cap. Large tool outputs are auto-spilled to sandbox files with only a preview returned to the model. Each session has both an iteration limit and a per-call context token limit with two-tier escalation (soft warning, then hard stop with tool disabling). The relationship between these constants is documented and centralized.
Partial over nothing. When some data is unavailable (expired pod logs, unreachable Konflux cluster, Testing Farm 404), the agent continues with whatever data it has and notes the gap in its output. A partial report is more valuable than a crash.
Model-agnostic agent loop. The agent loop (agent.py) does not inspect
the internal structure of the contents list. It appends raw_content from
model responses and make_tool_responses() output without looking inside.
Each model backend owns its wire format. This makes it possible to add new
model backends (Claude, GPT) without touching the agent loop.
2. System Architecture
2.1. Component overview
flowchart TB
subgraph input [Event Sources]
SQS["SQS Queue<br/>(gitlab::pipeline, gitlab::note)"]
CLI["CLI<br/>(--event / --event-file)"]
end
subgraph orchestrator [Orchestrator Process]
Events["events.py<br/>SQS consumer"]
Runner["runner.py<br/>Event routing"]
Agent["agent.py<br/>Model loop"]
Tools["tools.py<br/>Tool registry"]
Actions["actions.py<br/>GitLab notes"]
Sessions["sessions.py<br/>Persistence"]
WfConfig["workflow_config.py<br/>YAML config"]
end
subgraph models [Model Backends]
Gemini["models/gemini.py<br/>Gemini API / Vertex AI"]
Anthropic["models/anthropic.py<br/>Anthropic via Vertex AI"]
end
subgraph sandbox [Sandbox Container]
SB["PodmanSandbox / K8sSandbox<br/>jq, python3, yq"]
SpillFiles["/tmp/data/_out/ spill files"]
DataFiles["/tmp/data/ working files"]
end
subgraph dataSources [Data Source Modules]
GL["gitlab.py"]
KX["konflux.py"]
TF["testing_farm.py"]
end
subgraph external [External APIs]
GitLabAPI["GitLab API"]
KonfluxAPI["K8s + Kubearchive"]
TFAPI["Testing Farm API"]
end
subgraph storage [Storage]
S3["S3 Sessions"]
MRNote["GitLab MR Notes"]
end
SQS --> Events --> Runner
CLI --> Runner
Runner --> Agent
Agent --> Tools
Tools --> SB
Tools --> dataSources
dataSources --> external
Agent --> Gemini
Agent --> Anthropic
Runner --> Actions --> MRNote
Runner --> Sessions --> S3
Runner --> WfConfig
2.2. Request lifecycle
A complete run proceeds through these stages:
- Event arrival. An SQS message (production) or CLI invocation (dev) provides a GitLab project path and MR IID.
- Config lookup. workflow_config.py maps the project to one or more workflows via the project_index. Each match yields a WorkflowConfig and ProjectEntry with data source declarations and per-project settings.
- Rate-limit check and SHA dedup. actions.scan_agent_threads() scans MR discussions in a single pass, parsing JSON session markers to determine the per-workflow thread count and whether the current SHA has already been reviewed. If the SHA was already reviewed, the workflow is skipped. If the thread count meets or exceeds max_runs_per_mr, the workflow is skipped. Slash commands and replies bypass this check.
- Placeholder note. actions.create_placeholder_note() posts a placeholder so the reviewer knows analysis is in progress. The note contains a JSON session marker (<!-- hummingbird-session: {"id":"UUID","wf":"name","sha":"abc"} -->) for rate limiting, SHA dedup, and session resumption.
- Sandbox start. sandbox.create_sandbox() starts a Podman container or K8s pod. /tmp/data/ is pre-created for the model’s use.
- Data source registration. data_sources.register_selected() resolves tokens and URLs from the config and registers tool functions on the ToolRegistry. Only data sources declared in the workflow config are registered.
- Agent loop. agent.run_agent_loop() runs the model loop: the workflow markdown becomes the system prompt, tool definitions are provided, and the model iterates calling tools and producing text until it emits a final response or hits a budget limit.
- Note update. The final text, wrapped with a session marker and reply prompt, replaces the placeholder note.
- Session save. Conversation history (contents), transcript, and sandbox archive are saved to S3 (production) or a local directory (dev).
- Sandbox cleanup. The container/pod is deleted. K8s pods also have a configurable activeDeadlineSeconds backstop (default 1800s) in case the orchestrator dies.
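The rate-limit and dedup stage above can be sketched as a small gate function. This is an illustrative sketch only: GateResult and workflow_gate are hypothetical names, not the real hummingbird-agent API.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    skipped: bool = False
    reason: str = ""

def workflow_gate(sha: str, reviewed_shas: set, thread_count: int,
                  max_runs_per_mr: int, is_slash_command: bool) -> GateResult:
    """Stage 3 sketch: SHA dedup and per-workflow rate limit."""
    if is_slash_command:                     # slash commands bypass the gate
        return GateResult()
    if sha in reviewed_shas:                 # SHA dedup
        return GateResult(skipped=True, reason="sha already reviewed")
    if thread_count >= max_runs_per_mr:      # per-workflow rate limit
        return GateResult(skipped=True, reason="max_runs_per_mr reached")
    return GateResult()
```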
2.3. Architectural boundaries
The codebase is organized around four strict boundaries:
Runner (runner.py) is the orchestration layer. It owns event routing,
the placeholder/update note lifecycle, session save/load, and sandbox
lifecycle. It calls agent.run_agent_loop() but never reaches into the
agent’s internals.
Agent (agent.py) is the model loop. It knows about the model interface,
tool definitions, and the contents list, but nothing about GitLab, SQS,
sessions, or actions. It returns (text, usage, transcript, contents) and
is completely unaware of what happens with those values.
Tools (tools.py) bridge the agent and the sandbox/data-sources. The
agent calls tool_registry.execute(tool_call) and gets a dict back. It
never calls sandbox methods directly. This indirection is what enables
auto-spill: the tool registry can transparently save large outputs to
sandbox files and return previews.
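The auto-spill indirection can be sketched as follows. The constant value, the filename pattern, and the function name are illustrative assumptions; only the shape of the behavior (spill large output, return a preview plus a path) comes from the text above.

```python
OUTPUT_PREVIEW_BYTES = 2048  # illustrative stand-in for the config.py constant

def maybe_spill(output: bytes, spill_counter: list, write_file) -> dict:
    """Sketch of auto-spill: large tool output is written to a sandbox
    file and only a preview is returned to the model. write_file stands
    in for the sandbox's write_file() method."""
    if len(output) <= OUTPUT_PREVIEW_BYTES:
        return {"output": output.decode(errors="replace")}
    spill_counter[0] += 1  # shared counter prevents filename collisions
    path = f"/tmp/data/_out/spill_{spill_counter[0]:03d}.txt"
    write_file(path, output)
    return {"output_file": path,
            "preview": output[:OUTPUT_PREVIEW_BYTES].decode(errors="replace")}
```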
Models (models/) own wire format conversion. Each model adapter’s
generate() accepts internal types (contents, ToolDef) and returns a
ModelResponse. make_user_content() and make_tool_responses() produce
the model-specific dicts that go into contents (Gemini and Anthropic
backends each implement the ModelAdapter protocol). The agent treats these
as opaque values – it appends them but never inspects their internal structure.
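The opacity rule can be illustrated with a single loop turn against a fake adapter. The fake classes below are toy stand-ins with simplified shapes, not the real ModelAdapter protocol.

```python
from dataclasses import dataclass, field

@dataclass
class FakeResponse:
    raw_content: dict
    tool_calls: list = field(default_factory=list)

class FakeModel:
    def generate(self, contents):
        return FakeResponse(raw_content={"role": "model", "parts": ["done"]})
    def make_tool_responses(self, tool_calls, results):
        return {"role": "tool", "results": results}

def agent_turn(model, contents, execute):
    """One loop turn under the opacity rule: values produced by the
    adapter are appended to contents but never inspected."""
    response = model.generate(contents)
    contents.append(response.raw_content)          # opaque append
    if response.tool_calls:
        results = [execute(tc) for tc in response.tool_calls]
        contents.append(model.make_tool_responses(response.tool_calls, results))
    return response
```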
3. Module Architecture
3.1. Dependency graph
flowchart TD
main["__main__.py"]
config_mod["config.py"]
wf_config["workflow_config.py"]
events_mod["events.py"]
runner_mod["runner.py"]
agent_mod["agent.py"]
workflow_mod["workflow.py"]
tools_mod["tools.py"]
sandbox_mod["sandbox.py"]
actions_mod["actions.py"]
sessions_mod["sessions.py"]
transcript_mod["transcript.py"]
ds_init["data_sources/__init__.py"]
ds_gitlab["data_sources/gitlab.py"]
ds_konflux["data_sources/konflux.py"]
ds_tf["data_sources/testing_farm.py"]
http_mod["_http.py"]
config_watch_mod["config_watch.py"]
models_init["models/__init__.py"]
models_types["models/_types.py"]
models_gemini["models/gemini.py"]
models_anthropic["models/anthropic.py"]
main --> config_mod
main --> config_watch_mod
main --> wf_config
main --> events_mod
main --> runner_mod
main --> sessions_mod
main --> transcript_mod
main --> actions_mod
config_watch_mod --> config_mod
config_watch_mod --> wf_config
runner_mod --> actions_mod
runner_mod --> agent_mod
runner_mod --> config_mod
runner_mod --> ds_init
runner_mod --> sandbox_mod
runner_mod --> sessions_mod
runner_mod --> tools_mod
runner_mod --> workflow_mod
runner_mod --> wf_config
runner_mod --> transcript_mod
agent_mod --> config_mod
agent_mod --> models_init
agent_mod --> tools_mod
tools_mod --> config_mod
tools_mod --> models_init
tools_mod --> sandbox_mod
workflow_mod --> config_mod
workflow_mod --> models_init
ds_init --> tools_mod
ds_init --> wf_config
ds_init --> ds_gitlab
ds_init --> ds_konflux
ds_init --> ds_tf
ds_konflux --> http_mod
sessions_mod --> sandbox_mod
models_init --> models_types
models_init --> models_gemini
models_init --> models_anthropic
models_gemini --> http_mod
models_gemini --> models_types
models_anthropic --> http_mod
models_anthropic --> models_types
3.2. Module responsibilities
Each module has a single, well-defined responsibility. The boundary rules below are invariants – violating them breaks the security model or the separation of concerns.
config.py – Runtime configuration and constants
- Owns: Config dataclass, estimate_cost(), all tuning constants (DEFAULT_MAX_ITERATIONS, CONTEXT_LIMIT, OUTPUT_PREVIEW_BYTES, etc.)
- Boundary: Pure data. No I/O except reading env vars in Config.load(). No imports from other hummingbird modules.
- Invariant: All interdependent budget/spill constants must be defined here with their relationship documented in the comment block.
workflow_config.py – YAML config loader and token resolution
- Owns: AgentConfig, WorkflowConfig (with trigger, description, ignore_users, ignore_branches fields), ProjectEntry, DataSourceConfig dataclasses; load(), validate_env(), resolve_tool_token(), resolve_cluster_url(), get_prompt().
- Boundary: Reads YAML from disk and env vars. Never instantiates network clients or model objects.
- Invariant: resolve_tool_token() resolves model tool tokens only. It must never return orchestrator tokens. The resolution chain is: project_entry.tokens[ds] -> workflow_cfg.data_sources[ds].token_env -> "" (empty).
events.py – SQS consumer
- Owns: decode_sns_message() (SNS envelope decoding), poll_loop() (blocking SQS consumer with concurrency control).
- Boundary: Knows about SQS/SNS wire formats. Calls a generic EventHandler callback. Does not know about GitLab, workflows, or the agent.
- Invariant: Failed messages are not deleted from SQS (they go to the DLQ after the visibility timeout expires). Successful messages are deleted after the handler returns.
config_watch.py – Background config hot-reload
- Owns: ConfigHolder (thread-safe config pair with atomic swap), start_watcher() (daemon thread that polls the config file mtime), _watch_loop().
- Boundary: Knows about config.Config.load() and workflow_config.load(). Does not know about events, the agent, or any runtime state.
- Invariant: On reload failure (parse error, missing env var), the previous config is kept and a warning is logged. The watcher never crashes the serve loop.
runner.py – Event routing and workflow execution
- Owns: handle_event(), handle_pipeline(), handle_merge_request(), handle_note(), _execute_workflow(), _handle_reply(), run_workflow(), _acquire_sandbox(), _resolve_slash_workflow(), _format_help(), _is_ignored_user(), WorkflowRequest, WorkflowResult, SandboxOpts.
- Boundary: Orchestrates everything: config lookup, rate limiting, placeholder notes, sandbox lifecycle, agent invocation, session save. This is the only module that touches both actions and agent.
- Invariant: run_workflow() either lingers the sandbox (success) or cleans it up (exception).
agent.py – Model-agnostic tool-calling loop
- Owns: run_agent_loop(), system prompt constants (BASE_SYSTEM_PROMPT, CONTINUATION_PROMPT, warning/nudge strings), budget logic.
- Boundary: Knows about models (the ModelAdapter protocol: generate() and content construction) and tools (for execute()). Does not know about GitLab, SQS, sessions, or actions. Does not know which model backend is in use.
- Invariant: The contents list is treated as opaque. Items are appended via response.raw_content and model.make_tool_responses(). The agent never inspects, modifies, or deletes items in the list (except for empty-response retries, which pop() the last appended item before the model has seen any tool results for it).
tools.py – Tool registry
- Owns: ToolRegistry class with sandbox_exec, fetch_to_sandbox, fetch_batch_to_sandbox built-in tools; data source registration and dispatch; auto-spill logic.
- Boundary: Owns the sandbox reference and all tool execution. The agent never touches the sandbox directly.
- Invariant: _spill_counter is shared across all spill paths (sandbox exec stdout, sandbox exec stderr, data source auto-spill) to prevent filename collisions in /tmp/data/_out/.
sandbox.py – Sandbox backends
- Owns: Sandbox protocol, PodmanSandbox, K8sSandbox, K8sPoolSandbox, SandboxPool, create_sandbox(), ExecResult.
- Boundary: Knows about container/pod lifecycle and command execution. Does not know about tools, models, or the agent.
- Invariant: All backends must implement the same 8-method protocol (start, exec, write_file, stream_to_file, stream_exec, read_file_iter, cleanup, linger). exec() accepts optional stdin_data for piping raw bytes. All must pre-create /tmp/data/ in start(). All must use stdin piping for write_file() (never host volume mounts).
actions.py – GitLab note lifecycle
- Owns: Orchestrator token resolution (resolve_orchestrator_token()), note CRUD, JSON marker parsing (marker_tag, parse_marker), scan_agent_threads (per-workflow rate limiting + SHA dedup), post_simple_reply, member access checks, session-for-reply lookup, AgentThreadInfo, SessionRef.
- Boundary: Uses orchestrator tokens exclusively. Never touches model tool tokens, the tool registry, or the agent.
- Invariant: Orchestrator tokens are resolved from ORCHESTRATOR_* env vars only, never from the YAML config. The ORCHESTRATOR_ prefix makes them structurally impossible to confuse with model tokens.
sessions.py – Session persistence
- Owns: archive_sandbox(), save_local(), save_s3(), load_s3(), load_local(), restore_sandbox(), SessionData (including format_version for canonical vs legacy session JSON).
- Boundary: Knows about the sandbox (for archiving) and S3/filesystem (for storage). Does not know about the agent, tools, or models. Load/save detects the canonical envelope (format_version: 1) vs legacy raw contents lists.
- Invariant: The sandbox archive is never extracted on the orchestrator. It is created inside the sandbox (tar czf), streamed out via read_file_iter(), and restored inside a new sandbox via stream_exec("tar xzf -").
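The no-extract invariant can be sketched with toy sandboxes. FakeSandbox and transfer_session are illustrative stand-ins; the method names follow the 8-method protocol above, but everything else is assumed.

```python
class FakeSandbox:
    """Toy sandbox implementing the three protocol methods the archive
    path uses; real backends run these inside a container or pod."""
    def __init__(self):
        self.files = {}
        self.received = b""
    def exec(self, command):
        # pretend "tar czf" ran inside the sandbox
        self.files["/tmp/session.tgz"] = b"\x1f\x8bfake-archive-bytes"
    def read_file_iter(self, path):
        yield self.files[path]
    def stream_exec(self, command, stdin):
        # pretend "tar xzf -" ran inside the new sandbox
        self.received = b"".join(stdin)

def transfer_session(src, dst):
    src.exec("tar czf /tmp/session.tgz -C /tmp data")
    chunks = src.read_file_iter("/tmp/session.tgz")
    # The orchestrator only forwards bytes; it never opens the archive.
    dst.stream_exec("tar xzf - -C /tmp", stdin=chunks)
```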
transcript.py – Transcript rendering
- Owns: render_markdown(), per-tool rendering functions, truncation.
- Boundary: Pure transformation from a TranscriptEntry list to markdown. No I/O, no side effects.
data_sources/__init__.py – Data source registration
- Owns: register_selected() – resolves tokens/URLs from config and calls each data source’s register() function.
- Boundary: Bridges workflow_config (for token/URL resolution) and tools (for registration). Only registers data sources declared in the workflow config.
data_sources/gitlab.py, konflux.py, testing_farm.py
- Own: a register() function that creates tool definitions and closures over credentials, and registers them on the ToolRegistry.
- Boundary: Each module talks to one external API. Credentials are captured at registration time via closure, never stored globally.
models/_types.py – Shared data classes
- Owns: ModelError, ToolDef, ToolCall, Usage, ModelResponse, TranscriptEntry; the ModelAdapter protocol (implemented by model backends).
- Boundary: Pure data. No imports from other hummingbird modules.
_http.py – Shared HTTP infrastructure
- Owns: new_session() (requests session with retry adapter and configurable 429 handling), VertexAuth (Google ADC credentials).
- Boundary: Used by model backends (models/gemini.py) and data sources (data_sources/konflux.py). No model-specific or data-source-specific logic.
models/gemini.py – Gemini model adapter
- Owns: GeminiModel with generate(), make_user_content(), make_tool_responses(), to_canonical(), from_canonical().
- Boundary: Translates between internal types and the Gemini REST API. Supports API key (direct) and Vertex AI (ADC) authentication modes.
- Note: Parses cachedContentTokenCount from Gemini’s implicit server-side caching into Usage.cache_read_tokens.
models/anthropic.py – Anthropic model adapter (Vertex AI)
- Owns: AnthropicVertexModel with generate(), make_user_content(), make_tool_responses(), to_canonical(), from_canonical().
- Boundary: Translates between internal types and the Anthropic Messages API via Vertex AI rawPredict.
- Note: make_tool_responses() uses _last_tool_ids instance state to pair tool results with tool-use IDs.
- Note: generate() merges consecutive user messages before sending (Anthropic requires strict user/assistant alternation).
- Note: generate() annotates messages with sliding-window cache breakpoints (cache_control) on shallow copies to avoid mutating contents. The ephemeral parameter skips cache writes for transient warning messages.
4. Security Model
The agent runs in a shared OpenShift cluster without cluster-admin access, processing merge requests from repositories where external contributors can submit code. The security model addresses two threat vectors: (1) the LLM executing arbitrary commands chosen by an attacker-controlled MR, and (2) credential leakage between the model, the sandbox, and orchestrator actions.
4.1. Sandbox isolation
All commands generated by the LLM run inside an ephemeral container, never on the orchestrator host. The sandbox has no credentials, no network, and no visibility into the orchestrator.
flowchart LR
subgraph orchestrator [Orchestrator]
Agent["Agent Loop"]
Creds["Secrets<br/>(tokens, kubeconfig)"]
end
subgraph sandbox [Sandbox Container]
Shell["sh -c commands"]
Files["/tmp/data/<br/>(incl. _out/ spill dir)"]
end
Agent -->|"stdin pipe<br/>(write_file)"| sandbox
Agent -->|"exec command"| sandbox
sandbox -->|"stdout/stderr"| Agent
sandbox -.-x|"NO network"| Internet["Internet / K8s API"]
sandbox -.-x|"NO access"| Creds
Podman (local development):
- --network=none – complete network isolation; curl, wget, pip install all fail
- --user 65532 – fixed non-root UID; no privilege escalation
- No host volume mounts – data enters only via stdin piping through write_file()
Kubernetes (production):
Every field in the pod manifest is set explicitly for portability to vanilla
Kubernetes with Pod Security Admission (restricted level), not just reliance
on OpenShift’s restricted-v2 SCC admission:
- automountServiceAccountToken: false – no K8s API access from sandbox
- runAsNonRoot: true – enforced at pod level
- seccompProfile: RuntimeDefault – required by restricted level
- allowPrivilegeEscalation: false – container level
- capabilities.drop: ["ALL"] – container level
- activeDeadlineSeconds: 1800 – pod self-terminates after 30 minutes even if the orchestrator crashes; prevents orphaned pods
- restartPolicy: Never – pod does not restart on failure
- A NetworkPolicy on the sandbox namespace blocks all egress from all pods in the namespace (podSelector: {}, egress: [])
The pod does not set runAsUser explicitly. On OpenShift, the SCC assigns a
UID from the namespace range. On vanilla K8s, the image’s USER directive is
used.
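The security-relevant fields above can be collected into a single spec fragment. The dict below is an illustrative sketch (field names match the K8s PodSpec API; the container name and overall shape are assumptions, not the real manifest):

```python
sandbox_pod_spec = {
    "automountServiceAccountToken": False,
    "restartPolicy": "Never",
    "activeDeadlineSeconds": 1800,   # backstop if the orchestrator dies
    "securityContext": {             # pod level
        "runAsNonRoot": True,
        "seccompProfile": {"type": "RuntimeDefault"},
    },
    "containers": [{
        "name": "sandbox",           # illustrative name
        "securityContext": {         # container level
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
```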
4.2. Credential separation (three-tier token model)
Tokens are split into three tiers with strict code-path separation:
flowchart TB
subgraph tier1 [Tier 1: Orchestrator Tokens]
OT["ORCHESTRATOR_GITLAB_TOKEN_*<br/>Write scope (api)<br/>Env vars only, never in YAML"]
end
subgraph tier2 [Tier 2: Model Tool Tokens]
MT["GITLAB_TOKEN_RO, etc.<br/>Read scope (read_api)<br/>Env var NAMES in YAML"]
end
subgraph tier3 [Tier 3: Sandbox]
SB["Zero credentials<br/>Zero network<br/>Zero SA token"]
end
OT -->|"used by"| Actions["actions.py<br/>(notes, rate limits)"]
MT -->|"used by"| DataSources["data_sources/<br/>(GitLab, Konflux, TF)"]
SB -->|"used by"| SandboxExec["sandbox_exec<br/>(sh -c commands)"]
Actions -.-x|"NEVER"| DataSources
Actions -.-x|"NEVER"| SandboxExec
DataSources -.-x|"NEVER"| Actions
DataSources -.-x|"NEVER"| SandboxExec
Tier 1 – Orchestrator tokens. Write-capable GitLab tokens
(api scope) used by actions.py for posting/editing notes, rate-limit
counting, and member access checks. Resolved by convention from env vars:
ORCHESTRATOR_GITLAB_TOKEN_<MANGLED_PROJECT> (per-project) or
ORCHESTRATOR_GITLAB_TOKEN (global fallback). The ORCHESTRATOR_ prefix is
a structural safeguard – these env var names can never appear in the YAML
config’s data_sources or tokens sections because those sections
reference model tool token names, which never start with ORCHESTRATOR_.
Tier 2 – Model tool tokens. Read-only tokens (read_api scope for
GitLab) used by data source modules during tool execution. Declared in the
YAML config as env var names (not values):
data_sources:
gitlab:
token_env: GITLAB_TOKEN_RO # name of the env var
Per-project overrides are possible:
projects:
redhat/hummingbird/containers:
tokens:
gitlab: GITLAB_TOKEN_CONTAINERS_RO # overrides token_env for this project
The resolution chain in resolve_tool_token() is:
project_entry.tokens[ds] -> workflow_cfg.data_sources[ds].token_env -> ""
Storing env var names (not values) in YAML means the config file is safe to commit and audit. Actual secret values live in environment variables, injected via K8s Secrets at deployment time.
Tier 3 – Sandbox. The sandbox container has zero credentials, zero
network access, and automountServiceAccountToken: false (no K8s API
access). Data enters the sandbox only via write_file() (stdin piping).
The model cannot instruct the sandbox to reach external APIs – it must use
the orchestrator’s data source tools.
4.3. Namespace separation
Sandbox pods run in a dedicated namespace (hummingbird--agent-sandbox),
separate from the orchestrator namespace (hummingbird--internal). This
limits blast radius: even if a sandbox pod is compromised, it has no
visibility into the orchestrator’s Secrets, Pods, or ServiceAccount tokens.
RBAC setup:
The orchestrator’s ServiceAccount gets a Role in the sandbox namespace (not its own namespace) granting only:
- pods: create, get, list, delete, patch – sandbox pod lifecycle and pool claims
- pods/exec: create – command execution via kubectl exec
No CRDs, no custom runtimes, no cluster-scoped resources. The orchestrator needs only namespace-scoped permissions, so it works with standard OpenShift RBAC without requesting cluster-admin.
Konflux data is fetched via bearer token from kubeconfig credentials, not from inside the cluster. The orchestrator’s ServiceAccount does not need access to Konflux namespaces.
NetworkPolicy:
A deny-all-egress NetworkPolicy in the sandbox namespace uses podSelector: {}
to match all pods and sets egress: []. The sandbox cannot reach the internet,
the K8s API, or other pods in the cluster.
4.4. Security invariants
These must hold for the security model to be effective. Any change that violates one of these is a security regression:
1. Orchestrator tokens must NEVER flow into ToolRegistry, data_sources, or model contents. They are resolved in actions.py only.
2. Model tool tokens must NEVER flow into actions.py. They are resolved in workflow_config.py and consumed in data_sources/.
3. The sandbox must NEVER have network access or credentials. No host mounts, no SA token, no env vars with secrets.
4. The sandbox archive is NEVER extracted on the orchestrator. It is created inside one sandbox and restored inside another. The orchestrator only transports the bytes.
5. No cluster-admin required. Only namespace-scoped resources (Role, RoleBinding, Pod, NetworkPolicy) are used.
6. Agent-generated notes are skipped on re-processing. Notes containing SESSION_MARKER_PREFIX are filtered out in handle_note() to prevent infinite loops.
7. All auto-triggers require Developer access. handle_pipeline(), handle_merge_request(), and handle_note() each check check_member_access() on the event’s user before executing a workflow. Events from non-developers are silently skipped (or replied to with an access-denied message for slash commands). This ensures that the target MR’s work products – its diff, description, CI logs, and commit messages – originate from a trusted author. Fork MRs from external contributors are blocked because the MR author lacks Developer+ on the target project.
   Scope limitation: this gate only covers the target MR. Once the agent is running, the model controls tool arguments and can direct tools at content beyond the target MR – other MRs, other refs, other job IDs, even other projects reachable by the read-only token. The current mitigations are: (a) workflow prompts instruct the model to use the event’s {project, iid, sha}, (b) the model would need to be manipulated via prompt injection from already-trusted content to deviate, (c) the sandbox prevents the model from acting on manipulated reasoning beyond producing text output, and (d) the token is read-only with minimal scope. However, if a data source tool is added that returns user-authored prose from arbitrary resources (e.g. issue bodies, wiki pages), it should apply per-author trust filtering (invariant #9).
8. Data source tools that accept URLs must validate them against an allowlist. tf_get_test_log restricts URLs to the Testing Farm artifacts prefix to prevent the LLM from directing the orchestrator to fetch arbitrary URLs (SSRF).
9. Third-party commentary entering the model prompt must be trust-filtered. Data sources that feed text from users other than the event trigger into contents (discussion comments, issue bodies) must gate on project membership at Developer+ level per author. Content from untrusted authors must be redacted to a fixed placeholder, never sanitized or escaped – there is no reliable way to escape adversarial text for an LLM prompt. Discussions where all notes are untrusted must be dropped entirely. See §9.1 for the reference implementation in gitlab_get_mr_discussions.
   This invariant covers commentary (what other people said about the MR), not work products (the diff, CI logs, commit messages). Work products are inherently the input to the agent’s analysis and cannot be content-filtered without defeating the agent’s purpose. For the target MR, their trust comes from invariant #7 (the event trigger is Developer+). For content the model fetches beyond the target MR, trust depends on the tool: discussion tools must filter per-author (#9), while code/log/metadata tools rely on the mitigations described in #7.
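The redact-or-drop rule for third-party commentary can be sketched as a small filter. The placeholder string, note shape, and function name are illustrative assumptions; the real reference implementation lives in gitlab_get_mr_discussions.

```python
REDACTED = "[redacted: comment from a non-member]"  # illustrative placeholder

def filter_discussion(notes, is_trusted):
    """Sketch of the trust-filter invariant: redact (never escape) notes
    from untrusted authors, and drop discussions with no trusted note at
    all. is_trusted is a Developer+ membership predicate per author."""
    if not any(is_trusted(n["author"]) for n in notes):
        return None  # wholly-untrusted discussion: dropped entirely
    return [n if is_trusted(n["author"]) else {**n, "body": REDACTED}
            for n in notes]
```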
5. Configuration System
5.1. Two sources, strict separation
Configuration comes from exactly two sources with no overlap:
- YAML config file (CONFIG_PATH): operational settings, workflow definitions, project allowlists, data source declarations, and token env var names. This file is safe to commit, review, and audit.
- Environment variables: secrets only (API keys, GitLab tokens, kubeconfig paths). These are injected at deployment time via K8s Secrets.
This split exists because YAML provides structure, validation, and audit trails, while secrets must stay out of version control.
5.2. Config loading
Two config objects are built at startup:
Config (from config.py): loaded by Config.load(settings), where
settings is the settings: section from the YAML file. Auth/bootstrap
fields come from env vars (GOOGLE_API_KEY, GOOGLE_CLOUD_PROJECT, etc.).
Operational fields come from the settings dict with sensible defaults.
AgentConfig (from workflow_config.py): loaded by
workflow_config.load(path). Contains all workflow definitions, project
entries, and a pre-built project_index.
The two are kept separate because Config is needed everywhere (model
construction, sandbox creation, session storage), while AgentConfig is
only needed for event routing and data source registration.
5.3. YAML config structure
settings:
gitlab_url: https://gitlab.com # operational, not a secret
sandbox:
image: quay.io/.../image:tag
namespace: my-namespace # K8s only
model: gemini-3.1-pro-preview
# model: claude-sonnet-4-5@20250929 # Anthropic via Vertex AI (alternative)
max_iterations: 30
max_runs_per_mr: 5
internal_notes: true
max_concurrent_agents: 4
sqs_queue_url: "" # empty = no SQS (local dev)
s3_session_bucket: "" # empty = no S3 (local dev)
trigger_prefix: /hummingbird # slash command prefix
session_marker_prefix: hummingbird-session # HTML comment marker ID
auto_trigger: true # auto-run on failed pipelines
workflows:
analyze-failures:
prompt: workflows/analyze-failures.md # relative to config file dir
action: post_gitlab_note
model: gemini-3.1-pro-preview # per-workflow override
max_iterations: 50 # per-workflow override
data_sources:
gitlab:
token_env: GITLAB_TOKEN_RO # env var name, not the value
konflux:
cluster_url: https://example.com:6443/ns/my-tenant
kubeconfig_env: KUBECONFIG
kubearchive_url: https://kubearchive-api-server-product-kubearchive.apps.example.com
testing_farm: {}
projects:
redhat/hummingbird/containers:
tokens: # per-project token overrides
gitlab: GITLAB_TOKEN_CONTAINERS_RO
Design choices in this structure:
Workflow-first organization. Each workflow owns its project list, not the other way around. This scopes data source permissions per workflow-project pair. A future code-review workflow can have different GitLab tokens (with different scopes) than the analyze-failures workflow, with no ambiguity.
Token env var names in YAML (not values). The YAML file is committed to
the repo. It contains token_env: GITLAB_TOKEN_RO (a name), not the actual
token. The actual secret value is resolved at runtime via
os.environ.get(env_var). This allows the config to be reviewed and
audited without exposing secrets.
Inline cluster_url only. The Konflux cluster URL is operational
configuration (it identifies which cluster to talk to), not a secret.
Putting it inline in YAML makes it visible and auditable. The URL must
include the namespace path (e.g. /ns/my-tenant).
5.4. Project index
At load time, _build_project_index() constructs a reverse lookup:
project_index: dict[str, list[tuple[str, WorkflowConfig, ProjectEntry]]]
# e.g. {"redhat/hummingbird/containers": [("analyze-failures", wf_cfg, proj_entry)]}
This provides O(1) lookup when a pipeline event arrives with a project
path. A single project can appear in multiple workflows (e.g. both
analyze-failures and a future code-review), and each matching workflow
will be triggered independently.
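The reverse-lookup construction can be sketched as follows. The input shape is simplified (workflow name mapping to a dict of project entries); the real _build_project_index works on WorkflowConfig objects.

```python
from collections import defaultdict

def build_project_index(workflows):
    """Sketch of the reverse lookup: project path -> list of
    (workflow name, project entry) matches."""
    index = defaultdict(list)
    for wf_name, projects in workflows.items():
        for project_path, entry in projects.items():
            index[project_path].append((wf_name, entry))
    return dict(index)
```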
5.5. Token resolution chains
Model tool tokens (for data source API calls):
resolve_tool_token(project_entry, ds_name, workflow_cfg):
1. project_entry.tokens[ds_name] -> per-project override
2. workflow_cfg.data_sources[ds_name].token_env -> workflow default
3. "" (empty string) -> no token
Each step resolves the env var NAME, then reads os.environ[name].
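The chain above can be sketched in a few lines. The argument types are simplified stand-ins for the real ProjectEntry/DataSourceConfig objects:

```python
import os

def resolve_tool_token(project_tokens, ds_name, workflow_token_env):
    """Sketch of the chain: per-project override, then the workflow
    default, then empty. Resolves the env var NAME, then reads its value."""
    env_name = project_tokens.get(ds_name) or workflow_token_env
    return os.environ.get(env_name, "") if env_name else ""
```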
Orchestrator tokens (for GitLab note operations):
resolve_orchestrator_token(project_path):
1. ORCHESTRATOR_GITLAB_TOKEN_<MANGLED_PROJECT> -> per-project
2. ORCHESTRATOR_GITLAB_TOKEN -> global fallback
Mangling: "/" -> "_", "-" -> "_", uppercase.
e.g. "redhat/hummingbird/containers" -> ORCHESTRATOR_GITLAB_TOKEN_REDHAT_HUMMINGBIRD_CONTAINERS
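The mangling convention can be sketched directly from the rules above (actions.py owns the real logic; this is an illustrative reimplementation):

```python
import os

def resolve_orchestrator_token(project_path):
    """Per-project env var first, then the global fallback."""
    mangled = project_path.replace("/", "_").replace("-", "_").upper()
    return (os.environ.get(f"ORCHESTRATOR_GITLAB_TOKEN_{mangled}")
            or os.environ.get("ORCHESTRATOR_GITLAB_TOKEN", ""))
```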
Cluster URL (for Konflux):
resolve_cluster_url(ds_cfg):
-> ds_cfg.cluster_url (inline value in YAML)
5.6. Environment validation
validate_env(agent_cfg) checks at startup that all referenced env vars
exist. It walks every workflow’s data sources and project token overrides,
collecting missing vars into a single error message. This catches
configuration errors early instead of failing mid-run when a specific
data source is first used.
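The collect-then-fail behavior can be sketched as follows (the real validate_env walks config objects; this sketch takes a flat set of required names):

```python
import os

def validate_env(required_names):
    """Collect every missing env var into a single startup error rather
    than failing one at a time mid-run."""
    missing = sorted(n for n in required_names if n not in os.environ)
    if missing:
        raise RuntimeError("missing env vars: " + ", ".join(missing))
```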
6. Agent Loop Design
For the wire-format walkthrough (what bytes go to the model API, what comes back), see Agent Model Loop. This section covers the design rationale behind the loop.
6.1. Full history replay
The Gemini and Anthropic APIs are stateless. Every call sends the complete contents list
from the beginning of the conversation. This means every large tool output
sitting in history inflates every subsequent API call.
This property is fundamental to why auto-spill exists (section 7). Without
auto-spill, a single cat of a 32KB file early in the conversation adds
~8K tokens to every remaining API call. Over a 15-iteration run, that is
~120K wasted tokens.
The alternative – conversation compaction (replacing old tool results with summaries) – was considered and deferred. Each provider has strict requirements about content structure (e.g. Gemini model turns must match the preceding tool turns; Anthropic enforces user/assistant alternation), and modifying history risks confusing the model or violating API constraints. Auto-spill solves 90% of the problem with none of the risk.
6.2. Budget model: iterations + context limit
The agent uses a dual-limit approach rather than a cumulative token budget:
Iteration limit (max_iterations, default 30 per workflow config).
Hard cap on the number of model round-trips. This is the primary cost
control lever. With auto-spill keeping per-call context bounded, iteration
count is roughly proportional to cost.
Per-call context limit (CONTEXT_LIMIT, default 60,000 tokens).
Checked after each API call using response.usage.input_tokens. This is
a safety net for cases where auto-spill is not sufficient (e.g., many
small tool results that individually stay under the spill threshold but
cumulatively fill the context).
Why not a cumulative token budget? Because with full history replay, each API call re-sends everything. “Cumulative billed tokens” double-counts: call 1 sends 5K, call 2 sends 10K (including the 5K again), so billed total is 15K but actual new content is only 10K. Iteration count is a simpler and more predictable proxy for cost.
6.3. Two-tier budget escalation
Both limits use the same escalation pattern:
flowchart LR
Normal["Normal<br/>tools available"]
Soft["Soft Warning (80%)<br/>ITERATION_WARNING or<br/>CONTEXT_WARNING<br/>tools still available"]
Hard["Hard Stop (100%)<br/>FINAL_TURN_WARNING<br/>toolConfig: NONE"]
Normal -->|"80% reached"| Soft
Soft -->|"100% reached"| Hard
Soft warning (80%). An ephemeral user message (ITERATION_WARNING) is
injected into contents for that turn only, then popped before the
response is persisted. Tools remain available so the model can finish
in-progress work.
Hard stop (100%). FINAL_TURN_WARNING is injected as an ephemeral
user message AND tool_defs is set to [] (empty list) to physically
prevent further tool calls. The model must produce text. This is more
reliable than disabling tools at the API layer (toolConfig.functionCallingConfig.mode: NONE
on Gemini, tool_choice: {"type": "none"} on Anthropic), which models
sometimes ignore (Gemini may return UNEXPECTED_TOOL_CALL).
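The shared pattern can be sketched as a small state function. The 80% threshold is the figure quoted above; the function name is illustrative, and the same logic applies to both the iteration and context budgets:

```python
SOFT_THRESHOLD = 0.8  # the 80% soft-warning point from the text

def budget_state(used: int, limit: int) -> str:
    """Return the escalation tier for a budget (iterations or tokens)."""
    if used >= limit:
        return "hard"    # inject FINAL_TURN_WARNING, set tool_defs = []
    if used >= limit * SOFT_THRESHOLD:
        return "soft"    # inject ITERATION_WARNING / CONTEXT_WARNING
    return "normal"      # no warning, tools available
```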
6.3.1 Ephemeral messages and Anthropic alternation
All per-turn warnings use ephemeral user messages: they are appended to
contents before the API call and popped immediately after. This keeps
the system prompt stable across all iterations and prevents warnings from
polluting the conversation history saved in sessions.
Anthropic requires strict user/assistant alternation. The Anthropic adapter’s
generate() merges consecutive user messages on a copy of contents before
sending the request, so ephemeral warnings (and other adjacent user turns) do
not break the API contract.
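A sketch of the merge step, assuming canonical-style message dicts whose content is a list of blocks. The names are illustrative, not the adapter's actual code; the key property is that it works on copies and never mutates the caller's contents:

```python
def merge_consecutive_user_turns(messages: list[dict]) -> list[dict]:
    """Collapse adjacent user messages so the request satisfies
    Anthropic's strict user/assistant alternation."""
    merged: list[dict] = []
    for msg in messages:
        if merged and msg["role"] == "user" and merged[-1]["role"] == "user":
            # Concatenate content blocks into a fresh message dict.
            merged[-1] = {"role": "user",
                          "content": merged[-1]["content"] + msg["content"]}
        else:
            merged.append(dict(msg))
    return merged
```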
6.4. Empty response handling
Models occasionally return empty responses (no text, no tool calls). The
agent retries up to MAX_EMPTY_RETRIES (2) times:
- Pop the empty response from contents. It adds nothing and may confuse the model on the next call.
- Inject EMPTY_RESPONSE_NUDGE as an ephemeral user message on the next turn: “Your previous response was empty. Please continue…”
- Continue the loop (consuming an iteration).
The nudge is delivered as an ephemeral user message (injected before the API call and popped after). This avoids mutating the system prompt and keeps the conversation history clean for session persistence.
The empty_retries counter resets to 0 after any successful iteration
(one where tool calls were executed). This means the model gets fresh
retries if it produces empty responses at different points in the
conversation.
MALFORMED_FUNCTION_CALL handling. Models sometimes return a
finishReason of MALFORMED_FUNCTION_CALL (Gemini) with no usable tool calls.
This is treated as a special case of empty response: the retry mechanism
kicks in, but the nudge is replaced with MALFORMED_CALL_NUDGE which
tells the model to retry with simpler arguments and avoid large text
payloads in tool call arguments.
6.5. Error recovery
Model API errors. _generate_with_retry() catches ModelError and
retries up to MODEL_RETRY_COUNT (4) times if retryable is True (HTTP
5xx and 429). Non-retryable errors (4xx, auth failures) fail immediately.
Retries use exponential backoff: MODEL_RETRY_BASE_DELAY * 2^attempt,
capped at MODEL_RETRY_MAX_DELAY (60s), giving delays of 5s, 10s, 20s,
40s. Transport-level 429 retry is disabled on the model’s HTTP session
(retry_429=False) so that rate-limit responses are handled at the model
retry layer with proper backoff instead of being silently retried by
urllib3.
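The backoff schedule follows directly from the constants. A sketch using the constant names and values quoted above:

```python
MODEL_RETRY_COUNT = 4
MODEL_RETRY_BASE_DELAY = 5.0   # seconds
MODEL_RETRY_MAX_DELAY = 60.0   # cap

def backoff_delay(attempt: int) -> float:
    """Delay before retry `attempt` (0-based): base * 2^attempt, capped."""
    return min(MODEL_RETRY_BASE_DELAY * 2 ** attempt, MODEL_RETRY_MAX_DELAY)
```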
Transport errors. Each model adapter’s generate() catches
requests.Timeout and requests.ConnectionError from the HTTP call and
wraps them as retryable ModelError. This makes timeouts (read and
connect) and connection failures subject to the same retry logic as HTTP
5xx/429. Timeout is caught before ConnectionError because
ConnectTimeout inherits from both. Note that the urllib3 retry adapter
does not retry POST requests (not idempotent), so transport errors from
model calls always propagate to our code.
Unexpected loop errors. run_agent_loop wraps the
_generate_with_retry call in a try/except Exception that logs and
breaks instead of propagating. This ensures the function always returns
partial results (accumulated contents, transcript, sandbox state)
even when an unexpected exception occurs (e.g. JSONDecodeError from a
truncated API response). The caller saves the session normally –
conversation history and sandbox archive are preserved for resumption.
This follows the “partial over nothing” principle: 32 iterations of work
are more valuable than a crash.
No failure-path session save. _execute_workflow does not save a
session when run_workflow raises. With the agent loop catching
unexpected errors, the failure path only fires for infrastructure errors
(sandbox start, config) where there is no useful state. Not saving
avoids overwriting a previous good session when a reply attempt fails.
Tool execution errors. _execute_tool_calls() wraps each
tool_registry.execute() in a try/except. Unhandled exceptions are caught
and returned to the model as {"error": "Tool X failed: ..."}. This
prevents a single broken tool from crashing the entire session – the model
sees the error and can adapt.
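A sketch of the wrapper, with an illustrative function name; the error shape matches the example in the text:

```python
def execute_tool_calls(registry, tool_calls: list[dict]) -> list[dict]:
    """Run each tool call; unhandled exceptions become structured
    error results instead of crashing the session."""
    results = []
    for call in tool_calls:
        try:
            results.append(registry.execute(call))
        except Exception as exc:  # broad on purpose: the model sees the error
            results.append({"error": f"Tool {call['name']} failed: {exc}"})
    return results
```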
Sandbox exec timeout. If a command exceeds EXEC_TIMEOUT (120s), the
TimeoutExpired exception is caught in the tool registry and returned as a
structured error dict. The model can retry with a different command or
proceed without the result.
6.6. Session resumption
When resuming from a previous session:
- initial_contents (the full contents list from the previous run) is prepended, followed by the user’s reply as a new user turn.
- CONTINUATION_PROMPT is appended to the system prompt.
- The sandbox is restored from the archived tarball.
- A fresh iteration budget starts from 0.
Saved sessions use a versioned JSON envelope (format_version: 1) whose
contents are in a canonical (OpenAI-style) message format, independent of
whether the run used Gemini or Anthropic. On resume, run_workflow converts
initial_contents into the active model’s native wire format: for
format_version >= 1, via model.from_canonical(); for legacy sessions
without a version, GeminiModel.to_canonical() migrates Gemini-native history
to canonical form first, then model.from_canonical() loads it into the
current backend. After a successful run, model.to_canonical() converts the
native contents back to canonical form before persistence.
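The version dispatch can be sketched as follows. legacy_migrate stands in for GeminiModel.to_canonical, and model is any adapter exposing from_canonical(); both names come from the text, but the function signature is invented for illustration:

```python
def load_history_for_model(session: dict, model, legacy_migrate) -> list:
    """Convert persisted history into the active model's native format.
    format_version >= 1 stores canonical messages; unversioned legacy
    sessions store Gemini-native history and are migrated first."""
    history = session["contents"]
    if session.get("format_version", 0) < 1:
        history = legacy_migrate(history)  # legacy Gemini-native sessions
    return model.from_canonical(history)
```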
CONTINUATION_PROMPT is critical. Without it, the workflow prompt (e.g.,
analyze-failures.md) tells the model to follow a rigid multi-phase
workflow: Data.1, Data.2, Analysis.1… The model would try to re-run the
entire analysis. The continuation prompt overrides this: “Do NOT re-run the
full workflow. Respond directly to the user’s question.”
The practical constraint on resumed sessions is the context window, not
iterations. The restored history from an 8-iteration cold start uses
~50-100K input tokens. Each new tool call adds more. The existing
CONTEXT_LIMIT check still applies and will force wrap-up if the context
grows too large.
Canonical session format (rationale)
Storing sessions in canonical form decouples persisted history from any one
provider’s JSON shape. The same saved thread can be resumed on a different
model adapter (including cross-provider migration) because conversion happens
at load and save boundaries only; the agent loop continues to treat native
contents as opaque between those steps.
6.7. Prompt caching
Since the agent replays the full conversation history on every API call (see 6.1), prompt caching reduces the cost of re-processing unchanged prefixes. The two model backends handle caching differently.
Gemini. Vertex AI caches prefixes implicitly on the server side. The
adapter parses cachedContentTokenCount from usageMetadata and reports it
as cache_read_tokens. No request-side changes are needed.
Claude. Vertex AI does not support Anthropic’s automatic caching
(which requires opt-in at the API level and is not yet available on Vertex).
The adapter uses explicit cache_control: {"type": "ephemeral"} annotations
on message content blocks. Up to 4 breakpoint slots are available per
request; the agent uses 2.
Why 2 breakpoints on messages, not 4 on system/tools/messages. The Anthropic prefix hash is cumulative: it covers everything from the start of the request (tools, system prompt, messages) up to the annotated block. A single breakpoint on a message already caches the entire prefix including tools and system. Separate breakpoints on earlier components would be redundant and waste slots. Using only 2 of the 4 slots leaves room for future use.
Why a sliding window. The conversation grows by 2 messages per turn (assistant response + user tool result). Two breakpoints slide forward in lockstep:
- B2 on the last message writes the full prefix to cache.
- B1 on the previous B2 position reads the prior prefix from cache.
Each call after the first gets a cache hit for the entire prefix minus the
newest turn. _prev_cache_index on the model instance tracks where B2 was
placed so that B1 can be positioned on the next call.
Why shallow copies. The contents list is owned by the agent loop and
persisted in sessions. _annotate_cache_breakpoints() creates a shallow
copy of the messages list and replaces only the B1/B2 entries with copies
that have cache_control injected. The originals are never mutated, so no
cleanup is needed after the API call and cache_control never leaks into
saved sessions.
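A simplified sketch of the breakpoint placement. For brevity it annotates whole messages rather than content blocks, so it is illustrative only; the shallow-copy discipline is the point:

```python
def annotate_cache_breakpoints(messages: list[dict], prev_index=None):
    """Place B2 on the last message (cache write) and B1 on the previous
    B2 position (cache read). Only copies are annotated; the caller's
    messages are never mutated."""
    out = list(messages)                      # shallow copy of the list
    b2 = len(out) - 1
    targets = {b2} if prev_index is None else {prev_index, b2}
    for idx in targets:
        annotated = dict(out[idx])            # copy only the touched entry
        annotated["cache_control"] = {"type": "ephemeral"}
        out[idx] = annotated
    return out, b2                            # b2 becomes _prev_cache_index
```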
Ephemeral message interaction. When the agent loop injects a transient
message (see 6.3.1), generate() receives ephemeral=True. B2 is not
placed (no cache write) so the transient content never enters the cache.
_prev_cache_index is not updated, so the next non-ephemeral call’s B1
still points to the last valid write and produces a cache hit. B1 is still
placed to provide a cache read for the stable prefix on the ephemeral call
itself.
Cost estimation. MODEL_PRICING in config.py stores a 4-tuple per
model prefix: (input, output, cache_read, cache_write) per million tokens.
estimate_cost() computes: uncached * input + cache_read_tokens * cache_read_rate + cache_creation_tokens * cache_write_rate + output * output_rate. Gemini has cache_write = 0 (implicit caching has no write
surcharge); Claude has non-zero cache_write (1.25x the base input rate for
explicit breakpoints).
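The formula can be sketched directly. The pricing numbers below are illustrative, not the real MODEL_PRICING table, and the sketch assumes (per the formula above) that uncached input is input_tokens minus cache_read_tokens:

```python
MODEL_PRICING = {
    # illustrative 4-tuple, $ per million tokens:
    # (input, output, cache_read, cache_write)
    "example-model": (3.0, 15.0, 0.3, 3.75),
}

def estimate_cost(model: str, usage: dict) -> float:
    inp, outp, cread, cwrite = MODEL_PRICING[model]
    uncached = usage["input_tokens"] - usage.get("cache_read_tokens", 0)
    return (uncached * inp
            + usage.get("cache_read_tokens", 0) * cread
            + usage.get("cache_creation_tokens", 0) * cwrite
            + usage["output_tokens"] * outp) / 1_000_000
```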
7. Tool System and Auto-Spill
7.1. Tool registry design
ToolRegistry is the single point of dispatch for all tool calls. The
agent calls registry.execute(tool_call) and gets a dict back. It never
calls sandbox methods or data source functions directly.
Three built-in tools are always available:
- sandbox_exec – runs sh -c <command> in the sandbox container. Returns {exit_code, stdout, stderr}, with auto-spill for large outputs.
- fetch_to_sandbox – calls a data source function and writes the result to a specified path in the sandbox. Returns metadata only (path, byte count, line count). Used when the model wants explicit path control.
- fetch_batch_to_sandbox – calls multiple fetch_to_sandbox in one tool call. Saves iterations vs. sequential calls (e.g., fetching both pipelineruns and taskruns in one round-trip).
Data source tools are registered dynamically per workflow config. Each data
source module provides a register() function that creates ToolDef
objects and closures over credentials, then calls
registry.register_data_source(tool_def, func, response_metadata).
7.2. Auto-spill architecture
Auto-spill is the key mechanism for keeping the LLM context bounded. Without
it, large outputs accumulate in the contents list and inflate every
subsequent API call (because both APIs replay the full history).
flowchart TD
Output["Tool produces output"]
SizeCheck{"size > OUTPUT_PREVIEW_BYTES<br/>(4 KB)?"}
Inline["Return full output inline"]
Spill["Write full output to<br/>/tmp/data/_out/N.txt"]
Preview["Return preview<br/>(head + tail) +<br/>file path metadata"]
Output --> SizeCheck
SizeCheck -->|"<= 4 KB"| Inline
SizeCheck -->|"> 4 KB"| Spill --> Preview
Three spill paths share the same _spill_counter to prevent filename
collisions:
sandbox_exec spill. When stdout or stderr exceeds
OUTPUT_PREVIEW_BYTES (4096 bytes), the full output is written to
/tmp/data/_out/{counter}.txt via _spill_field(). The model receives:
{
"exit_code": 0,
"stdout": "<first 4KB>",
"stdout_truncated": true,
"stdout_file": "/tmp/data/_out/0.txt",
"stdout_bytes": 32768,
"stdout_lines": 1024,
"stdout_tail": "<last 512 bytes>"
}
The preview (head + tail) gives the model enough context to decide whether
to process the full file with jq/grep/head.
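A sketch of the per-field decision, assuming a 512-byte tail (matching the example response above) and omitting the actual write to /tmp/data/_out/{counter}.txt:

```python
OUTPUT_PREVIEW_BYTES = 4096   # spill threshold from the text
TAIL_BYTES = 512              # assumed tail size, per the example response

def preview_fields(field: str, data: bytes) -> dict:
    """Return either the full output inline or a head+tail preview
    with truncation metadata for one field (stdout or stderr)."""
    if len(data) <= OUTPUT_PREVIEW_BYTES:
        return {field: data.decode(errors="replace")}
    return {
        field: data[:OUTPUT_PREVIEW_BYTES].decode(errors="replace"),
        f"{field}_truncated": True,
        f"{field}_bytes": len(data),
        f"{field}_lines": data.count(b"\n"),
        f"{field}_tail": data[-TAIL_BYTES:].decode(errors="replace"),
    }
```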
Data source inline spill. When a direct data source call returns text
larger than MAX_INLINE_SIZE (4096 bytes), _spill_data_source() writes
it to /tmp/data/_out/{name}_{counter}.txt and returns:
{
"saved_to": "/tmp/data/_out/konflux_list_pipelineruns_1.txt",
"bytes": 85432,
"lines": 2100,
"preview": "<first 4KB>"
}
Streaming responses (e.g., gitlab_get_repo_archive which returns a
StreamingResponse with suffix .tar.gz) are written via
_spill_streaming(). Text streams (binary=False, the default) capture a
UTF-8 preview from the head while piping chunks to the sandbox. Binary
streams (binary=True) skip preview capture entirely and return only
{saved_to, bytes}, avoiding meaningless decoded output for formats like
tar.gz. The StreamingResponse.suffix field controls the file extension
(.jsonl, .tar.gz, .log).
Non-streaming binary responses are written to .bin files with no preview
via _spill_binary().
fetch_to_sandbox spill. Always writes to the caller-specified path
(not /tmp/data/_out/). Returns metadata only (no preview). The model uses this
when it wants a specific filename for later processing.
7.3. Why auto-spill instead of rejecting large outputs
The earlier design rejected large data source responses with “use
fetch_to_sandbox instead.” This caused two wasted iterations per rejection:
the model makes the call, gets rejected, then has to repeat with
fetch_to_sandbox. With auto-spill, the data flows to a sandbox file
transparently. Validation showed ~10% fewer iterations with auto-spill.
7.4. Why fetch_to_sandbox still exists alongside auto-spill
Auto-spill handles the common case, but fetch_to_sandbox provides:
- Explicit path control. The model can choose meaningful filenames (/tmp/data/pipelineruns.json) instead of getting auto-generated names (/tmp/data/_out/konflux_list_pipelineruns_1.txt).
- Batch fetching. fetch_batch_to_sandbox combines multiple fetches in one tool call, saving iterations.
- No preview overhead. fetch_to_sandbox returns metadata only, which is useful when the model knows it will process the file with sandbox_exec anyway.
7.5. ToolDef notes
ToolDef has an optional notes field for domain knowledge that belongs
in the system prompt but not in the tool’s JSON schema. Examples:
- Konflux notes explain dual K8s/Kubearchive fetch, UID deduplication, and the two label selectors (BUILD vs. TEST).
- Testing Farm notes explain XML result structure and usage patterns.
ToolRegistry.get_tool_notes() collects all non-None notes into a
## Data Source Notes section that is prepended to the workflow body in
the system prompt:
BASE_SYSTEM_PROMPT + tool_notes + workflow_body
This keeps domain knowledge close to the tool definitions (in the data
source module) rather than duplicated in every workflow .md file.
7.6. Response metadata
register_data_source() accepts optional response_metadata – a dict
that is merged into every response from that tool (inline, auto-spill, and
fetch_to_sandbox). Used for:
- Konflux: {"konflux_ui": "https://..."} so the model can build reviewer-facing PipelineRun links.
- Testing Farm: {"artifacts_base": "https://..."} so the model can build artifact links.
This avoids having the model ask “what is the Konflux UI URL?” – the information arrives with every tool response.
8. Sandbox Architecture
8.1. The Sandbox protocol
All backends implement an 8-method typing.Protocol:
class Sandbox(Protocol):
    def start(self) -> None: ...                 # create container/pod, pre-create /tmp/data/
    def exec(self, command, *, stdin_data: bytes | None = None) -> ExecResult: ...  # sh -c in sandbox
    def write_file(self, path, data: bytes) -> None: ...   # stdin pipe: cat > path
    def stream_to_file(self, path, chunks: Iterable[bytes]) -> tuple[int, int]: ...  # Popen stdin pipe, returns (bytes, lines)
    def stream_exec(self, command, chunks: Iterable[bytes]) -> ExecResult: ...  # Popen with streaming stdin
    def read_file_iter(self, path) -> Iterator[bytes]: ...  # Popen stdout pipe in chunks
    def cleanup(self) -> None: ...               # rm -f container / delete pod
    def linger(self, session_id) -> None: ...    # keep alive for reuse, or fall back to cleanup
ExecResult is (exit_code: int, stdout: str, stderr: str).
stream_to_file() and stream_exec() use subprocess.Popen with a stdin
pipe, writing chunks incrementally. This avoids buffering large payloads in
orchestrator memory (e.g., streaming a tar.gz archive into the sandbox).
read_file_iter() reads from a Popen stdout pipe in 64 KB chunks.
write_file() uses stdin piping (cat > path), never host volume mounts.
This is critical: it means data flows through the orchestrator process, not
through a shared filesystem. The sandbox has no host mounts.
8.2. Backend selection
Three sandbox backends are available:
| Backend | --sandbox | When to use |
|---|---|---|
| PodmanSandbox | podman | Local development (default for run) |
| K8sSandbox | k8s | Direct pod creation on a K8s cluster |
| K8sPoolSandbox | k8spool | Pre-warmed pool via Deployment (default for serve) |
create_sandbox() factory handles podman and k8s. When --sandbox k8spool is selected, __main__.py creates a SandboxPool and passes it
to run_workflow(), which instantiates K8sPoolSandbox directly.
K8s is never auto-detected from the environment. It requires explicit CLI flags. This prevents accidental use of a K8s sandbox when developing locally.
For K8s namespace resolution (k8s and k8spool):
- If --namespace is provided, use it.
- If --context is provided, extract the namespace from the kubeconfig context. If the context has no default namespace, raise an error.
- If neither is provided (in-cluster), use sandbox.namespace from the config file.
8.3. PodmanSandbox
podman run -d --name hb-sandbox-{uuid8} --network=none --user 65532 \
--workdir /tmp {image} sleep infinity
- Unique container name with UUID suffix prevents collisions
- sleep infinity keeps the container alive for repeated exec calls
- --network=none provides complete network isolation
- --user 65532 is a fixed non-root UID (matches nonroot in distroless)
- Cleanup: podman rm -f (force, in case exec is still running)
8.4. K8sSandbox: the hybrid approach
The K8s sandbox uses a hybrid of two tools:
- kubernetes Python library for pod lifecycle: create_namespaced_pod, read_namespaced_pod (poll for Running), delete_namespaced_pod.
- kubectl exec subprocess for command execution.
Why not use the kubernetes Python library for exec too? Three problems discovered during development:
- No stdin EOF signal in WebSocket v1-v4. The Kubernetes exec protocol uses WebSocket channels (stdin=0, stdout=1, stderr=2). Protocol versions 1-4 have no mechanism to signal “stdin is done.” Commands like cat > /file hang forever waiting for more input. Python client support for v5 does not exist.
- BrokenPipeError on large stdin. Sending more than ~1MB through the WebSocket stream() API causes pipe errors, breaking write_file() for large data source responses.
- Unbounded memory from stream(). The stream() function accumulates all stdout/stderr data in memory with no streaming control. A command producing megabytes of output would consume unbounded memory.
kubectl exec as a subprocess avoids all three problems and provides the
same interface as podman exec – stdin via subprocess.PIPE, stdout/stderr
captured, exit code from return code. The implementation in exec() is
nearly identical between PodmanSandbox and K8sSandbox.
8.5. Pod manifest design
The K8s pod manifest is built by _build_pod_manifest():
metadata:
labels:
app.kubernetes.io/name: hummingbird-agent-sandbox
spec:
automountServiceAccountToken: false # no K8s API from sandbox
activeDeadlineSeconds: 1800 # 30min hard timeout, backstop
restartPolicy: Never
securityContext:
runAsNonRoot: true
seccompProfile: RuntimeDefault
containers:
- name: sandbox
command: ["sleep", "infinity"]
workingDir: /tmp
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
resources:
requests: {cpu: 100m, memory: 256Mi, ephemeral-storage: 256Mi}
limits: {cpu: "1", memory: 1Gi, ephemeral-storage: 2Gi}
runAsUser is deliberately omitted. On OpenShift, the restricted-v2 SCC
assigns a UID from the namespace UID range. On vanilla K8s, the image’s
USER directive is used. Setting an explicit UID would conflict with
OpenShift’s SCC admission.
8.6. Auth modes
- Local development: the --context flag passes the kubeconfig context to config.new_client_from_config(context=...), creating a per-instance ApiClient.
- Production (in-cluster): context=None triggers config.load_incluster_config(), using the pod’s ServiceAccount token.
The kubectl exec commands include --context when running locally but
omit it when in-cluster (kubectl uses the default in-cluster config).
8.7. Pre-created directories
Both backends run mkdir -p /tmp/data in start() after the
container/pod is up. This provides a working directory for the model
without consuming an iteration. /tmp/data/_out (the spill directory) is
created on demand by mkdir -p $(dirname ...) in write_file().
8.8. K8sPoolSandbox: Deployment-backed pool
On-demand pod creation adds 5-30 seconds of latency per workflow (image pull, scheduling, container start). For interactive use (slash commands, reply-based resumption), this delay is user-facing. The pool eliminates it.
Mechanism. A Kubernetes Deployment maintains a set of pods labelled
hummingbird/role: standby. When a sandbox is needed, SandboxPool.claim()
finds a Running standby pod, patches it to role: active, clears its
ownerReferences (detaching it from the ReplicaSet), and annotates it with
hummingbird/reap-by (an absolute UTC deadline). The Deployment controller
sees the ReplicaSet is under the desired replica count and creates a
replacement.
Pod lifecycle.
- Deployment creates pod -> role: standby (managed by ReplicaSet)
- claim() patches -> role: active, ownerReferences: [], reap-by: now + max_active_seconds
- Agent uses the pod via inherited K8sSandbox.exec()
- On success: linger() patches -> reap-by: now + linger_seconds, session-id: <id> (pod stays alive)
- On reply within linger window: try_reclaim() patches -> reap-by: now + max_active_seconds, clears session-id
- On failure or linger expiry: pod is deleted (by cleanup() or the reaper)
Pod lingering. After a successful workflow, pool sandboxes are kept
alive for linger_seconds (default 300) instead of being deleted
immediately. The pod is annotated with hummingbird/session-id to link it
back to the session. If a user reply arrives within the linger window,
try_reclaim() finds the pod by session-id, clears the annotation (marking
it as in-use), and resets reap-by. This skips pod creation and S3 archive
restoration. If no reply arrives, the reaper deletes the pod when its
reap-by deadline passes.
Concurrency safety. session-id is only present while lingering. Its
absence during active execution prevents concurrent replies from adopting
the same pod. try_reclaim() is serialized by a threading lock; the first
caller wins, others fall back to S3 restore with a fresh pod.
Reaping. reap_expired() is called once per workflow execution (after
sandbox acquisition, before the model loop). It performs a single sorted
pass over all active pods:
- Non-lingering pods (no session-id) past their reap-by deadline are deleted immediately.
- Lingering pods (with session-id) are sorted by reap-by ascending. The loop deletes pods that are either expired (now > reap-by) or exceed max_lingering_pods (default 2, configurable), starting with those closest to their deadline. The loop stops when both conditions are satisfied (now <= reap-by and remaining count <= limit).
This replaces the previous design where _reap_expired() ran inside
claim() on every poll iteration. Moving it to a once-per-workflow call
reduces API load and centralizes cleanup. The max_lingering_pods cap
prevents unbounded pod accumulation from many short-lived workflows.
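The single sorted pass can be sketched as a pure function over pod records. Field names are simplified stand-ins for the hummingbird/session-id and hummingbird/reap-by annotations, and the function returns the deletion list instead of calling the K8s API:

```python
import time

def reap_plan(pods: list[dict], max_lingering: int = 2, now=None) -> list[dict]:
    """Decide which active pods to delete in one sorted pass."""
    now = time.time() if now is None else now
    # Non-lingering pods past their deadline go immediately.
    doomed = [p for p in pods
              if p["session_id"] is None and now > p["reap_by"]]
    # Lingering pods: delete while the head is expired or the count
    # still exceeds the cap, starting closest to the deadline.
    lingering = sorted((p for p in pods if p["session_id"] is not None),
                       key=lambda p: p["reap_by"])
    while lingering and (now > lingering[0]["reap_by"]
                         or len(lingering) > max_lingering):
        doomed.append(lingering.pop(0))
    return doomed
```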
Indefinite wait. claim() polls until a pod is available or the
shutdown event fires. It logs at DEBUG level on each poll, escalating to
WARNING after ~2 minutes. This is normal behavior when the pool (Deployment
replicas) is smaller than max_concurrent_agents – the SQS semaphore caps
concurrency, so demand will not permanently exceed supply.
Config simplification. In pool mode, the pod image, resources, metadata,
and security context are defined solely in the Deployment template. The
agent config only needs sandbox.namespace and
sandbox.active_deadline_seconds. This eliminates duplication between the
agent configmap and the Deployment.
Class design. K8sPoolSandbox inherits from K8sSandbox to reuse
exec(), write_file(), read_file_iter(), and cleanup(). It overrides
__init__() (does not call super().__init__() since that resolves
kubeconfig), start() (claims from pool instead of creating a pod), and
linger() (patches annotations instead of deleting). It adds
start_from_existing() for the reclaim path. All sandbox backends
implement linger(session_id): non-pool backends fall back to cleanup().
9. Data Sources
Data sources are external APIs wrapped as tool-calling functions. The model invokes them by name; the orchestrator executes them and returns results (inline or auto-spilled). Each data source module follows the same pattern:
- Define TOOL_DEFS – a list of ToolDef objects with names, descriptions, parameter schemas, and optional notes.
- Define implementation functions that take a pre-configured client as the first argument.
- Provide a register() function that creates the client, wraps implementations with functools.partial, and calls registry.register_data_source() for each tool.
Credentials are captured at registration time via functools.partial
closures. They are never stored as global state and never leak into tool
definitions or model contents.
9.1. GitLab
Uses python-gitlab library with retry_transient_errors=True for
automatic retry on transient HTTP errors.
Tools:
- gitlab_get_mr_details – MR metadata (title, author, SHA, labels, URLs)
- gitlab_get_commit_statuses – all CI/CD statuses for a commit SHA (paginated automatically). Covers both Konflux external statuses and native GitLab CI job statuses. Each status includes target_url (job page link containing the job ID) and allow_failure.
- gitlab_get_mr_diff – changed files in the MR
- gitlab_get_file_at_ref – raw file content at a git ref
- gitlab_get_repo_archive – repository tar.gz via repository_archive(iterator=True). Returns a StreamingResponse with chunked binary data (suffix .tar.gz), so the archive streams to the sandbox without buffering in orchestrator memory.
- gitlab_get_job_log – job trace (log output) for a GitLab CI job. Uses lazy=True on the job object and trace(iterator=True) for streaming. Each chunk is decoded with errors='replace' and ANSI escape codes are stripped per-chunk before re-encoding. Returns a StreamingResponse (suffix .log). Per-chunk ANSI stripping is safe because escape sequences are <20 bytes and chunks are 1024+ bytes.
No response_metadata is set because GitLab commit statuses already contain
target_url fields that the model uses for linking.
gitlab_get_mr_discussions – trust filtering and redaction.
MR discussions are the first data source that feeds user-generated free text into the model prompt. Unlike diffs and CI logs (which are code or machine output), discussion comments can contain arbitrary prose written by anyone who can post on the MR – including external contributors on public projects. This creates a prompt injection vector: an attacker posts a comment containing instructions that manipulate the model’s review output.
The discussions tool addresses this with a three-layer defence:
- Author trust gate. Each note’s author is checked for Developer+ access (level >= 30) on the project via members_all.get(author_id). Results are cached per author_id within a single call to avoid repeated API lookups. System notes (merge events, label changes) bypass the author check. Agent-authored notes are trusted via their author’s Developer+ access (the bot account holds Developer access on configured projects). Note: agent note detection for redaction purposes (stripping transcripts and footers) uses session marker presence, but trust is always based on the author’s access level, not the marker. This prevents marker injection from granting trust to non-member notes.
- Redaction, not sanitization. Untrusted notes in a discussion that also contains trusted notes are replaced with a fixed placeholder ([redacted: non-member comment]). The placeholder preserves the conversation structure (the model sees that someone replied, but not what they said). Discussions where all notes are untrusted are dropped entirely. No attempt is made to sanitize or escape untrusted content – there is no reliable escaping mechanism for LLM prompts, so the only safe option is to withhold the content entirely.
- Agent note stripping. Agent-authored notes contain large <details><summary>Agent transcript</summary>...</details> blocks (15K-90K chars of raw tool calls), metadata footers, and session markers. These are stripped before the note enters the model prompt, leaving only the review text. This serves dual purposes: token efficiency and avoiding feeding the model its own raw tool call history (which could cause degenerate self-referential loops).
Residual risks and scope limitations:
- A compromised Developer+ account can inject adversarial content. This is accepted as equivalent to the existing risk of a compromised developer pushing malicious code (which the agent would also process).
- The trust threshold is project-level (Developer role on the project), not MR-level. A developer on the project can influence any MR’s review.
- Comments posted after the discussions are fetched but before the review is posted are not seen. This is a TOCTOU gap but has no security impact (the model simply misses late comments).
9.2. Konflux
Uses raw requests against K8s and Kubearchive APIs (not the kubernetes
Python SDK). This avoids the heavy kubernetes client dependency for what is
essentially bearer-token HTTP with label selectors.
Dual-fetch architecture:
flowchart LR
subgraph fetch [Fetch Phase]
KA["Kubearchive<br/>(historical)"]
K8s["K8s API<br/>(live)"]
end
Combine["Combine items"]
Dedup["Deduplicate<br/>by metadata.uid"]
KA --> Combine
K8s --> Combine
Combine --> Dedup
For each resource type (PipelineRuns, TaskRuns), the client fetches from
both Kubearchive (completed/historical resources) and the live K8s API,
combines the results, and deduplicates by metadata.uid. This ensures no
resources are missed regardless of whether they have been archived yet.
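The combine-and-dedup step reduces to a small set-based filter; a minimal sketch (the helper name is illustrative, not the actual client API):

```python
def dedup_by_uid(kubearchive_items, live_items):
    """Combine Kubearchive and live K8s results, dropping duplicates.

    A resource that has been archived but not yet garbage-collected from
    the live cluster appears in both sources; metadata.uid is stable
    across them, so it is the natural dedup key.
    """
    seen = set()
    combined = []
    for item in list(kubearchive_items) + list(live_items):
        uid = item["metadata"]["uid"]
        if uid not in seen:
            seen.add(uid)
            combined.append(item)
    return combined
```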
Two label selectors. Konflux uses different labels for BUILD and TEST PipelineRuns:
- BUILD: `pipelinesascode.tekton.dev/sha=<commit_sha>`
- TEST: `pac.test.appstudio.openshift.io/sha=<commit_sha>`
Each get_pipelineruns()/get_taskruns() call queries both selectors and
combines the results.
Pod log fetching. get_pod_log() accepts an optional container
parameter. When specified, it fetches logs for that single container. When
omitted, it discovers all containers from the pod spec and concatenates
their logs with === container_name === headers. Each individual log
fetch tries Kubearchive first, then falls back to the live K8s API. Logs
may be unavailable from both sources if the pod has expired.
Streaming pagination. K8s list endpoints can return pages of 80+ MB
when a commit touches many components (e.g. 500 PipelineRuns per page).
iter_paginated() uses session.get(stream=True) and ijson.parse() to
stream-parse each page: items are yielded one at a time via
ObjectBuilder, and the metadata.continue pagination token is captured
from the same parse pass. resp.raw.decode_content = True is set so
urllib3 transparently decompresses gzip/deflate Content-Encoding inline
(Kubearchive returns gzip-compressed responses). Only one item is in
memory at a time – O(single_item) regardless of page size. Truncated or
malformed streams raise IncompleteJSONError, which is caught and treated
like a network error (log a warning, stop iterating that source).
Credential resolution. KonfluxClient.__init__() parses the kubeconfig
file to extract the API server URL and bearer token for the cluster. The
cluster domain is extracted from the cluster_url config value. This
approach avoids depending on kubectl or the kubernetes Python library for
API authentication.
Response metadata: {"konflux_ui": "https://konflux-ui.apps.<domain>/ns/<namespace>"} is
merged into every response so the model can build reviewer-facing links
like [{name}]({konflux_ui}/pipelinerun/{name}).
9.3. Testing Farm
Uses requests.Session with a module-level retry adapter (429, 5xx) for
resilience against transient errors.
Tools:
- `tf_get_results` – JUnit XML results for a request ID
- `tf_get_test_log` – individual test log by URL (from results.xml); restricted to the `ARTIFACTS_BASE` prefix to prevent SSRF
- `tf_get_request_status` – request state, queue/run times
ToolDef notes on tf_get_results document the XML structure:
//testsuites/@overall-result, //testcase/@result, //testcase/logs/log
with @name and @href. This domain knowledge goes into the system prompt
so the model knows how to parse the XML with sandbox_exec using
python3's `xml.etree.ElementTree`.
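Following that documented structure, a results.xml parse the model might run in the sandbox could look like this (sketch only; element and attribute names are taken from the ToolDef notes above):

```python
import xml.etree.ElementTree as ET

def summarize_results(xml_text: str):
    """Extract the overall result and per-testcase failures from a
    Testing Farm results.xml."""
    root = ET.fromstring(xml_text)          # <testsuites> element
    overall = root.get("overall-result")    # //testsuites/@overall-result
    failures = []
    for tc in root.iter("testcase"):
        if tc.get("result") != "passed":    # //testcase/@result
            logs = [(log.get("name"), log.get("href"))
                    for log in tc.iter("log")]  # //testcase/logs/log
            failures.append({"name": tc.get("name"),
                             "result": tc.get("result"),
                             "logs": logs})
    return overall, failures
```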
Response metadata: {"artifacts_base": "https://artifacts.osci.redhat.com/testing-farm"}
is merged into every response so the model can build artifact links like
[{request_id}]({artifacts_base}/{request_id}/).
9.4. Registration flow
data_sources.register_selected() is the entry point called by runner.py.
It iterates the workflow config’s data_sources dict and registers only the
declared sources:
for ds_name, ds_cfg in wf_cfg.data_sources.items():
if ds_name == "gitlab":
token = resolve_tool_token(proj_entry, "gitlab", wf_cfg)
gitlab.register(registry, gitlab_url, token)
elif ds_name == "konflux":
cluster_url = resolve_cluster_url(ds_cfg)
kubeconfig_path = os.environ.get(ds_cfg.kubeconfig_env, "")
konflux.register(registry, cluster_url, kubeconfig_path, ds_cfg.kubearchive_url)
elif ds_name == "testing_farm":
testing_farm.register(registry)
This selective registration means a workflow with data_sources: {gitlab: {...}}
only exposes GitLab tools to the model. Konflux and Testing Farm tools do
not appear in the tool definitions, preventing the model from attempting to
use unconfigured data sources.
9.5. Data flow: streaming vs buffered
Data moves from external APIs through the orchestrator to the model (inline or auto-spilled to sandbox). The memory profile of each tool depends on whether the HTTP response is consumed incrementally or buffered entirely.
Streaming (preferred for large/unbounded responses). The HTTP response
is consumed incrementally – via ijson stream parsing, iterator=True
in python-gitlab, or lazy pagination. Only one chunk or item is in memory
at a time. Used by:
- `iter_paginated()` (Konflux) – `session.get(stream=True)` + `ijson`
- `get_commit_statuses()` (GitLab) – `statuses.list(iterator=True)` via `_iter_commit_statuses()`, yielding JSONL bytes one status at a time
- `get_repo_archive()` (GitLab) – `repository_archive(iterator=True)`, returns chunked tar.gz binary via `StreamingResponse(suffix=".tar.gz", binary=True)` – no text preview is generated
- `get_job_log()` (GitLab) – `trace(iterator=True)` with per-chunk ANSI stripping, returns `StreamingResponse(suffix=".log")`
All produce StreamingResponse objects that flow through
stream_to_file() into the sandbox without accumulating in orchestrator
memory. StreamingResponse.suffix controls the auto-spill filename
extension (.jsonl, .tar.gz, .log). StreamingResponse.binary
controls whether a head preview is captured (False for text, True to
skip for opaque binary formats).
Buffered (acceptable for small/bounded responses). The full response is loaded into memory as a string or dict. This is fine when the response size is bounded and small (typically < 1 MB). Used by:
- `get_mr_details()` (GitLab) – single MR object, < 10 KB
- `get_file_at_ref()` (GitLab) – single file, bounded by repo constraints
- `tf_get_results()`, `tf_get_test_log()`, `tf_get_request_status()` (Testing Farm) – XML/text/JSON, typically < 1 MB
Buffered with risk (candidates for future streaming). Same buffered pattern but the response size is not bounded by design. Auto-spill mitigates the context growth problem (large outputs are written to sandbox files), but the orchestrator still spikes RSS during the fetch:
- `get_pod_log()` (Konflux) – `resp.text` per container, concatenated. Multi-container pods accumulate all logs.
- `get_mr_diff()` (GitLab) – full diff JSON, scales with MR size. GitLab truncates server-side but the result can still be large.
Invariant: paginated K8s list endpoints must always use streaming. Page sizes scale with the number of components in the commit – a single page can contain hundreds of PipelineRuns or TaskRuns (80+ MB JSON). Buffering these responses risks OOM under normal production workloads, especially with concurrent workflows.
10. Event Pipeline and Production Operations
10.1. Two-stage SQS pipeline (ingress router + FIFO worker)
Events flow through a two-stage pipeline to serialize per-session work:
flowchart LR
STD["Standard SQS<br/>(ingress)"]
R["Router"]
FIFO["SQS FIFO<br/>(work queue)"]
W["Worker<br/>(poll_loop)"]
STD -->|ReceiveMessage| R
R -->|"SendMessage<br/>MessageGroupId"| FIFO
W -->|"continuation<br/>group=session_id"| FIFO
FIFO -->|ReceiveMessage| W
Ingress router. events.router_loop() is a single-threaded loop
that long-polls the standard (ingress) queue, peeks into each SNS
envelope to assign a MessageGroupId, forwards the raw message body
to the FIFO queue, then deletes from the standard queue.
- Note events: `MessageGroupId = discussion_id` from the webhook body. This serializes replies to the same agent session – FIFO delivers at most one in-flight message per group.
- Pipeline / MR events: `MessageGroupId = SQS MessageId` (unique per message). No serialization – each event is its own group.
- MessageDeduplicationId: Always the SQS `MessageId` from the standard queue. Absorbs duplicate deliveries from standard SQS's at-least-once semantics (5-minute dedup window).
If SendMessage to FIFO fails, the message is not deleted from the
standard queue and retries via visibility timeout.
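The routing rule itself is a small function of the message body. A sketch, with a hypothetical function name and assuming the standard GitLab webhook payload shape (`object_kind`, `object_attributes.discussion_id`):

```python
import json

def assign_group_id(raw_body: str, sqs_message_id: str) -> str:
    """Pick the FIFO MessageGroupId for one ingress message.

    Note events serialize on the GitLab discussion ID; everything else
    gets a unique group (the SQS MessageId) and runs unserialized.
    """
    envelope = json.loads(raw_body)               # SNS notification
    event = json.loads(envelope.get("Message", "{}"))
    if event.get("object_kind") == "note":
        discussion_id = event.get("object_attributes", {}).get("discussion_id")
        if discussion_id:
            return str(discussion_id)
    return sqs_message_id
```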
FIFO worker. events.poll_loop() long-polls the FIFO queue,
decodes SNS envelopes, and dispatches to handle_event. The
max_concurrent_agents semaphore caps parallel groups being processed.
FIFO guarantees that within a group, only one message is in-flight.
Two-phase processing. When the FIFO worker picks up a webhook
(Phase 1), the handler validates the event, resolves the session_id,
then posts a continuation message back to the same FIFO with
MessageGroupId = session_id. The Phase 1 message is deleted quickly
(~1-5s). Phase 2 processes the continuation: loading session state,
building WorkflowRequest, and calling _execute_workflow. Because
Phase 2 continuations share a MessageGroupId per session, all work
for a given session is serialized – even if it spans multiple GitLab
discussion threads.
Uniform message format. Both webhook messages (from SNS) and
continuation messages use the same SNS-style envelope with gzip+base64
compression. encode_envelope() is the inverse of
decode_sns_message(). The consumer doesn’t need to distinguish
between webhook and continuation messages at the transport layer.
Graceful shutdown. SIGTERM/SIGINT set a shutdown_event. Both
loops exit. Messages in SQS become visible again after the visibility
timeout for other consumers.
Why two queues. The standard queue receives events from SNS (via subscription). The FIFO queue serializes per-session work. The gitlab-event-forwarder and SNS subscription are unchanged – the routing logic lives entirely in the agent codebase.
10.2. SNS envelope decoding
Messages arrive as SNS notifications with:
- `MessageAttributes.source` – event source (e.g., `"gitlab"`)
- `MessageAttributes.event_type` – event type (e.g., `"pipeline"`, `"merge_request"`, `"note"`)
- `MessageAttributes.content_encoding` – `"gzip+base64"` for gitlab-event-forwarder, absent for kubernetes-event-forwarder
- `Message` – the actual event body (JSON string, or gzip+base64 encoded)
decode_sns_message() handles both encoding formats transparently.
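A minimal sketch of the dual-format decode (illustrative name; the real `decode_sns_message()` also surfaces the other message attributes to the caller):

```python
import base64
import gzip
import json

def decode_message(notification: dict) -> dict:
    """Decode an SNS notification body, handling both the plain-JSON
    and gzip+base64 encodings described above."""
    attrs = notification.get("MessageAttributes", {})
    encoding = attrs.get("content_encoding", {}).get("Value")
    body = notification["Message"]
    if encoding == "gzip+base64":
        body = gzip.decompress(base64.b64decode(body)).decode("utf-8")
    return json.loads(body)
```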
10.3. Event routing
runner.handle_event() dispatches on (source, event_type):
flowchart TD
Event["Event arrives"]
Check{"source/type?"}
Cont["handle_continuation()"]
Pipeline["handle_pipeline()"]
MRHandler["handle_merge_request()"]
Note["handle_note()"]
Ignore["Ignore"]
Event --> Check
Check -->|"agent/continuation"| Cont
Check -->|"gitlab/pipeline"| Pipeline
Check -->|"gitlab/merge_request"| MRHandler
Check -->|"gitlab/note"| Note
Check -->|"other"| Ignore
Pipeline --> FilterStatus{"status failed?<br/>source = merge_request_event?"}
FilterStatus -->|"yes"| PipeLookup["filter: trigger=pipeline<br/>+ ignore_users"]
FilterStatus -->|"no"| Skip1["Skip"]
PipeLookup --> Enqueue1["_enqueue_continuation()"]
MRHandler --> FilterMRAction{"action in open/reopen/update?<br/>draft = false?"}
FilterMRAction -->|"yes"| MRLookup["filter: trigger=merge_request<br/>+ ignore_users + ignore_branches"]
FilterMRAction -->|"no"| Skip1b["Skip"]
MRLookup --> Enqueue2["_enqueue_continuation()"]
Note --> FilterMRNote{"MR note?<br/>action = create?"}
FilterMRNote -->|"yes"| NoteType
FilterMRNote -->|"no"| Skip3["Skip"]
NoteType{"Note type?"}
NoteType -->|"author_id == bot_id"| Skip4["Skip (own note)"]
NoteType -->|"/hummingbird help"| Help["post_simple_reply (help)"]
NoteType -->|"/hummingbird wf-name"| SlashCmd["_resolve_slash_workflow()<br/>→ _enqueue_continuation()"]
NoteType -->|"DiscussionNote reply"| FindSession["find_session_for_reply()"]
NoteType -->|"other"| Skip5["Skip"]
FindSession --> Found{"session found?"}
Found -->|"yes"| Enqueue3["_enqueue_continuation()<br/>(resume_session)"]
Found -->|"no"| Skip6["Skip"]
Cont --> ContType{"work_type?"}
ContType -->|"new_session"| ExecNew["_execute_workflow()"]
ContType -->|"resume_session"| LoadS3["load session from S3<br/>→ _execute_workflow()"]
Phase 1 handlers (handle_pipeline, handle_merge_request,
handle_note) perform validation, auth checks, and session resolution,
then post a WorkOrder continuation message via _enqueue_continuation().
They do not call _execute_workflow directly.
Phase 2 (handle_continuation) receives the WorkOrder, resolves
the workflow config, builds a WorkflowRequest, and calls
_execute_workflow. For resume_session orders, it loads the previous
session from S3. _execute_workflow handles SHA dedup and per-workflow
rate limiting via scan_agent_threads.
Slash commands dispatch to a single named workflow via
_resolve_slash_workflow (exact match, then prefix match). Help and error
replies are posted via post_simple_reply (no session marker, no rate
limit impact).
10.4. Pipeline trigger design
Why gitlab::pipeline instead of kubernetes::PipelineRun: A GitLab
pipeline event fires once when the pipeline completes. Since Konflux
external stages are attached to the pipeline, the event naturally waits for
all builds and tests to finish before triggering. This means the agent sees
the full picture in one event, without needing to deduplicate or wait for
stragglers.
Trigger filters:
- `status` in `{failed}` – only failed pipelines trigger analysis
- `source == "merge_request_event"` – only MR pipelines, not branch/tag
10.4a. Merge request trigger design
The handle_merge_request handler fires on MR open, reopen, and update
events (action in {"open", "reopen", "update"}). Draft MRs are skipped
(object_attributes.draft == True); marking a draft as ready triggers a
review since the event arrives with draft: false.
Event classification is deliberately simple: the handler does not inspect
oldrev or changes.draft fields. Instead, SHA-based deduplication in
_execute_workflow (via JSON session markers) ensures each code revision
is reviewed at most once. Metadata-only updates (title, label changes) on
an already-reviewed SHA are silently skipped.
10.5. Note trigger design
Two sub-flows:
Slash command (/hummingbird <workflow-name>): Triggers a specific
named workflow, bypassing rate limiting. _resolve_slash_workflow performs
exact-match lookup first, then falls back to unique prefix matching. If
the subcommand is missing or is “help”, _format_help returns a list of
available workflows with descriptions. Ambiguous or unknown subcommands
produce an error message via post_simple_reply. The note author must
have Developer+ access (level >= 30) on the project, checked via
check_member_access(). If denied, the agent replies in the same
discussion thread with a short access-denied message.
Reply to agent note: When a user replies to an existing agent note (which contains a JSON session marker):
- `find_session_for_reply()` walks the discussion thread, filtering by bot author ID (`get_bot_user_id()`), and returns the first matching session marker. All bot markers in a thread share the same session ID.
- `handle_note` posts a `resume_session` continuation to the FIFO (grouped by `session_id`).
- `handle_continuation` loads the session from S3 (conversation history and sandbox archive), builds a `WorkflowRequest`, and calls `_execute_workflow()` to resume with the user's reply.
- If the session is not found (expired/deleted), the agent replies with a message explaining the session has expired and suggesting to start a new run.
Reply authors are also subject to the Developer+ access check. If denied, the agent replies in the discussion thread with the same access-denied message.
Non-agent-directed notes: Regular comments that are neither slash commands nor replies to agent threads are silently ignored (debug-level log). No access check is performed for these.
Self-note filtering: Notes where author_id matches the bot’s own
user ID (resolved via get_bot_user_id()) are skipped immediately. This
prevents infinite loops where the agent’s own output triggers another
agent run. The author-ID check replaces the previous substring check
(marker_prefix_str in note_body), which was vulnerable to
denial-of-service: a user including the marker prefix in their reply
would cause the bot to silently ignore it.
Threading logic for replies: If the project requires internal notes
(internal_notes: true) but the original discussion was public, the reply
is posted as a new top-level internal note instead of replying in the
public thread. This prevents leaking internal analysis into public threads.
10.6. Note lifecycle
1. create_placeholder_note() -> "Running the <workflow> workflow..."
+ JSON session marker (id, wf, sha)
2. run_workflow() -> agent loop
3. update_note() -> replace placeholder with result
+ reply prompt + JSON session marker
Reply notes (from session resumption) omit the reply prompt since the
user is already engaged in the conversation.
On failure, the placeholder is updated to “Hummingbird analysis failed.” The JSON session marker is always present so the note can be identified as agent-generated (for rate limiting, SHA dedup, and self-note filtering). The marker embeds the workflow name and commit SHA, enabling per-workflow rate limiting and SHA-based deduplication.
10.7. Rate limiting and SHA deduplication
scan_agent_threads() performs a single pass over MR discussions, parsing
JSON session markers (<!-- hummingbird-session: {"id":...,"wf":...,"sha":...} -->)
to determine:
- Per-workflow thread count – how many threads the given workflow has created on this MR. If the count meets or exceeds `max_runs_per_mr`, the workflow is skipped.
- SHA dedup – whether the current commit SHA has already been reviewed by this workflow. If so, the workflow is skipped (prevents re-reviewing the same code on metadata-only MR updates).
Slash commands and replies bypass both checks entirely.
The rate limit is per-workflow per-MR: different workflows maintain
independent thread counts on the same MR. JSON markers are backward
compatible – old plain-UUID markers (<!-- hummingbird-session: UUID -->)
are parsed as {"id": "UUID"} with no workflow or SHA information, so they
are not counted toward any specific workflow’s limit.
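The backward-compatible marker parse can be sketched as follows (hypothetical helper; the exact regex used in the codebase may differ):

```python
import json
import re

MARKER_RE = re.compile(r"<!-- hummingbird-session: (.*?) -->")

def parse_session_marker(note_body: str):
    """Parse a session marker from a note body, accepting both the JSON
    format and the legacy plain-UUID format. Returns None if absent."""
    m = MARKER_RE.search(note_body)
    if not m:
        return None
    raw = m.group(1)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Old plain-UUID marker: no workflow or SHA information, so it
        # cannot count toward per-workflow limits or SHA dedup.
        return {"id": raw}
```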
11. Session System
Sessions enable conversation continuity: a user can reply to an agent note and the agent picks up where it left off, with full context and sandbox files restored.
11.1. What gets saved
Three artifacts are saved after each run:
- `context.json` – the full `contents` list from the agent loop. This is the complete conversation history in Gemini wire format (user turns, model turns with tool calls, tool response turns). It is the minimum state needed to resume the conversation.
- `transcript.md` – a human-readable markdown rendering of the run for debugging and auditing. Not used for resumption.
- `sandbox.tar.gz` – an archive of `/tmp/data/` (the sandbox working directory, which includes the `_out/` spill subdirectory). Contains PipelineRun JSONs, test logs, jq output, and any other files the model created during the run. Restored into the new sandbox on resumption so the model can reference its previous work.
11.2. Storage backends
S3 (production):
s3://{bucket}/sessions/{session_id}/context.json
s3://{bucket}/sessions/{session_id}/transcript.md
s3://{bucket}/sessions/{session_id}/sandbox.tar.gz
Local directory (development, --save-session):
{directory}/context.json
{directory}/transcript.md
{directory}/sandbox.tar.gz
Both backends have the same interface. S3 save is best-effort: wrapped in try/except so a transient S3 error does not prevent the MR note from being posted. The note always gets delivered first.
11.3. Sandbox archive transport
The sandbox archive requires special handling because the orchestrator cannot directly access the sandbox filesystem (no volume mounts):
Archive (inside sandbox):
sb.exec("tar czf /tmp/_archive.tar.gz -C /tmp data")
Stream out (sandbox -> host temp file):
for chunk in sb.read_file_iter("/tmp/_archive.tar.gz"):
tmp.write(chunk) # 64 KB chunks, no base64
Restore (host temp file -> new sandbox):
sb.stream_exec("tar xzf - -C /tmp", file_chunks())
Both directions use streaming binary I/O via Popen stdin/stdout pipes.
No base64 encoding is needed: read_file_iter() yields raw bytes from
the sandbox via a Popen stdout pipe, and stream_exec() feeds raw bytes
into a Popen stdin pipe. The archive is never extracted on the
orchestrator – it exists only as opaque bytes being transported between
sandboxes. This is a security invariant: the orchestrator never parses or
inspects the archive contents.
11.4. Resumption flow
flowchart TD
Reply["User replies to agent note"]
FindSession["find_session_for_reply()<br/>walks discussion thread"]
LoadS3["load_s3(session_id)"]
NotFound{"session found?"}
Expired["Post session-expired reply"]
NewSandbox["Start new sandbox"]
Restore["restore_sandbox(archive)"]
BuildContents["contents = old_contents + user_reply"]
RunLoop["run_agent_loop(<br/>initial_contents, CONTINUATION_PROMPT)"]
Reply --> FindSession --> LoadS3
LoadS3 --> NotFound
NotFound -->|"no"| Expired
NotFound -->|"yes"| NewSandbox
NewSandbox --> Restore --> BuildContents --> RunLoop
Key aspects:
- Fresh iteration budget. The resumed session starts iteration 0 with the full `max_iterations` budget, regardless of how many iterations the previous session used.
- Context window is the real limit. A typical 8-iteration cold start uses ~50-100K input tokens. The restored history is sent in full on every API call. The `CONTEXT_LIMIT` check applies and will force wrap-up if needed.
- Session ID reuse. The resumed session keeps the original session ID. S3 state is overwritten in place with the updated conversation history. Since GitLab discussions are linear (not branching), there is no need for a tree of sessions – the conversation is a single sequential thread.
- Graceful degradation. If the S3 session is not found (expired, deleted), the agent posts a reply explaining the session has expired and suggesting to start a new run. It does not fall back to a cold start.
11.5. Session markers
Every agent note contains a hidden HTML comment with a JSON payload:
<!-- hummingbird-session: {"id":"UUID","wf":"code-review","sha":"abc123"} -->
The JSON payload contains:
- `id` – session UUID (always present)
- `wf` – workflow name (present for new-format markers)
- `sha` – commit SHA at the time of review (present for auto-triggered runs)
This marker serves four purposes:
- Resumption: `find_session_for_reply()` searches discussion threads for this marker to find the session ID. Only bot-authored notes are considered; the first matching marker wins.
- Per-workflow rate limiting: `scan_agent_threads()` counts discussion threads for a specific workflow using the `wf` field. Only bot-authored notes are scanned.
- SHA deduplication: `scan_agent_threads()` checks whether the current SHA has already been reviewed by the workflow using the `sha` field.
- Self-filtering: `handle_note()` compares the webhook event's `author_id` against the bot's own user ID (via `get_bot_user_id()`) to prevent infinite loops.
All marker-scanning functions resolve the bot user ID via gl.auth()
(GET /user) on the orchestrator token. This ensures markers in
non-bot notes (user replies, external contributors) are never parsed,
preventing session hijacking via injected markers.
Old plain-UUID markers (<!-- hummingbird-session: UUID -->) are parsed
as {"id": "UUID"} for backward compatibility. They are counted for
resumption but not for per-workflow rate limiting or SHA dedup (no
wf/sha information).
12. Workflow System
12.1. Separation of prompt and metadata
A workflow has two parts that live in different places:
- Prompt (`.md` file): the LLM system prompt, verbatim. This is the investigation strategy, tool usage guidance, data patterns, and output format. Pure text, no code.
- Metadata (YAML config): operational settings – `action`, `model`, `max_iterations`, `data_sources`, `projects`. This controls what the orchestrator does with the workflow, not what the model does.
This separation means changing the analysis strategy (e.g., adding a new investigation step) requires editing a markdown file. Changing operational parameters (e.g., which projects use this workflow, which model to use) requires editing the YAML config. Neither requires a code change.
12.2. System prompt layering
The system prompt sent to the model is built from three layers:
BASE_SYSTEM_PROMPT # agent.py: sandbox rules, tool usage tips
+ tool_notes # from ToolRegistry: per-data-source domain knowledge
+ workflow_body # full content of e.g. workflows/analyze-failures.md
On session resumption, a fourth layer is appended:
+ CONTINUATION_PROMPT # agent.py: "do not re-run the full workflow"
The system prompt is rebuilt every iteration (to allow warning suffixes to be appended), but the base content is stable. Warning suffixes are appended at the end so they override earlier instructions.
12.3. Design choice: strategy, not procedure
The workflow .md file describes strategy and guidance, not a rigid script.
The model decides when and how to use each tool based on what it sees.
This matters because merge request failures are diverse. A fixed procedure would either miss edge cases (e.g., build failures before tests ran) or waste iterations on steps that don’t apply. By giving the model a strategy (“identify failing tests, fetch details, analyze root causes, group by similarity”), it can adapt to whatever it encounters.
The workflow does structure the investigation into phases (Data Collection, Analysis, Output) for clarity, but these are guidelines, not enforced checkpoints.
12.4. Workflow file anatomy (analyze-failures.md)
The primary workflow is structured as:
- Scope and approach – what this workflow analyzes and what it ignores
- Input – event JSON format (`project`, `iid`)
- Data Collection Phase:
- Data.1: Fetch MR details (direct call, small response)
- Data.2: Fetch commit statuses, identify failures
- Data.3: Batch fetch PipelineRuns + TaskRuns (fetch_batch_to_sandbox)
- Data.4: Process with jq, extract Testing Farm data in bulk
- Analysis Phase:
- Analysis.1a: Investigate each failed PipelineRun individually
- Analysis.1b: Summarize and group by root cause
- Output – markdown template with root causes, collapsible details, clickable links (PipelineRun, Testing Farm, test logs)
- Error Handling – partial report philosophy
Design considerations embedded in the workflow:
- Efficient bulk fetching: PipelineRuns and TaskRuns are fetched in one `fetch_batch_to_sandbox` call, not individually.
- Testing Farm data extracted in Data.4, analyzed in Analysis.1: All TF results.xml are fetched in bulk before analysis starts, enabling cross-failure pattern detection.
- Log fetching is selective: Only 1-2 representative logs per failure pattern, not all logs. This keeps iteration count bounded.
- Reviewer-facing URLs: The output template instructs the model to include clickable links using `konflux_ui` and `artifacts_base` from response metadata.
12.5. Prompt file resolution
get_prompt(workflow_cfg, config_dir) resolves the prompt: field relative
to the config file’s directory:
prompt_path = config_dir / workflow_cfg.prompt
# e.g. /app/config.yaml with prompt: workflows/analyze-failures.md
# -> /app/workflows/analyze-failures.md
In the container image, workflows are baked in at /app/workflows/. A
ConfigMap can override them at deploy time by mounting at the same path.
13. Deployment and Container
13.1. Container build strategy
The Containerfile uses an all-RPM builder+installroot+scratch pattern (no pip, no venv):
- Builder stage: Uses a Fedora-based builder image with dnf. Installs all dependencies as RPMs into a clean `--installroot`.
- Application code: Copies `hummingbird_agent/` and `workflows/` into the installroot at `/app/`.
- Final stage: `FROM scratch`, copies the entire installroot. No package manager, no shell beyond what RPMs provide.
RPM dependencies: python3-boto3, python3-google-auth+requests,
python3-gitlab, python3-kubernetes, python3-pyyaml, python3-requests,
python3-sentry-sdk, kubernetes1.35-client (for kubectl).
This approach was chosen over pip because:
- All deps come from Fedora’s package repository – no PyPI supply chain risk
- Smaller image (~22 MB saved vs. google-genai SDK alone)
- Reproducible builds from known RPM versions
- No compilation step (no gcc/python3-devel in the image)
13.2. Container runtime properties
CMD ["python3", "-m", "hummingbird_agent", "serve"]
WORKDIR /app
USER 65532
The default CMD runs serve mode for production. Local development uses
run mode via explicit command override. USER 65532 matches the standard
nonroot UID used by distroless images and the Podman sandbox.
13.3. K8s deployment manifests
Located in hummingbird-agent/kubernetes/:
deployment.yaml:
- Single replica with rolling update (`maxSurge: 1`, `maxUnavailable: 0`)
- `terminationGracePeriodSeconds: 900` (15 minutes) to allow in-flight agent runs to complete on shutdown
- ServiceAccount: `hummingbird-agent`
- Secrets mounted from K8s Secret `hummingbird-agent`
- Commented-out mounts for workflow ConfigMap and custom CA trust
rbac.yaml (applied in the sandbox namespace):
- ServiceAccount `hummingbird-agent`
- Role with `pods: create, get, list, delete, patch` and `pods/exec: create`
- RoleBinding linking the SA to the Role

`patch` is required for pool mode (relabeling pods during claim).
sandbox-pool.yaml (applied in the sandbox namespace):
- Deployment with `replicas: 3` (tuned to balance latency vs cost)
- Pods labelled `hummingbird/role: standby` for pool discovery
- Same security context and resource limits as direct-creation pods
- `revisionHistoryLimit: 2`, rolling update with `maxSurge: 1`, `maxUnavailable: 0`
networkpolicy.yaml:
- `podSelector: {}` – applies to all pods in the namespace
- `egress: []` – deny all outbound traffic
secret.yaml:
- Template with all required env vars (API keys, tokens, URLs)
- Values must be populated per deployment
13.4. Workflow mounting
Workflows are baked into the image at /app/workflows/. To update workflows
without rebuilding the image:
- Create a ConfigMap from the workflows directory
- Uncomment the volume mount in the deployment YAML
- The ConfigMap mount replaces the baked-in directory (all-or-nothing)
This is useful for rapid iteration in staging without waiting for a new image build.
13.5. Custom CA trust
For clusters with internal CA certificates (common in enterprise environments), the deployment supports OpenShift’s CA injection:
- Create a ConfigMap with the `config.openshift.io/inject-trusted-cabundle` label
- Mount it at `/etc/pki/custom`
- Set `REQUESTS_CA_BUNDLE=/etc/pki/custom/ca-bundle.crt`
OpenShift automatically injects the cluster’s CA bundle into the ConfigMap.
14. Design Decision Registry
Each entry records a decision, the alternatives considered, why the chosen approach won, and what would break if the decision were reversed. This is the most important section for avoiding regressions.
kubectl exec over kubernetes Python exec API
- Chosen: `kubectl exec` as subprocess for sandbox command execution.
- Alternative: kubernetes Python client `stream()` API.
- Why: Three showstopper bugs in the Python client: (1) WebSocket v1-v4 has no stdin EOF signal, so `cat > /file` hangs forever; (2) `BrokenPipeError` on stdin larger than ~1MB; (3) `stream()` accumulates all stdout in memory with no control. `kubectl` avoids all three and provides the same subprocess interface as Podman.
- If reversed: `write_file()` would hang or fail on large data. Pod log retrieval with large outputs would OOM. Sandbox reliability would drop significantly.
YAML config over env vars for operational settings
- Chosen: Single YAML file for workflows, projects, limits, data source declarations.
- Alternative: Everything in env vars (original design).
- Why: Env vars cannot express structured data (workflow-project mappings, per-project token overrides, data source config with multiple fields). YAML provides structure, validation, and audit trails.
- If reversed: Token scoping would be lost (no per-workflow-project token overrides). Project allowlists would be impossible. The config would be unauditable.
Token env var names in YAML, not values
- Chosen: YAML contains `token_env: GITLAB_TOKEN_RO` (the env var name), not the actual token value.
- Alternative: Inline secrets in YAML, or env vars for everything.
- Why: The YAML file can be committed, reviewed, and audited. Actual secrets stay in env vars (injected via K8s Secrets). Inline secrets would make the config file a secret itself.
- If reversed: The config file would become a secret, breaking audit trails and code review workflows.
Orchestrator token prefix convention (ORCHESTRATOR_*)
- Chosen: Orchestrator tokens use `ORCHESTRATOR_GITLAB_TOKEN_*` env var names by convention.
- Alternative: Same env var namespace as model tokens, distinguished by context.
- Why: Structural separation makes it impossible to accidentally pass an orchestrator token to a model tool (or vice versa). The `ORCHESTRATOR_` prefix is never used in YAML `token_env` fields.
- If reversed: A misconfiguration could leak write-capable tokens to the model, which could then expose them via tool calls.
Auto-spill over conversation compaction
- Chosen: Large outputs are saved to sandbox files with previews returned to the model.
- Alternative: Replace old tool results in `contents` with compact summaries (conversation compaction).
- Why: Compaction requires modifying the `contents` list, which risks violating Gemini API constraints (model turns must match preceding tool turns). Auto-spill achieves ~90% of the token reduction with zero risk of breaking the conversation structure.
- If reversed: Token usage would increase ~15-30%. Per-call context would grow unbounded. Sessions would hit the context limit much sooner. Conversation compaction could be added on top of auto-spill in the future, but is not needed with current workloads.
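The spill decision itself is a one-branch filter on the tool-result path. A minimal sketch with assumed cap values and a hypothetical `maybe_spill` name:

```python
MAX_TOOL_OUTPUT = 8_000    # assumed cap (illustrative value)
PREVIEW_BYTES = 1_000

def maybe_spill(output: str, path: str, write_file) -> str:
    """Hypothetical auto-spill: a large tool output goes to a sandbox file,
    and only a preview plus the file path enters the conversation."""
    if len(output) <= MAX_TOOL_OUTPUT:
        return output
    write_file(path, output)   # e.g. via kubectl exec into the sandbox
    return (f"[output of {len(output)} bytes spilled to {path}; "
            f"first {PREVIEW_BYTES} bytes follow]\n{output[:PREVIEW_BYTES]}")
```

Because the large payload never enters `contents`, no later mutation of the history is needed, which is the property the Gemini turn-structure constraint demands.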
Iteration count over cumulative token budget
- Chosen: `max_iterations` as the primary cost control lever.
- Alternative: Cumulative token budget (stop when total billed tokens exceed a threshold).
- Why: With full history replay, each API call re-sends everything. Cumulative billed tokens double-count: call 1 = 5K, call 2 = 10K (including 5K again), total = 15K billed but only 10K new content. With auto-spill keeping per-call size bounded, iteration count is a much simpler and more predictable proxy for actual cost.
- If reversed: The budget model would be confusing and inaccurate. Cost estimates would be wrong. The cumulative metric is still logged for observability, but it does not drive termination.
ToolDef notes in system prompt, not tool schema
- Chosen: Domain knowledge (Konflux dual-fetch, TF XML structure) goes in `ToolDef.notes`, injected into the system prompt.
- Alternative: Put everything in the tool schema `description`.
- Why: Tool schemas have character limits and are sent in the `tools` field of every API call. Long descriptions waste tokens on tool defs. System prompt notes are sent once and can be arbitrarily detailed.
- If reversed: Tool descriptions would be bloated. Domain knowledge would need to be duplicated in every workflow `.md` file.
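The split can be sketched as a two-field tool definition plus a prompt assembler. Field and function names here are illustrative, not the actual `ToolDef` definition:

```python
from dataclasses import dataclass

@dataclass
class ToolDef:
    """Illustrative shape: short schema description, long notes for the prompt."""
    name: str
    description: str   # sent in `tools` on *every* API call - keep short
    notes: str = ""    # injected once into the system prompt

def build_system_prompt(workflow_md: str, tools: list) -> str:
    """Append each tool's notes to the workflow prompt, once per session."""
    notes = "\n\n".join(f"## {t.name}\n{t.notes}" for t in tools if t.notes)
    return workflow_md + ("\n\n# Tool notes\n" + notes if notes else "")
```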
Full history replay (no compaction)
- Chosen: The `contents` list grows monotonically. Items are never removed or modified (except empty response retries).
- Alternative: Compact old turns to reduce context size.
- Why: The Gemini API requires strict turn-by-turn structure. Modifying or removing items risks invalid conversation structure. Auto-spill handles the growth problem at the source (preventing large items from entering history).
- If reversed: Risk of Gemini API errors from malformed conversation structure. Risk of confusing the model (it references previous results that have been summarized away).
fetch_to_sandbox kept alongside auto-spill
- Chosen: Both `fetch_to_sandbox` and auto-spill coexist.
- Alternative: Remove `fetch_to_sandbox`, since auto-spill handles large outputs automatically.
- Why: `fetch_to_sandbox` provides explicit path control (the model can choose meaningful filenames), batch fetching (one tool call for multiple fetches), and no-preview responses (useful when the model will process the file with jq anyway).
- If reversed: The model would lose path control, and batch fetching would require multiple auto-spilled calls. The workflow would need more iterations to achieve the same result.
K8s library for lifecycle + kubectl for exec (hybrid)
- Chosen: Use the `kubernetes` Python library for pod create/read/delete, `kubectl` subprocess for exec.
- Alternative: All-kubectl (subprocess for everything) or all-library.
- Why: The library provides typed pod status polling and clean error handling for lifecycle. `kubectl` provides reliable exec (see “kubectl exec over kubernetes Python exec API” above). Using the library for exec would require solving the three WebSocket bugs.
- If reversed: Either lifecycle management would be fragile (parsing kubectl JSON output for pod status) or exec would be unreliable.
Single-file YAML config over multiple config files
- Chosen: Everything in one `config.yaml` with `settings` and `workflows` sections.
- Alternative: Separate files per workflow, or settings.yaml + workflows.yaml.
- Why: Single source of truth. All project-workflow mappings visible in one place. Settings-level defaults flow down to all projects. No file discovery logic needed.
- If reversed: Token resolution would need cross-file lookups. Project index would need multi-file aggregation. Config validation would be more complex.
Workflow .md as pure prompt, metadata in YAML
- Chosen: The workflow `.md` file is pure system prompt text. Metadata (`action`, `model`, `max_iterations`, `data_sources`, `projects`) lives in the YAML config.
- Alternative: YAML frontmatter in the `.md` file (original design).
- Why: Separation of concerns. The `.md` file is the model’s instructions – it should be readable and editable by anyone writing prompts. The YAML config is the orchestrator’s instructions – it controls routing, limits, and credentials. Mixing them in one file conflates two audiences.
- If reversed: Prompt authors would need to understand YAML config structure. Token/project config would be scattered across `.md` files instead of centralized.
Raw HTTP for Gemini instead of google-genai SDK
- Chosen: `requests` + `google-auth` for Gemini API calls.
- Alternative: `google-genai` Python SDK.
- Why: The SDK brings ~22 MB of transitive deps (pydantic, httpx, websockets). The Gemini REST API is simple camelCase JSON over HTTPS – one endpoint, one request format. Raw HTTP enables a smaller container, simpler tests (mock `requests.post`), plain dict conversation history (easy session serialization), no SDK breakage risk, and a `models/` package structure that supports multiple backends.
- If reversed: The container would be ~22 MB larger. `contents` would use SDK objects instead of plain dicts, complicating session serialization. Adding Claude support would require a separate approach.
Nudge via system prompt suffix, not user message
- Chosen: Empty response nudge and budget warnings are appended to the system prompt as suffixes.
- Alternative: Inject synthetic user messages into `contents`.
- Why: User messages in `contents` must come from the actual user (the initial event, or a reply). Injecting synthetic messages pollutes the conversation history that is saved in sessions. System prompt suffixes are transient – they affect one API call without permanently modifying the conversation state.
- If reversed: Saved sessions would contain synthetic user messages. Resumed conversations would be confusing. The model might respond to the synthetic messages instead of the user’s actual input.
Session ID reuse over new-ID-per-resume
- Chosen: Resumed sessions keep the original session ID. S3 state is overwritten in place.
- Alternative: Each resume generates a new UUID, saving alongside the original (append-only tree of sessions).
- Why: GitLab discussions are linear, not branching. There is no scenario where two different sessions from the same thread are both valid. A new UUID per resume creates orphaned S3 snapshots that are never referenced again, since `find_session_for_reply()` always picks the latest marker. Reusing the ID is simpler, uses less storage, and matches the linear conversation model.
- If reversed: S3 would accumulate orphaned session snapshots. Each resume would need a new note marker, but the old markers would still be in the thread, creating confusion about which session is current.
Reply to unauthorized agent-directed notes instead of silent skip
- Chosen: When an unauthorized user sends a slash command or replies to an agent thread, the agent replies in the same discussion with a short access-denied message. Non-agent-directed notes are silently ignored (debug log, no access check).
- Alternative: Log a warning and skip silently for all unauthorized notes (the original behavior).
- Why: Silent skip gives no feedback to someone who intentionally tried to engage the agent, which is confusing. On the other hand, checking access and logging a warning for every random comment on a public project is noisy and pointless. Moving the access check after intent detection cleanly separates the two cases. The reply uses the discussions API so it automatically inherits confidentiality from the parent note/thread.
- If reversed: Users without sufficient access who try `/hummingbird` would get no feedback. The agent log would be noisy with warnings for every comment on public MRs.
Deployment-backed pod pool over Python-thread pool
- Chosen: A Kubernetes Deployment maintains pre-warmed standby pods. The agent claims a pod by relabeling it; the Deployment replaces it.
- Alternative: A Python-side thread pool that pre-creates pods and queues them for use.
- Why: The Deployment provides self-healing (restarting crashed pods), native scaling (`kubectl scale`), rolling updates (image changes), and monitoring via standard K8s tooling. A Python pool would need to reimplement all of these.
- If reversed: Pod replenishment, crash recovery, and image updates would all need custom code. Scaling would require agent redeployment.
Pod isolation: never reuse sandbox pods
- Chosen: Each workflow run gets its own pod. After cleanup, the pod is deleted. Claimed pods are detached from the ReplicaSet.
- Alternative: Return used pods to the pool and reset them.
- Why: Residual state from a previous run (files, environment, running processes) could leak between MR investigations, creating security and correctness risks. Deletion is simple and foolproof.
- If reversed: Would need a reliable pod-reset mechanism and auditing that no state survives between runs.
Absolute reap-by deadline over relative claimed-at age
- Chosen: Active pods are annotated with `hummingbird/reap-by` (an absolute UTC timestamp). The reaper deletes pods where `now > reap-by`.
- Alternative: Annotate with `claimed-at` and compute age relative to `max_active_seconds`; or use `activeDeadlineSeconds` on the pod spec.
- Why: An absolute deadline simplifies the reaper to a single comparison. It also supports varying deadlines: claim sets `reap-by = now + max_active_seconds`, while linger sets `reap-by = now + linger_seconds`. With a relative timestamp, the reaper would need to know which mode the pod is in. `activeDeadlineSeconds` applies from pod creation, not from claim – standby pods would expire before being used.
- If reversed: The reaper would need mode-aware age calculations. Lingering pods would require a separate annotation or reaper path.
Indefinite claim wait over timeout
- Chosen: `SandboxPool.claim()` polls indefinitely (governed by the shutdown event), logging at DEBUG then WARNING.
- Alternative: Time out after N seconds and raise an error.
- Why: The pool is typically smaller than `max_concurrent_agents` for cost reasons. Waiting for a replacement pod is normal operational behavior, not an error. A timeout would cause spurious failures during burst traffic. The SQS semaphore already bounds concurrency.
- If reversed: Burst traffic would cause avoidable failures instead of brief delays.
Pool config in Deployment only, not in agent configmap
- Chosen: In pool mode (`k8spool`), the pod image, resources, labels, and security context are defined solely in the Deployment template. The agent config only needs `namespace` and `active_deadline_seconds`.
- Alternative: Keep image/resources/metadata in the agent configmap too (as in `k8s` mode).
- Why: Eliminates config duplication. The Deployment template is the single source of truth. Changes to pod resources or image only require updating one place and re-rolling the Deployment.
- If reversed: Config drift between the Deployment and the agent configmap would be a constant risk.
Pod lingering over immediate cleanup for reply latency
- Chosen: After a successful workflow, pool sandbox pods linger for `linger_seconds` (default 300) with a `session-id` annotation. Replies within the window reclaim the pod via `try_reclaim()`, skipping pod creation and S3 archive restoration.
- Alternative: Always delete the pod immediately and restore from S3 on every reply.
- Why: Reply latency drops from seconds (pod claim + S3 restore) to near-zero. The S3 archive is still saved as a fallback if the pod is gone. The `reap-by` annotation ensures lingering pods are cleaned up if no reply arrives. The `session-id` annotation doubles as a concurrency guard: it is only present while lingering, preventing active pods from being reclaimed.
- If reversed: Every reply would pay full pod + restore latency, even for immediate follow-ups. User experience for conversational interactions would degrade noticeably.
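The annotation interplay can be sketched as two pure helpers. Annotation names match the document; the function names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def linger_annotations(session_id: str, linger_seconds: int = 300) -> dict:
    """Annotations written when a pod starts lingering: the session-id guard
    plus a fresh absolute reap deadline (sketch)."""
    reap_by = datetime.now(timezone.utc) + timedelta(seconds=linger_seconds)
    return {"session-id": session_id, "hummingbird/reap-by": reap_by.isoformat()}

def can_reclaim(annotations: dict, session_id: str) -> bool:
    """Reclaim succeeds only while lingering: the session-id annotation is
    present (cleared on claim) and matches the reply's session."""
    return annotations.get("session-id") == session_id
```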
linger() on Sandbox protocol over isinstance checks in runner
- Chosen: All sandbox backends implement `linger(session_id)`. Non-pool backends fall back to `cleanup()`. The runner calls `sb.linger()` without type checks.
- Alternative: `isinstance(sb, K8sPoolSandbox)` in `run_workflow()`.
- Why: Keeps the runner backend-agnostic. Adding a new backend requires only implementing the protocol, not touching the runner. The fallback behavior is co-located with each backend.
- If reversed: The runner would need to know about every backend type and their linger capabilities.
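The protocol shape can be sketched with `typing.Protocol`. The class and method bodies here are illustrative, not the real backends:

```python
from typing import Protocol

class Sandbox(Protocol):
    def cleanup(self) -> None: ...
    def linger(self, session_id: str) -> None: ...

class LocalSandbox:
    """Hypothetical non-pool backend: linger() falls back to cleanup()."""
    cleaned = False

    def cleanup(self) -> None:
        self.cleaned = True

    def linger(self, session_id: str) -> None:
        self.cleanup()   # nothing to reuse locally, so just tear down

def finish_workflow(sb: Sandbox, session_id: str) -> None:
    # The runner stays backend-agnostic: no isinstance checks needed.
    sb.linger(session_id)
```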
GCP SA key over Workload Identity Federation
- Chosen: GCP service account key stored in Vault, rotated via cki-tools credential manager (prepare/switch/clean cycle).
- Alternative: Workload Identity Federation (WIF) with projected SA token. Two sub-options: (a) automatic OIDC discovery if the cluster issuer is public, (b) manual JWKS upload for internal issuers.
- Why: mpp-prod’s OIDC issuer is `https://kubernetes.default.svc` (internal, not publicly reachable), so GCP STS cannot discover it automatically. Manual JWKS upload works but requires re-upload after SRE-triggered signing key rotations. The SA key integrates with the existing credential manager rotation infrastructure (same pattern as AWS keys and GitLab tokens), needs no OIDC reachability, and enables automated validate/update via `google-auth` in CI. The application code (`VertexAuth` / `google.auth.default()`) works identically with both approaches – switching to WIF later requires only infrastructure changes.
- If reversed: Replace the SA key with a WIF projected token + credential config ConfigMap. The `gcp_service_account_key` token type in cki-tools would no longer be needed for this use case.
Sliding-window breakpoints over per-component breakpoints (Claude)
- Chosen: Two breakpoints on messages (B1 at the previous write position, B2 at the latest message) that slide forward each turn.
- Alternative: Separate breakpoints on system prompt, tools, and messages (using 3-4 of the 4 available slots).
- Why: The Anthropic prefix hash is cumulative – it covers everything from the start of the request (tools, system, messages) up to the breakpoint. A single breakpoint on a message already caches the entire prefix. Separate breakpoints on earlier components would be redundant and waste slots. Two sliding breakpoints cover the full conversation history with cache reads on every turn after the first.
- If reversed: Three breakpoint slots wasted on content already covered by the message breakpoint. Only one slot left for the sliding window, making it impossible to have both a read (B1) and write (B2) breakpoint on messages.
Developer+ trust filtering over unfiltered or sanitized discussion comments
- Chosen: `gitlab_get_mr_discussions` checks each note author’s project access level (Developer+ / >= 30). Untrusted notes are replaced with a fixed placeholder in mixed-trust discussions, or the entire discussion is dropped if all notes are untrusted. Agent notes are always trusted (detected by session marker) but have transcripts and metadata stripped.
- Alternatives considered:
- (a) Include all comments unfiltered. Simplest, but any external contributor can craft comments that manipulate the model’s review output (prompt injection).
- (b) Sanitize/escape untrusted content (strip markdown, quote as code blocks, prefix with “[external]”). There is no reliable escaping mechanism for LLM prompts – the model interprets natural language regardless of formatting. Escaping gives a false sense of security.
- (c) Only include agent-authored notes (skip all human comments). Safe, but defeats the purpose: the model would never see developer responses to its own findings.
- (d) Include only discussions where the agent participated. Better, but still misses developer-initiated review threads that provide relevant context.
- Why: Developer+ is the same threshold used for pipeline trigger authorization (invariant #7) and slash command access. It matches the trust boundary already established: people who can push code and approve MRs are trusted to provide review context. The placeholder approach preserves discussion structure (the model sees that someone replied) without exposing the content. Full-drop for all-untrusted discussions avoids noise from discussions that contain zero useful context.
- If reversed: (a) opens a prompt injection vector on any project that accepts external MRs. (b) provides no actual protection. (c) makes follow-up reviews unable to see developer explanations, causing repeated false positives. (d) misses developer-initiated context.
Bot-author filtering for session marker parsing over unfiltered note scanning
- Chosen:
find_session_for_reply(),scan_agent_threads(), and thehandle_note()self-filter all resolve the bot user ID viaget_bot_user_id()(which callsgl.auth()on the orchestrator token) and only consider notes whereauthor.idmatches. Infind_session_for_reply(), the first matching marker wins. - Alternative: Parse markers from any note and keep the last match;
substring self-filter in
handle_note(previous behavior). - Why: The previous unfiltered approach allowed session hijacking: a
user reply or agent review prose containing an example marker
(
<!-- hummingbird-session: ... -->) was parsed as a real session, causing “session expired” errors in production. Forhandle_note, the substring check caused the reverse problem: a user embedding the marker prefix in a reply silently suppressed the event (denial of service). Thegl.auth()call is oneGET /userper invocation – negligible cost._get_client_with_bot_id()returns both the authenticated client and the bot user ID, avoiding redundant client construction. - If reversed: Any MR participant can inject a marker to hijack the session ID or inflate rate-limit counts. The agent’s own review prose containing example markers causes spurious “session expired” messages.
ijson streaming over buffered JSON for K8s list responses
- Chosen: `session.get(stream=True)` + `ijson.parse(resp.raw)` for `iter_paginated()`. Items are yielded one at a time via `ObjectBuilder`; the `metadata.continue` pagination token is captured in the same parse pass. `resp.raw.decode_content = True` is required because Kubearchive returns gzip Content-Encoding; without it `ijson` sees compressed bytes.
- Alternative: `session.get().json()` loads the full page into memory (the original implementation).
- Why: A single K8s list page can contain 500 PipelineRuns (80+ MB of JSON). Parsing this with `.json()` creates a +247 MB RSS spike (raw bytes, decoded string, and parsed dict coexist). With 4 concurrent workflows, this exceeds any reasonable pod memory limit. `ijson` stream-parsing yields items one at a time, reducing the peak to +12 MB for the same data – a 95% reduction. The continue token appears in the `metadata` object (before or after `items` depending on the server); the event-driven parser captures it regardless of order.
- If reversed: Large MRs (100+ components) would OOM the agent pod. Concurrent workflows would multiply the problem. The pod memory limit would need to scale with the largest possible page size, which is unbounded.