Hummingbird Agent Model Loop
How the Hummingbird agent communicates with the LLM (Gemini or Claude), how it calls tools, and how user replies integrate into the conversation.
For the architectural design rationale behind the loop (why full history replay, why iteration-based budgets, why nudge via system prompt), see Agent Design – Section 6. For operational usage documentation, see Hummingbird Agent.
What gets sent to the model
Every call to the model API sends three things:
1. systemInstruction
A single text string, rebuilt every iteration. Gemini sends it as
systemInstruction.parts[].text; Anthropic sends it as a top-level system
string. Composed of layers:
BASE_SYSTEM_PROMPT # agent.py: sandbox rules, tool usage tips
+ tool_notes # per-data-source notes from ToolRegistry
+ workflow_prompt # full content of e.g. workflows/analyze-failures.md
Near the iteration or context limit, a warning suffix is appended to the system prompt for that call. The suffix escalates through three levels:
- ITERATION_WARNING / CONTEXT_WARNING – soft “start wrapping up” at 80% of the iteration or context limit. Tools remain available.
- FINAL_TURN_WARNING – hard stop on the absolute last iteration or after the context limit is exceeded. Combined with tool-calling disabled for that request (toolConfig.functionCallingConfig.mode: NONE on Gemini, tool_choice.type: none on Anthropic) to mechanically prevent further tool calls.
- EMPTY_RESPONSE_NUDGE – appended when the model returns an empty response (no text, no tool calls). Tools remain available.
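The layering and escalation above can be sketched as a small helper. This is illustrative, not the real agent.py: the constant texts are placeholders, the 0-indexed `iteration` convention and default `context_limit` are assumptions.

```python
BASE_SYSTEM_PROMPT = "You are the Hummingbird agent. <sandbox rules, tool tips>\n"
ITERATION_WARNING = "\nNOTE: you are near the iteration limit. Start wrapping up.\n"
CONTEXT_WARNING = "\nNOTE: you are near the context limit. Start wrapping up.\n"
FINAL_TURN_WARNING = "\nFINAL TURN: tools are disabled. Write your final answer now.\n"
EMPTY_RESPONSE_NUDGE = "\nYour previous response was empty. Reply with text or a tool call.\n"

def build_system_prompt(tool_notes, workflow_prompt, iteration, max_iterations,
                        input_tokens=0, context_limit=1_000_000, nudge=False):
    # Rebuilt every iteration: base prompt + per-data-source notes + workflow.
    prompt = BASE_SYSTEM_PROMPT + tool_notes + workflow_prompt
    if iteration == max_iterations - 1 or input_tokens >= context_limit:
        prompt += FINAL_TURN_WARNING      # hard stop: paired with tools disabled
    elif iteration >= 0.8 * max_iterations:
        prompt += ITERATION_WARNING       # soft warning, tools stay available
    elif input_tokens >= 0.8 * context_limit:
        prompt += CONTEXT_WARNING
    if nudge:
        prompt += EMPTY_RESPONSE_NUDGE    # empty-response retry
    return prompt
```

The warnings live in the system prompt rather than the message list, so the conversation history stays clean across iterations.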
2. contents
A list of message dicts – the full conversation history. Grows every
iteration. Each entry has a role (user, model, or tool) and parts.
On Anthropic, tool results use role: "user" (not a separate role); the
adapter merges consecutive user messages so the wire shape matches what each
API expects.
{
"contents": [
{"role": "user", "parts": [{"text": "{\"project\":\"org/repo\",\"iid\":42,...}"}]},
{"role": "model", "parts": [{"functionCall": {"name": "gitlab_get_mr_details", "args": {...}}}]},
{"role": "tool", "parts": [{"functionResponse": {"name": "gitlab_get_mr_details", "response": {...}}}]},
...
]
}
3. tools
Tool definitions, static across all iterations. Gemini uses
functionDeclarations (JSON Schema in parametersJsonSchema); Anthropic uses
input_schema per tool.
{
"tools": [{"functionDeclarations": [
{"name": "sandbox_exec", "description": "...", "parametersJsonSchema": {...}},
{"name": "fetch_to_sandbox", ...},
{"name": "fetch_batch_to_sandbox", ...},
{"name": "gitlab_get_mr_details", ...},
{"name": "konflux_list_pipelineruns", ...}
]}]
}
4. toolConfig (conditional)
On the final turn (last iteration or after context limit exceeded), the request disables tools to mechanically prevent function calls. Gemini:
{
"toolConfig": {"functionCallingConfig": {"mode": "NONE"}}
}
Anthropic equivalent: tool_choice: {"type": "none"}.
This is only sent when FINAL_TURN_WARNING is active. All other iterations
omit this constraint, allowing the model to choose freely between text and tools.
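The conditional constraint can be sketched as a per-provider helper; the function name and shape are illustrative, the two wire formats are the ones quoted above.

```python
def tool_constraint(provider, final_turn):
    """Return the request fragment that disables tools on the final turn."""
    if not final_turn:
        return {}  # normal turns: model chooses freely between text and tools
    if provider == "gemini":
        return {"toolConfig": {"functionCallingConfig": {"mode": "NONE"}}}
    if provider == "anthropic":
        return {"tool_choice": {"type": "none"}}
    raise ValueError(f"unknown provider: {provider}")
```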
The agent loop, turn by turn
The loop in run_agent_loop runs up to max_iterations times. Each
iteration is one round-trip to the model.
Initialization
system_prompt = BASE_SYSTEM_PROMPT + tool_notes + workflow_prompt
tool_defs = [sandbox_exec, fetch_to_sandbox, fetch_batch_to_sandbox, <data sources...>]
contents = [user: {"project": "org/repo", "iid": 42, "sha": "abc...", "session_id": "uuid"}]
Iteration 1
-> Send: system_instruction + contents (1 item) + tool_defs
<- Response: functionCall(gitlab_get_mr_details, {project: "org/repo", iid: 42})
contents.append(model: response.raw_content)
execute tool -> result = {"result": "{\"title\": \"Fix auth\",...}"}
contents.append(tool: model.make_tool_responses([(name, result)]))
Iteration 2
-> Send: system_instruction + contents (3 items) + tool_defs
<- Response: functionCall(fetch_batch_to_sandbox, {requests: [...]})
contents.append(model: ...)
execute tool -> result = {"results": [{"saved_to": "/tmp/data/pipelineruns.json", "bytes": 85432}]}
contents.append(tool: ...)
Iteration 3
-> Send: system_instruction + contents (5 items) + tool_defs
<- Response: functionCall(sandbox_exec, {command: "jq '[...]' /tmp/data/pipelineruns.json"})
contents.append(model: ...)
execute tool -> result = {"exit_code": 0, "stdout": "...(preview)...", "stdout_file": "/tmp/_out/0.txt"}
contents.append(tool: ...)
Iteration N (final)
-> Send: system_instruction + contents (2N-1 items) + tool_defs
<- Response: text("## Konflux Failure Analysis\n\n...") // text + no tool_calls = DONE
contents.append(model: {text: "## Konflux Failure Analysis..."})
BREAK -- return (text, usage, transcript, contents)
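The walkthrough above can be condensed into a minimal, self-contained sketch of the loop shape. The `Response` class and the scripted callables are stand-ins; the real run_agent_loop also handles retries, budget warnings, and empty-response nudges.

```python
class Response:
    """Stand-in for a model response: text, tool calls, and the raw turn."""
    def __init__(self, text=None, tool_calls=()):
        self.text = text
        self.tool_calls = list(tool_calls)  # list of (name, args)
        parts = ([{"text": text}] if text else []) + [
            {"functionCall": {"name": n, "args": a}} for n, a in self.tool_calls]
        self.raw_content = {"role": "model", "parts": parts}

def run_agent_loop(generate, execute_tool, make_tool_responses,
                   contents, max_iterations):
    last_text = None
    for _ in range(max_iterations):
        resp = generate(contents)
        contents.append(resp.raw_content)          # model turn, verbatim
        if resp.text and not resp.tool_calls:
            return resp.text, contents             # text only: done
        if resp.text:
            last_text = resp.text                  # fallback if the loop ends badly
        results = [(n, execute_tool(n, a)) for n, a in resp.tool_calls]
        contents.append(make_tool_responses(results))
    return last_text or "[Agent did not produce a final response]", contents

# Scripted two-iteration run: one tool call, then a final text.
script = iter([
    Response(tool_calls=[("gitlab_get_mr_details", {"project": "org/repo", "iid": 42})]),
    Response(text="## Konflux Failure Analysis"),
])
final, history = run_agent_loop(
    generate=lambda contents: next(script),
    execute_tool=lambda name, args: {"result": "{...}"},
    make_tool_responses=lambda results: {"role": "tool", "parts": [
        {"functionResponse": {"name": n, "response": r}} for n, r in results]},
    contents=[{"role": "user", "parts": [{"text": '{"project":"org/repo","iid":42}'}]}],
    max_iterations=10,
)
```

Note that `contents` grows by two entries per tool-calling iteration (model turn + tool turn), which is why the Nth iteration sends 2N-1 items.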
Response handling and termination
Each iteration, the model’s response can contain text, tool calls, both, or neither. The agent handles each case:
| Response | Action |
|---|---|
| Text only | Final response. Save text, break. (Happy path.) |
| Text + tool calls | Save text as last_text fallback, execute the tool calls, continue loop. The model is thinking aloud while also acting. |
| Tool calls only | Execute the tool calls, continue loop. |
| Neither | Empty response – retry up to MAX_EMPTY_RETRIES (2) times. Each retry nudges the model via EMPTY_RESPONSE_NUDGE appended to the system prompt (not injected as a user message, to keep the contents list clean). Tools remain available during nudge retries. |
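The four cases in the table reduce to a small decision function; the action names here are illustrative labels, only MAX_EMPTY_RETRIES comes from the document.

```python
MAX_EMPTY_RETRIES = 2

def handle_response(text, tool_calls, empty_retries):
    if tool_calls:
        return "execute_and_continue"   # interim text, if any, saved as last_text
    if text:
        return "done"                   # text only: final response
    if empty_retries < MAX_EMPTY_RETRIES:
        return "retry_with_nudge"       # EMPTY_RESPONSE_NUDGE via system prompt
    return "give_up"
```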
The text + tool calls case is worth noting: the model’s text is not
returned to the user immediately. It is stored as last_text (a fallback in
case the loop terminates later without a clean final text, e.g. by hitting max
iterations). The raw_content dict – which includes both the text and
functionCall parts – is appended to contents as a single model turn:
{"role": "model", "parts": [
{"text": "Let me check the clair-scan logs..."},
{"functionCall": {"name": "sandbox_exec", "args": {"command": "grep timeout ..."}}}
]}
The loop terminates on:
- Text only – the model produced a final response.
- Max iterations reached – the final iteration uses FINAL_TURN_WARNING + toolConfig NONE to force text output. Falls back to whatever last_text was seen, or "[Agent did not produce a final response]".
- Context limit exceeded – input_tokens >= CONTEXT_LIMIT sets context_exceeded, making the next iteration final (same as #2). A soft warning (CONTEXT_WARNING) fires earlier at 80% of the limit.
- Fatal model error – HTTP error, timeout, or connection error after retries are exhausted. Transport errors (timeouts, connection failures) are wrapped as retryable ModelError and retried with exponential backoff alongside HTTP 5xx/429.
- Empty responses exhausted – MAX_EMPTY_RETRIES (2) nudge retries failed.
- Unexpected error – any exception not handled by the retry mechanism (e.g. a malformed API response) breaks the loop and returns partial results. Accumulated contents, transcript, and sandbox are preserved and the session is saved normally.
Budget escalation
Both iterations and context size use the same two-tier pattern:
| | Soft warning (80%) | Hard stop (100%) |
|---|---|---|
| Iterations | ITERATION_WARNING at 80% of max_iterations | FINAL_TURN_WARNING on last iteration |
| Context | CONTEXT_WARNING at 80% of CONTEXT_LIMIT | FINAL_TURN_WARNING on next iteration after exceeding CONTEXT_LIMIT |
Soft warnings are advisory (“start wrapping up”) and keep tools available.
The hard stop uses toolConfig.mode: NONE to mechanically prevent further
tool calls, ensuring the model produces text.
The contents list in detail
Each entry in contents follows the provider’s native wire format during execution.
Canonical format (persistence)
During the agent loop, contents use the provider’s native format. On save,
to_canonical() converts them to OpenAI Chat Completions-style messages. On
load, from_canonical() converts back to the current provider’s native
format. That enables cross-provider session resumption.
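For plain text turns, the round-trip can be sketched as below. This is an assumption-laden simplification: the real to_canonical()/from_canonical() also map functionCall/functionResponse parts to Chat Completions tool_calls/tool messages, which this sketch omits.

```python
# Gemini-style role names on one side, OpenAI Chat Completions on the other.
ROLE_TO_CANON = {"model": "assistant", "user": "user"}
ROLE_FROM_CANON = {v: k for k, v in ROLE_TO_CANON.items()}

def to_canonical(entry):
    """Native text turn -> canonical message (persisted form)."""
    return {"role": ROLE_TO_CANON[entry["role"]],
            "content": "".join(p.get("text", "") for p in entry["parts"])}

def from_canonical(msg):
    """Canonical message -> active provider's native text turn."""
    return {"role": ROLE_FROM_CANON[msg["role"]],
            "parts": [{"text": msg["content"]}]}
```

Because the on-disk form is provider-neutral, a session saved while running Gemini can be reloaded into Claude's native shape, and vice versa.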
User turn
{"role": "user", "parts": [{"text": "..."}]}
Created by model.make_user_content(text). In a cold start there is exactly
one user turn at the start: the JSON event. No additional user turns appear
during a normal run – empty-response nudges are delivered via the system
prompt, not as user messages.
Model turn
{"role": "model", "parts": [
{"functionCall": {"name": "sandbox_exec", "args": {"command": "jq ..."}}}
]}
Or for the final response:
{"role": "model", "parts": [{"text": "## Konflux Failure Analysis..."}]}
This is response.raw_content – the exact dict from the model response
candidate, appended verbatim. Can contain text, tool calls, or both. Parallel
tool calls appear as multiple functionCall parts in one model turn.
Tool turn
{"role": "tool", "parts": [
{"functionResponse": {"name": "sandbox_exec", "response": {"exit_code": 0, "stdout": "..."}}}
]}
Created by model.make_tool_responses(results). One functionResponse part
per tool call in the preceding model turn. Tool results are always JSON dicts;
large outputs are spilled to sandbox files and only a preview is included.
Typical 8-iteration conversation shape
contents[0] = user: {"project":"org/repo","iid":42,...} # initial event
contents[1] = model: functionCall(gitlab_get_mr_details) # iter 1
contents[2] = tool: functionResponse(gitlab_get_mr_details) # iter 1
contents[3] = model: functionCall(fetch_batch_to_sandbox) # iter 2
contents[4] = tool: functionResponse(fetch_batch_to_sandbox) # iter 2
contents[5] = model: functionCall(sandbox_exec) # iter 3
contents[6] = tool: functionResponse(sandbox_exec) # iter 3
...
contents[13] = model: functionCall(sandbox_exec) # iter 7
contents[14] = tool: functionResponse(sandbox_exec) # iter 7
contents[15] = model: text("## Konflux Failure Analysis...") # iter 8 (final)
This entire list is persisted as context.json in the session (see Canonical format
(persistence) above). It is the full state needed to resume a conversation.
How tools are called
When the model’s response contains functionCall parts, _execute_tool_calls
iterates them sequentially:
tool_registry.execute(ToolCall(name, args))
match name:
"sandbox_exec" -> runs sh -c in container -> {exit_code, stdout, stderr}
"fetch_to_sandbox" -> calls data source func -> writes to sandbox file -> {saved_to, bytes}
"fetch_batch_to_sandbox" -> multiple fetch_to_sandbox in one call
any data source name -> calls func directly -> returns inline or auto-spills large output
Large output handling
When stdout or a data source response exceeds 4KB, it is automatically saved
to a sandbox file (/tmp/_out/N.txt) and only a preview (head + tail) is
returned to the model. The model gets the file path and can use sandbox_exec
with jq/grep/head to process it. This keeps the context window
manageable.
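A sketch of the spill behavior, assuming the 4 KB threshold and a head+tail preview; the preview size and function name are illustrative, and the real implementation writes the file inside the sandbox container.

```python
SPILL_THRESHOLD = 4096   # bytes, per the document
PREVIEW_CHARS = 512      # assumed head/tail size

def spill_if_large(output, out_dir, counter):
    if len(output) <= SPILL_THRESHOLD:
        return {"stdout": output}          # small enough: return inline
    path = f"{out_dir}/{counter}.txt"
    # (the real code writes `output` to `path` inside the sandbox here)
    preview = output[:PREVIEW_CHARS] + "\n...[truncated]...\n" + output[-PREVIEW_CHARS:]
    return {"stdout": preview, "stdout_file": path}
```

The model then sees the file path in the tool result and can follow up with sandbox_exec (jq/grep/head) against it.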
How a user reply integrates (session resumption)
When a user replies to an agent note on GitLab, the orchestrator resumes the conversation by restoring the previous state and appending the reply.
What gets restored from S3
- context.json – the persisted conversation in canonical (OpenAI Chat Completions-style) form; from_canonical() maps it to the active provider’s native contents before the loop runs
- sandbox.tar.gz – /tmp/_out/ files (PipelineRun JSONs, test logs, jq output, etc.) restored into the new sandbox
Because load uses from_canonical(), you can switch models across providers
(e.g. Gemini to Claude or the reverse) when resuming, as long as the session
was saved with canonical contents.
The resumed contents list
// Restored from context.json (previous run)
contents[0] = user: {"project":"org/repo","iid":42,...} # original event
contents[1] = model: functionCall(gitlab_get_mr_details) # iter 1
contents[2] = tool: functionResponse(gitlab_get_mr_details) # iter 1
... # all prior turns
contents[15] = model: text("## Konflux Failure Analysis...") # previous final text
// NEW: user reply appended
contents[16] = user: "Can you look at the clair-scan timeout more closely?"
The agent loop continues
ITERATION 1 (resumed):
-> Send: system_instruction (+ CONTINUATION_PROMPT) + contents[0..16] + tool_defs
<- Response: functionCall(sandbox_exec, {command: "grep -i timeout /tmp/data/..."})
^ using restored sandbox files
contents[17] = model: functionCall(sandbox_exec)
contents[18] = tool: functionResponse(sandbox_exec)
ITERATION 2 (resumed):
<- Response: text("The clair-scan timeout is caused by...")
contents[19] = model: text("The clair-scan timeout is caused by...")
BREAK
Cold start vs resumed – key differences
| Aspect | Cold start | Resumed |
|---|---|---|
| System prompt | BASE + tool_notes + workflow | BASE + tool_notes + workflow + CONTINUATION_PROMPT |
| First user message | {"project":..., "iid":...} | {"project":..., "iid":...} (restored) |
| Conversation history | Empty (just the event) | Full prior conversation |
| Latest user message | – | "Can you look at the clair-scan timeout?" |
| Sandbox files | Empty | Restored from archive |
| Tool definitions | Same | Same |
| Session format | Native (per provider) in memory | Canonical on disk; native after from_canonical() |
The CONTINUATION_PROMPT
Without this, the workflow prompt (e.g. analyze-failures.md) tells the model
to follow a rigid workflow: Data.1, Data.2, Data.3, Analysis.1… The model
might try to re-run the entire analysis. The continuation prompt overrides
this:
## Continuation
This is a follow-up to a previous conversation. The conversation history
contains your prior analysis and tool calls. The user is replying with a
question or request about your previous work.
**Do NOT re-run the full workflow from scratch.** Instead:
- Respond directly to the user's question
- Use your tools to investigate further if needed (logs, data are still
available in the sandbox)
- Reference your previous findings where relevant
- Keep your response focused on what the user asked
The model now understands: “I already did the analysis (it’s all in the conversation history). The user has a specific question. Let me answer it.”
Iteration and token budget for resumed sessions
The resumed session gets a full fresh iteration budget. run_workflow
sets max_iterations from the workflow config (e.g. 50 for
analyze-failures), and the loop counter starts at 0 regardless of how many
iterations the previous session used. The model can make as many tool calls
as it needs to answer the follow-up.
The practical constraint is the context window, not iterations. The restored contents can be large – a typical 8-iteration cold start uses ~50-100K input tokens. Since the model re-counts the full history every iteration, the resumed session starts at roughly the token count where the previous session ended, plus the new user message. Each new tool call adds more. Token counts logged by the agent are billed tokens (cumulative across API calls; each call re-sends the full history, so these overlap).
The existing CONTEXT_LIMIT check still applies: at 80% of the limit, a
soft CONTEXT_WARNING tells the model to start wrapping up. If the limit is
exceeded, the next iteration becomes final (FINAL_TURN_WARNING +
toolConfig NONE).
For the typical case (user asks one focused follow-up, agent does 1-3 tool calls), this is fine. Multi-turn deep dives will eventually hit the context limit, at which point the agent wraps up – and the user can start a fresh conversation if needed.
Prompt caching
Since the agent replays the full conversation history on every API call, prompt caching significantly reduces the cost of repeated prefixes. Both providers support caching, but with different mechanisms.
Gemini (implicit)
Gemini’s Vertex AI backend automatically caches repeated request prefixes
server-side. No request-side annotations are needed. The agent reads
cachedContentTokenCount from usageMetadata in each response and reports
it as cache_read_tokens. There is no cache_creation_tokens for Gemini –
implicit caching has no write surcharge, only a read discount (25% of the
base input rate).
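Extracting the cache stats is a straightforward read of usageMetadata; the helper below is a sketch (the field names follow the document and the Gemini API, the stats dict shape is an assumption).

```python
def gemini_cache_stats(response):
    """Pull token counts from a Gemini response dict's usageMetadata."""
    usage = response.get("usageMetadata", {})
    return {
        "input_tokens": usage.get("promptTokenCount", 0),
        "cache_read_tokens": usage.get("cachedContentTokenCount", 0),
        # no cache_creation_tokens: implicit caching has no write surcharge
    }
```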
Claude (explicit breakpoints)
Vertex AI does not support Anthropic’s automatic caching. The agent places
explicit cache_control: {"type": "ephemeral"} annotations on message
content blocks. The API hashes the cumulative prefix – tools, system prompt,
and all messages from the start of the request up to the annotated block –
and caches the result. A breakpoint on a message therefore covers everything
before it; separate breakpoints on the system prompt or tools would be
redundant.
On the wire, an annotated message looks like this:
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the sandbox.", "cache_control": {"type": "ephemeral"}}
]
}
For messages with multiple content blocks (e.g. tool_result arrays), the
cache_control is placed on the last block in the array.
Sliding-window breakpoints
The agent uses two of the four available breakpoint slots per request. A sliding pair of breakpoints (B1, B2) ensures the full conversation prefix is cached and only the newest turn is processed at full price:
- B2 is placed on the last message (writes the current prefix to cache)
- B1 is placed where B2 was on the previous call (reads the prior prefix from cache)
Call 1: [msg0 B2]
B2: WRITE entire prefix to cache
Call 2: [msg0 B1] [msg1] [msg2 B2]
B1: READ (matches call 1's B2 -- same position, same prefix)
B2: WRITE (extends cache to include new messages)
Call 3: [msg0] [msg1] [msg2 B1] [msg3] [msg4 B2]
B1: READ (matches call 2's B2)
B2: WRITE
Call N: ... [msg(N-2) B1] [msg(N-1)] [msg(N) B2]
B1: READ (matches call N-1's B2)
B2: WRITE
The cache_control metadata is not part of the prefix hash. Moving B1 to a
position that previously had B2 (and no longer has cache_control) does not
invalidate the cache – the hash matches because the content is identical.
The cache has a 5-minute TTL, refreshed on each hit. Cache writes cost 1.25x the base input rate (25% surcharge); cache reads cost 0.10x (90% discount). The minimum cacheable prefix length is 1024 tokens for Sonnet/Opus and 4096 tokens for Haiku – the system prompt alone exceeds these thresholds.
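The sliding placement can be sketched as a function that annotates Anthropic-style messages in place. The helper name and the strip-then-reannotate approach are assumptions; the B1-at-previous-B2, B2-at-last-message scheme is the one described above.

```python
def place_breakpoints(messages, prev_b2):
    """Annotate B1 (read) and B2 (write); returns the new B2 index."""
    for msg in messages:                   # cache_control is per-call metadata,
        for block in msg["content"]:       # so clear any stale annotations first
            block.pop("cache_control", None)
    if prev_b2 is not None and prev_b2 < len(messages) - 1:
        # B1: same position as last call's B2 -> prefix hash matches -> cache read
        messages[prev_b2]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    b2 = len(messages) - 1
    # B2: last message, always on the last content block -> cache write
    messages[b2]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return b2
```

On the first call there is no B1 (nothing cached yet), matching the resumption behavior described below: one write, no read.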
Ephemeral turns
When the agent loop injects a transient message (iteration warning, context
warning, empty-response nudge), the generate() call receives
ephemeral=True. The transient message is appended as a user message,
merged with the preceding tool-result user message by
_merge_consecutive_user, and popped from contents after the call.
- B1 is still placed at the previous B2 position, so the cache read for the stable prefix still works
- B2 is placed on the second-to-last merged message (the last stable message before the merged ephemeral tail), so the cached prefix advances without including transient content
- The internal B2 position advances to the second-to-last index, so successive ephemeral turns keep sliding B1 forward
This ensures the B1=previous-B2 invariant holds through ephemeral turns:
T1: U1·B2 prev=0
T2: U1·B1 M1 U2·B2 prev=2
T3(E): U1 M1 U2·B1 M2·B2 U3+E prev=3
T4(E): U1 M1 U2 M2·B1 U3 M3·B2 U4+E prev=5
T5: U1 M1 U2 M2 U3 M3·B1 U4 M4 U5·B2 prev=8
(U = user message, M = model message, E = ephemeral, +E = merged with preceding user message.)
When the merged message list has fewer than two entries during an ephemeral call (e.g. a single merged user+ephemeral on the first turn), no breakpoints are placed and the B2 position is not updated.
Thinking blocks
Extended thinking blocks in assistant responses are part of the cached
prefix. They do not break cache hits when the following user message contains
only tool_result blocks (the normal case in the agent loop).
Caching and session resumption
After session resumption, the model instance is fresh and has no record of previous breakpoint positions. The first call has no B1 (no cache read), only B2 (cache write). From call 2 onward, caching works normally. This is the same behavior as a cold start.