Hummingbird Agent Model Loop

How the Hummingbird agent communicates with the LLM (Gemini or Claude), how it calls tools, and how user replies integrate into the conversation.

For the architectural design rationale behind the loop (why full history replay, why iteration-based budgets, why nudge via system prompt), see Agent Design – Section 6. For operational usage documentation, see Hummingbird Agent.

What gets sent to the model

Every call to the model API sends three things, plus a conditional fourth (toolConfig) on the final turn:

1. systemInstruction

A single text string, rebuilt every iteration. Gemini sends it as systemInstruction.parts[].text; Anthropic sends it as a top-level system string. Composed of layers:

BASE_SYSTEM_PROMPT            # agent.py: sandbox rules, tool usage tips
+ tool_notes                  # per-data-source notes from ToolRegistry
+ workflow_prompt             # full content of e.g. workflows/analyze-failures.md

Depending on the loop state, a warning suffix may be appended to the system prompt for that call. Three suffixes exist:

  • ITERATION_WARNING / CONTEXT_WARNING – soft “start wrapping up” at 80% of the iteration or context limit. Tools remain available.
  • FINAL_TURN_WARNING – hard stop on the absolute last iteration or after the context limit is exceeded. Combined with tool-calling disabled for that request (toolConfig.functionCallingConfig.mode: NONE on Gemini, tool_choice.type: none on Anthropic) to mechanically prevent further tool calls.
  • EMPTY_RESPONSE_NUDGE – appended when the model returns an empty response (no text, no tool calls). Tools remain available.
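
A minimal sketch of how the per-call suffix might be chosen (the helper name, placeholder texts, and threshold arithmetic are illustrative; only the constant names come from the description above):

# placeholder texts; the real constants live in agent.py
ITERATION_WARNING = "You are nearing the iteration limit. Start wrapping up."
CONTEXT_WARNING = "You are nearing the context limit. Start wrapping up."
FINAL_TURN_WARNING = "This is your final turn. Produce your final answer now."
EMPTY_RESPONSE_NUDGE = "Your previous response was empty. Respond with text or a tool call."

def apply_warning_suffix(base_prompt: str, iteration: int, max_iterations: int,
                         input_tokens: int, context_limit: int,
                         final_turn: bool, empty_retry: bool) -> str:
    """Append at most one escalation suffix to the per-call system prompt (sketch)."""
    if final_turn:                                            # last iteration or context limit exceeded;
        return base_prompt + "\n\n" + FINAL_TURN_WARNING      # tools are also disabled for this call
    if empty_retry:
        return base_prompt + "\n\n" + EMPTY_RESPONSE_NUDGE    # tools remain available
    if input_tokens >= 0.8 * context_limit:
        return base_prompt + "\n\n" + CONTEXT_WARNING         # soft "start wrapping up"
    if iteration >= 0.8 * max_iterations:
        return base_prompt + "\n\n" + ITERATION_WARNING
    return base_prompt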

2. contents

A list of message dicts – the full conversation history. Grows every iteration. Each entry has a role (user, model, or tool) and parts. On Anthropic, tool results use role: "user" (not a separate role); the adapter merges consecutive user messages so the wire shape matches what each API expects.

{
  "contents": [
    {"role": "user",  "parts": [{"text": "{\"project\":\"org/repo\",\"iid\":42,...}"}]},
    {"role": "model", "parts": [{"functionCall": {"name": "gitlab_get_mr_details", "args": {...}}}]},
    {"role": "tool",  "parts": [{"functionResponse": {"name": "gitlab_get_mr_details", "response": {...}}}]},
    ...
  ]
}
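
For comparison, the same exchange in Anthropic's wire shape (illustrative; the tool_use id is made up): tool calls appear as content blocks in an assistant message, and results come back as tool_result blocks inside a user message:

{
  "messages": [
    {"role": "user", "content": "{\"project\":\"org/repo\",\"iid\":42,...}"},
    {"role": "assistant", "content": [
      {"type": "tool_use", "id": "toolu_01", "name": "gitlab_get_mr_details", "input": {...}}
    ]},
    {"role": "user", "content": [
      {"type": "tool_result", "tool_use_id": "toolu_01", "content": "{...}"}
    ]},
    ...
  ]
}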

3. tools

Tool definitions, static across all iterations. Gemini uses functionDeclarations (JSON Schema in parametersJsonSchema); Anthropic uses input_schema per tool.

{
  "tools": [{"functionDeclarations": [
    {"name": "sandbox_exec", "description": "...", "parametersJsonSchema": {...}},
    {"name": "fetch_to_sandbox", ...},
    {"name": "fetch_batch_to_sandbox", ...},
    {"name": "gitlab_get_mr_details", ...},
    {"name": "konflux_list_pipelineruns", ...}
  ]}]
}
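
The Anthropic equivalent carries the same definitions with input_schema per tool (shape shown for one entry):

{
  "tools": [
    {"name": "sandbox_exec", "description": "...", "input_schema": {...}},
    ...
  ]
}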

4. toolConfig (conditional)

On the final turn (last iteration or after context limit exceeded), the request disables tools to mechanically prevent function calls. Gemini:

{
  "toolConfig": {"functionCallingConfig": {"mode": "NONE"}}
}

Anthropic equivalent: tool_choice: {"type": "none"}.

This is only sent when FINAL_TURN_WARNING is active. All other iterations omit this constraint, allowing the model to choose freely between text and tools.

The agent loop, turn by turn

The loop in run_agent_loop runs up to max_iterations times. Each iteration is one round-trip to the model.

Initialization

system_prompt = BASE_SYSTEM_PROMPT + tool_notes + workflow_prompt
tool_defs     = [sandbox_exec, fetch_to_sandbox, fetch_batch_to_sandbox, <data sources...>]
contents      = [user: {"project": "org/repo", "iid": 42, "sha": "abc...", "session_id": "uuid"}]

Iteration 1

-> Send: system_instruction + contents (1 item) + tool_defs
<- Response: functionCall(gitlab_get_mr_details, {project: "org/repo", iid: 42})

contents.append(model: response.raw_content)
  execute tool -> result = {"result": "{\"title\": \"Fix auth\",...}"}
contents.append(tool: model.make_tool_responses([(name, result)]))

Iteration 2

-> Send: system_instruction + contents (3 items) + tool_defs
<- Response: functionCall(fetch_batch_to_sandbox, {requests: [...]})

contents.append(model: ...)
  execute tool -> result = {"results": [{"saved_to": "/tmp/data/pipelineruns.json", "bytes": 85432}]}
contents.append(tool: ...)

Iteration 3

-> Send: system_instruction + contents (5 items) + tool_defs
<- Response: functionCall(sandbox_exec, {command: "jq '[...]' /tmp/data/pipelineruns.json"})

contents.append(model: ...)
  execute tool -> result = {"exit_code": 0, "stdout": "...(preview)...", "stdout_file": "/tmp/_out/0.txt"}
contents.append(tool: ...)

Iteration N (final)

-> Send: system_instruction + contents (2N-1 items) + tool_defs
<- Response: text("## Konflux Failure Analysis\n\n...")    // text + no tool_calls = DONE

contents.append(model: {text: "## Konflux Failure Analysis..."})
BREAK -- return (text, usage, transcript, contents)

Response handling and termination

Each iteration, the model’s response can contain text, tool calls, both, or neither. The agent handles each case:

  • Text only – final response. Save the text, break. (Happy path.)
  • Text + tool calls – save the text as a last_text fallback, execute the tool calls, continue the loop. The model is thinking aloud while also acting.
  • Tool calls only – execute the tool calls, continue the loop.
  • Neither – empty response; retry up to MAX_EMPTY_RETRIES (2) times. Each retry nudges the model via EMPTY_RESPONSE_NUDGE appended to the system prompt (not injected as a user message, to keep the contents list clean). Tools remain available during nudge retries.
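
A minimal sketch of the loop and this dispatch, assuming illustrative helper names (generate, make_tool_responses, and raw_content are mentioned elsewhere on this page; the allow_tools flag is an assumption, and retries, warnings, and budget checks are omitted):

def run_agent_loop(model, tool_registry, system_prompt, tool_defs, contents, max_iterations):
    """Sketch of the request/response cycle; error handling and budgets omitted."""
    last_text = None
    for iteration in range(max_iterations):
        final_turn = iteration == max_iterations - 1
        response = model.generate(
            system_instruction=system_prompt,     # warning suffixes are appended here when needed
            contents=contents,
            tools=tool_defs,
            allow_tools=not final_turn,           # toolConfig NONE / tool_choice none on the final turn
        )
        contents.append(response.raw_content)     # model turn, appended verbatim

        if response.text and not response.tool_calls:
            return response.text, contents        # text only: final response
        if response.text:
            last_text = response.text             # thinking aloud; keep as fallback
        if response.tool_calls:
            results = [(c.name, tool_registry.execute(c)) for c in response.tool_calls]
            contents.append(model.make_tool_responses(results))
        # neither text nor tool calls: nudge via EMPTY_RESPONSE_NUDGE and retry (not shown)

    return last_text or "[Agent did not produce a final response]", contents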

The text + tool calls case is worth noting: the model’s text is not returned to the user immediately. It is stored as last_text (a fallback in case the loop terminates later without a clean final text, e.g. by hitting max iterations). The raw_content dict – which includes both the text and functionCall parts – is appended to contents as a single model turn:

{"role": "model", "parts": [
    {"text": "Let me check the clair-scan logs..."},
    {"functionCall": {"name": "sandbox_exec", "args": {"command": "grep timeout ..."}}}
]}

The loop terminates on:

  1. Text only – the model produced a final response.
  2. Max iterations reached – the final iteration uses FINAL_TURN_WARNING + toolConfig NONE to force text output. Falls back to whatever last_text was seen, or "[Agent did not produce a final response]".
  3. Context limit exceeded – input_tokens >= CONTEXT_LIMIT sets context_exceeded, making the next iteration final (same as #2). A soft warning (CONTEXT_WARNING) fires earlier at 80% of the limit.
  4. Fatal model error – HTTP error, timeout, or connection error after retries exhausted. Transport errors (timeouts, connection failures) are wrapped as retryable ModelError and retried with exponential backoff alongside HTTP 5xx/429.
  5. Empty responses exhausted – MAX_EMPTY_RETRIES (2) nudge retries failed.
  6. Unexpected error – any exception not handled by the retry mechanism (e.g. malformed API response) breaks the loop and returns partial results. Accumulated contents, transcript, and sandbox are preserved and the session is saved normally.

Budget escalation

Both iterations and context size use the same two-tier pattern:

  • Iterations – soft: ITERATION_WARNING at 80% of max_iterations; hard: FINAL_TURN_WARNING on the last iteration.
  • Context – soft: CONTEXT_WARNING at 80% of CONTEXT_LIMIT; hard: FINAL_TURN_WARNING on the next iteration after CONTEXT_LIMIT is exceeded.

Soft warnings are advisory (“start wrapping up”) and keep tools available. The hard stop uses toolConfig.mode: NONE to mechanically prevent further tool calls, ensuring the model produces text.

The contents list in detail

Each entry in contents follows the provider’s native wire format during execution.

Canonical format (persistence)

During the agent loop, contents use the provider’s native format. On save, to_canonical() converts them to OpenAI Chat Completions-style messages. On load, from_canonical() converts back to the current provider’s native format. That enables cross-provider session resumption.
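
As an illustration of what the conversion might look like for one Gemini-native model turn (the field mapping is a sketch; the real to_canonical() also covers text-only turns, tool turns, and user turns):

import json

def model_turn_to_canonical(turn: dict) -> dict:
    """Map one Gemini-native model turn to an OpenAI Chat Completions-style message (sketch)."""
    msg = {"role": "assistant", "content": None, "tool_calls": []}
    for part in turn["parts"]:
        if "text" in part:
            msg["content"] = part["text"]
        elif "functionCall" in part:
            call = part["functionCall"]
            msg["tool_calls"].append({
                "type": "function",
                "function": {"name": call["name"], "arguments": json.dumps(call["args"])},
            })
    if not msg["tool_calls"]:
        del msg["tool_calls"]
    return msg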

User turn

{"role": "user", "parts": [{"text": "..."}]}

Created by model.make_user_content(text). In a cold start there is exactly one user turn at the start: the JSON event. No additional user turns appear during a normal run – empty-response nudges are delivered via the system prompt, not as user messages.

Model turn

{"role": "model", "parts": [
    {"functionCall": {"name": "sandbox_exec", "args": {"command": "jq ..."}}}
]}

Or for the final response:

{"role": "model", "parts": [{"text": "## Konflux Failure Analysis..."}]}

This is response.raw_content – the exact dict from the model response candidate, appended verbatim. Can contain text, tool calls, or both. Parallel tool calls appear as multiple functionCall parts in one model turn.

Tool turn

{"role": "tool", "parts": [
    {"functionResponse": {"name": "sandbox_exec", "response": {"exit_code": 0, "stdout": "..."}}}
]}

Created by model.make_tool_responses(results). One functionResponse part per tool call in the preceding model turn. Tool results are always JSON dicts; large outputs are spilled to sandbox files and only a preview is included.

Typical 8-iteration conversation shape

contents[0]  = user:  {"project":"org/repo","iid":42,...}          # initial event
contents[1]  = model: functionCall(gitlab_get_mr_details)          # iter 1
contents[2]  = tool:  functionResponse(gitlab_get_mr_details)      # iter 1
contents[3]  = model: functionCall(fetch_batch_to_sandbox)         # iter 2
contents[4]  = tool:  functionResponse(fetch_batch_to_sandbox)     # iter 2
contents[5]  = model: functionCall(sandbox_exec)                   # iter 3
contents[6]  = tool:  functionResponse(sandbox_exec)               # iter 3
...
contents[13] = model: functionCall(sandbox_exec)                   # iter 7
contents[14] = tool:  functionResponse(sandbox_exec)               # iter 7
contents[15] = model: text("## Konflux Failure Analysis...")       # iter 8 (final)

This entire list is persisted as context.json in the session (see Canonical format (persistence) above). It is the full state needed to resume a conversation.

How tools are called

When the model’s response contains functionCall parts, _execute_tool_calls iterates them sequentially:

tool_registry.execute(ToolCall(name, args))
  match name:
     "sandbox_exec"           -> runs sh -c in container -> {exit_code, stdout, stderr}
     "fetch_to_sandbox"       -> calls data source func -> writes to sandbox file -> {saved_to, bytes}
     "fetch_batch_to_sandbox" -> multiple fetch_to_sandbox in one call
     any data source name     -> calls func directly -> returns inline or auto-spills large output

Large output handling

When stdout or a data source response exceeds 4KB, it is automatically saved to a sandbox file (/tmp/_out/N.txt) and only a preview (head + tail) is returned to the model. The model gets the file path and can use sandbox_exec with jq/grep/head to process it. This keeps the context window manageable.
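
A sketch of that spill rule (the 4KB threshold comes from the text above; the file naming, preview sizes, and result keys are illustrative):

import os

SPILL_THRESHOLD = 4 * 1024   # bytes
PREVIEW_HEAD = 2000          # illustrative preview sizes
PREVIEW_TAIL = 1000

def maybe_spill(output: str, out_dir: str = "/tmp/_out", index: int = 0) -> dict:
    """Return small output inline; spill large output to a file and return a preview (sketch)."""
    if len(output) <= SPILL_THRESHOLD:
        return {"result": output}
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{index}.txt")
    with open(path, "w") as f:
        f.write(output)
    preview = output[:PREVIEW_HEAD] + "\n...[truncated]...\n" + output[-PREVIEW_TAIL:]
    return {"saved_to": path, "bytes": len(output), "preview": preview}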

How a user reply integrates (session resumption)

When a user replies to an agent note on GitLab, the orchestrator resumes the conversation by restoring the previous state and appending the reply.

What gets restored from S3

  • context.json – the persisted conversation in canonical (OpenAI Chat Completions-style) form; from_canonical() maps it to the active provider’s native contents before the loop runs
  • sandbox.tar.gz – the sandbox files (fetched data and /tmp/_out/ spills: PipelineRun JSONs, test logs, jq output, etc.) restored into the new sandbox

Because load uses from_canonical(), you can switch models across providers (e.g. Gemini to Claude or the reverse) when resuming, as long as the session was saved with canonical contents.
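
A sketch of the resume path under these assumptions (from_canonical and make_user_content are named elsewhere on this page; the helper, file layout, and whether they are methods on the model are illustrative):

import json
import tarfile

def resume_session(model, session_dir: str, sandbox_dir: str, user_reply: str) -> list:
    """Rebuild contents from the persisted session and append the new user reply (sketch)."""
    with open(f"{session_dir}/context.json") as f:
        canonical = json.load(f)                   # OpenAI Chat Completions-style messages
    contents = model.from_canonical(canonical)     # back to the active provider's native format

    with tarfile.open(f"{session_dir}/sandbox.tar.gz") as tar:
        tar.extractall(sandbox_dir)                # restore fetched data and /tmp/_out/ spills

    contents.append(model.make_user_content(user_reply))
    return contents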

The resumed contents list

// Restored from context.json (previous run)
contents[0]  = user:  {"project":"org/repo","iid":42,...}           # original event
contents[1]  = model: functionCall(gitlab_get_mr_details)           # iter 1
contents[2]  = tool:  functionResponse(gitlab_get_mr_details)       # iter 1
...                                                                  # all prior turns
contents[15] = model: text("## Konflux Failure Analysis...")        # previous final text

// NEW: user reply appended
contents[16] = user:  "Can you look at the clair-scan timeout more closely?"

The agent loop continues

ITERATION 1 (resumed):
  -> Send: system_instruction (+ CONTINUATION_PROMPT) + contents[0..16] + tool_defs
  <- Response: functionCall(sandbox_exec, {command: "grep -i timeout /tmp/data/..."})
                                                      ^ using restored sandbox files
  contents[17] = model: functionCall(sandbox_exec)
  contents[18] = tool:  functionResponse(sandbox_exec)

ITERATION 2 (resumed):
  <- Response: text("The clair-scan timeout is caused by...")
  contents[19] = model: text("The clair-scan timeout is caused by...")
  BREAK

Cold start vs resumed – key differences

  • System prompt – cold start: BASE + tool_notes + workflow; resumed: BASE + tool_notes + workflow + CONTINUATION_PROMPT
  • First user message – cold start: {"project":..., "iid":...}; resumed: the same event, restored
  • Conversation history – cold start: empty (just the event); resumed: full prior conversation
  • Latest user message – cold start: the initial event; resumed: the new reply, e.g. "Can you look at the clair-scan timeout?"
  • Sandbox files – cold start: empty; resumed: restored from archive
  • Tool definitions – same in both cases
  • Session format – cold start: native (per provider) in memory; resumed: canonical on disk, native after from_canonical()

The CONTINUATION_PROMPT

Without this, the workflow prompt (e.g. analyze-failures.md) tells the model to follow a rigid workflow: Data.1, Data.2, Data.3, Analysis.1… The model might try to re-run the entire analysis. The continuation prompt overrides this:

## Continuation

This is a follow-up to a previous conversation. The conversation history
contains your prior analysis and tool calls. The user is replying with a
question or request about your previous work.

**Do NOT re-run the full workflow from scratch.** Instead:
- Respond directly to the user's question
- Use your tools to investigate further if needed (logs, data are still
  available in the sandbox)
- Reference your previous findings where relevant
- Keep your response focused on what the user asked

The model now understands: “I already did the analysis (it’s all in the conversation history). The user has a specific question. Let me answer it.”

Iteration and token budget for resumed sessions

The resumed session gets a full fresh iteration budget. run_workflow sets max_iterations from the workflow config (e.g. 50 for analyze-failures), and the loop counter starts at 0 regardless of how many iterations the previous session used. The model can make as many tool calls as it needs to answer the follow-up.

The practical constraint is the context window, not iterations. The restored contents can be large – a typical 8-iteration cold start uses ~50-100K input tokens. Since the model re-counts the full history every iteration, the resumed session starts at roughly the token count where the previous session ended, plus the new user message. Each new tool call adds more. Token counts logged by the agent are billed tokens (cumulative across API calls; each call re-sends the full history, so these overlap).

The existing CONTEXT_LIMIT check still applies: at 80% of the limit, a soft CONTEXT_WARNING tells the model to start wrapping up. If the limit is exceeded, the next iteration becomes final (FINAL_TURN_WARNING + toolConfig NONE).

For the typical case (user asks one focused follow-up, agent does 1-3 tool calls), this is fine. Multi-turn deep dives will eventually hit the context limit, at which point the agent wraps up – and the user can start a fresh conversation if needed.

Prompt caching

Since the agent replays the full conversation history on every API call, prompt caching significantly reduces the cost of repeated prefixes. Both providers support caching, but with different mechanisms.

Gemini (implicit)

Gemini’s Vertex AI backend automatically caches repeated request prefixes server-side. No request-side annotations are needed. The agent reads cachedContentTokenCount from usageMetadata in each response and reports it as cache_read_tokens. There is no cache_creation_tokens for Gemini – implicit caching has no write surcharge, only a read discount (25% of the base input rate).
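
A sketch of reading those counters from a response's usageMetadata (the surrounding field names are the standard Gemini usage fields):

def gemini_cache_usage(response: dict) -> dict:
    """Extract billed and cached token counts from a Gemini response (sketch)."""
    usage = response.get("usageMetadata", {})
    return {
        "input_tokens": usage.get("promptTokenCount", 0),
        "output_tokens": usage.get("candidatesTokenCount", 0),
        "cache_read_tokens": usage.get("cachedContentTokenCount", 0),  # implicit cache hits
        # no cache_creation_tokens: implicit caching has no write surcharge
    }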

Claude (explicit breakpoints)

Vertex AI does not support Anthropic’s automatic caching. The agent places explicit cache_control: {"type": "ephemeral"} annotations on message content blocks. The API hashes the cumulative prefix – tools, system prompt, and all messages from the start of the request up to the annotated block – and caches the result. A breakpoint on a message therefore covers everything before it; separate breakpoints on the system prompt or tools would be redundant.

On the wire, an annotated message looks like this:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe the sandbox.", "cache_control": {"type": "ephemeral"}}
  ]
}

For messages with multiple content blocks (e.g. tool_result arrays), the cache_control is placed on the last block in the array.

Sliding-window breakpoints

The agent uses two of the four available breakpoint slots per request. A sliding pair of breakpoints (B1, B2) ensures the full conversation prefix is cached and only the newest turn is processed at full price:

  • B2 is placed on the last message (writes the current prefix to cache)
  • B1 is placed where B2 was on the previous call (reads the prior prefix from cache)

Call 1:  [msg0 B2]
         B2: WRITE entire prefix to cache

Call 2:  [msg0 B1] [msg1] [msg2 B2]
         B1: READ  (matches call 1's B2 -- same position, same prefix)
         B2: WRITE (extends cache to include new messages)

Call 3:  [msg0] [msg1] [msg2 B1] [msg3] [msg4 B2]
         B1: READ  (matches call 2's B2)
         B2: WRITE

Call N:  ... [msg(N-2) B1] [msg(N-1)] [msg(N) B2]
         B1: READ  (matches call N-1's B2)
         B2: WRITE

The cache_control metadata is not part of the prefix hash. Moving B1 to a position that previously had B2 (and no longer has cache_control) does not invalidate the cache – the hash matches because the content is identical.

The cache has a 5-minute TTL, refreshed on each hit. Cache writes cost 1.25x the base input rate (25% surcharge); cache reads cost 0.10x (90% discount). The minimum cacheable prefix length is 1024 tokens for Sonnet/Opus and 4096 tokens for Haiku – the system prompt alone exceeds these thresholds.
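
A sketch of the placement logic under these rules (helper names are illustrative; the returned index is what the adapter would remember as the previous B2 position for the next call):

def mark_cache_control(message: dict) -> None:
    """Attach cache_control to the last content block of a message (sketch)."""
    blocks = message["content"]
    if isinstance(blocks, str):                  # normalize plain-string content to a block list
        blocks = message["content"] = [{"type": "text", "text": blocks}]
    blocks[-1]["cache_control"] = {"type": "ephemeral"}

def place_breakpoints(messages: list[dict], prev_b2: int | None) -> int:
    """Annotate the sliding B1/B2 pair on a request-local message list (sketch)."""
    b2 = len(messages) - 1                       # newest message: writes the current prefix
    if prev_b2 is not None and prev_b2 < b2:
        mark_cache_control(messages[prev_b2])    # B1: reads the prefix cached by the previous call
    mark_cache_control(messages[b2])             # B2
    return b2                                    # remembered as prev_b2 for the next call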

Ephemeral turns

When the agent loop injects a transient message (iteration warning, context warning, empty-response nudge), the generate() call receives ephemeral=True. The transient message is appended as a user message, merged with the preceding tool-result user message by _merge_consecutive_user, and popped from contents after the call.

  • B1 is still placed at the previous B2 position, so the cache read for the stable prefix still works
  • B2 is placed on the second-to-last merged message (the last stable message before the merged ephemeral tail), so the cached prefix advances without including transient content
  • The internal B2 position advances to the second-to-last index, so successive ephemeral turns keep sliding B1 forward

This ensures the B1=previous-B2 invariant holds through ephemeral turns:

T1:     U1·B2                                          prev=0
T2:     U1·B1  M1  U2·B2                               prev=2
T3(E):  U1  M1  U2·B1  M2·B2  U3+E                    prev=3
T4(E):  U1  M1  U2  M2·B1  U3  M3·B2  U4+E            prev=5
T5:     U1  M1  U2  M2  U3  M3·B1  U4  M4  U5·B2      prev=8

(U = user message, M = model message, E = ephemeral, +E = merged with preceding user message.)

When the merged message list has fewer than two entries during an ephemeral call (e.g. a single merged user+ephemeral on the first turn), no breakpoints are placed and the B2 position is not updated.
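
Following those rules, the ephemeral variant would shift B2 roughly like this (a sketch extending place_breakpoints from the previous subsection; it matches the T1-T5 trace above):

def place_breakpoints_ephemeral(messages: list[dict], prev_b2: int | None) -> int | None:
    """Ephemeral variant: B2 lands on the last stable message, not the merged ephemeral tail (sketch)."""
    if len(messages) < 2:
        return prev_b2                           # fewer than two merged entries: no breakpoints placed
    b2 = len(messages) - 2                       # second-to-last message excludes the transient content
    if prev_b2 is not None and prev_b2 < b2:
        mark_cache_control(messages[prev_b2])    # B1 still reads the previous prefix
    mark_cache_control(messages[b2])             # B2: the cached prefix still advances
    return b2                                    # internal position keeps sliding for the next call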

Thinking blocks

Extended thinking blocks in assistant responses are part of the cached prefix. They do not break cache hits when the following user message contains only tool_result blocks (the normal case in the agent loop).

Caching and session resumption

After session resumption, the model instance is fresh and has no record of previous breakpoint positions. The first call has no B1 (no cache read), only B2 (cache write). From call 2 onward, caching works normally. This is the same behavior as a cold start.