Gemini 3.1 Pro in Multi-Agent Systems: High Design Quality, 20% Tool-Call Failure Rate

✍️ OpenClawRadar | 📅 Published: February 25, 2026 | 🔗 Source

Architecture and Testing Context

The team behind Bobr, an AI presentation generator, tested Gemini 3.1 Pro within a two-level agent system. The architecture consists of:

  • Orchestrator Agent: Handles conversation, understands user intent, plans structure, and dispatches work via tool calls.
  • Creative Agent (Gemini 3.1 Pro in this test): Receives slide descriptions, generates images, builds templates (1920x1080), and returns results via a submit_slide tool call.

The creative agent's tools include generate_image, search_images, and submit_slide. The submit_slide call is critical: it returns a 'submit' signal, terminates the agent loop, and extracts the slide data. Both agents run through the same loop, which supports streaming, parallel tool execution, and iteration limits.
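The loop described above can be sketched roughly as follows. This is a minimal illustration, not Bobr's implementation: the names ToolCall, ModelTurn, run_agent, and model_step are hypothetical, tools run sequentially rather than in parallel, and streaming is omitted. It only shows why submit_slide is the sole path that returns data.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class ModelTurn:
    text: str = ""
    tool_calls: list = field(default_factory=list)


def run_agent(
    model_step: Callable[[list], ModelTurn],
    tools: dict,
    max_iterations: int = 8,
) -> Optional[dict]:
    """Run until submit_slide is called or the iteration limit is hit.

    model_step is a stand-in for one model call; it receives the history
    and returns the next turn. Returns the submitted slide data, or None
    if the model never called submit_slide.
    """
    history: list = []
    for _ in range(max_iterations):
        turn = model_step(history)
        history.append(("model", turn))
        for call in turn.tool_calls:
            if call.name == "submit_slide":
                # Hard exit: the arguments are the slide data.
                return call.args
            result = tools[call.name](**call.args)
            history.append(("tool", call.name, result))
    # Plain-text answers never terminate the loop, so the run ends empty.
    return None
```

A turn that contains only text, such as "Here is your beautiful slide...", simply consumes one iteration, which is why the failure patterns discussed later surface as empty results.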

Strengths: Design and Aesthetic Output

When Gemini 3.1 Pro works correctly, it produces better design output than the other models tested (Claude Sonnet 4.6 and GPT-5.2). Specific strengths include:

  • Aesthetic intuition: Better color theory and visual hierarchy.
  • Layout creativity: Experiments with asymmetric compositions, overlapping elements, and modern UI styles like dark-mode/glassmorphism.
  • Vibe interpretation: Effectively handles vague prompts like "make it feel premium" or "tech startup vibes."
  • Code quality: Generates modern, well-structured HTML/CSS.

Critical Problems in Production

The team encountered two major reliability issues with Gemini 3.1 Pro in their agentic pipeline:

1. ~20% Tool-Call Failure Rate

In approximately 20% of requests, Gemini 3.1 Pro fails to call the required submit_slide tool. Instead, it exhibits several failure patterns:

  • Outputs the raw HTML template as plain text, describing what it "would" create rather than triggering the tool.
  • Generates images correctly but stops without submitting, hitting iteration limits.
  • Calls image generation tools but writes natural language summaries ("Here is your beautiful slide...") instead of the final tool call.
  • Enters loops refining design descriptions in text without committing to action.

Because submit_slide is the hard exit path, each of these failures returns no data to the orchestrator, and the user's generation fails.
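For monitoring, the four failure patterns listed above could be told apart with a small heuristic like the one below. This is a sketch under stated assumptions: the function name classify_run, its inputs, the label strings, and the HTML-detection regex are all hypothetical, and the source does not say the team classifies failures this way.

```python
import re

# Rough signal that the model dumped a template as text (assumed tag set).
HTML_TAG = re.compile(r"<(html|div|section|style|body)\b", re.IGNORECASE)

IMAGE_TOOLS = ("generate_image", "search_images")


def classify_run(tool_names, final_text, hit_iteration_limit):
    """Map one creative-agent run to a coarse outcome label."""
    if "submit_slide" in tool_names:
        return "ok"
    if HTML_TAG.search(final_text):
        return "raw_html_as_text"
    used_images = any(name in IMAGE_TOOLS for name in tool_names)
    if used_images and hit_iteration_limit:
        return "images_but_no_submit"
    if used_images:
        return "summary_instead_of_submit"
    if hit_iteration_limit:
        return "text_refinement_loop"
    return "unknown"
```

Counting these labels over a batch of runs would show which pattern dominates the roughly 20% of failed requests, although the boundaries between the categories are fuzzy in practice.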

2. Garbled/Corrupted Output

The model frequently returns corrupted text in responses: random character sequences, broken Unicode, and half-encoded strings. This corruption sometimes bleeds into slide content (variable values, template markup), so even a successful submission can display gibberish in the presentation.
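Some of this corruption can be caught before a slide reaches the user. The check below is a minimal sketch, assuming a hypothetical looks_corrupted helper and a small hand-picked list of mojibake markers. It flags the Unicode replacement character, two common UTF-8-decoded-as-Latin-1 sequences, and unexpected control, surrogate, private-use, or unassigned code points. It will not catch random but valid character sequences.

```python
import unicodedata

# Assumed markers: mis-decoded "é", mis-decoded smart punctuation prefix,
# and the Unicode replacement character.
MOJIBAKE_MARKERS = ("\u00c3\u00a9", "\u00e2\u20ac", "\ufffd")

# Control, surrogate, private-use, and unassigned categories.
SUSPECT_CATEGORIES = ("Cc", "Cs", "Co", "Cn")


def looks_corrupted(text: str) -> bool:
    """Return True if the text shows obvious signs of encoding damage."""
    if any(marker in text for marker in MOJIBAKE_MARKERS):
        return True
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace controls are fine
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            return True
    return False
```

Running such a check on the slide's variable values and template markup would let a pipeline reject or retry a submission instead of rendering gibberish.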

Comparison with Other Models

  • Claude Sonnet 4.6: Near-zero failure rate on submit_slide calls in the same creative agent role, described as "boringly reliable" with no garbled output.
  • GPT-5.2: Tool reliability falls between that of Gemini and Claude, and it does not suffer from the encoding/gibberish issues.

Attempted Mitigations

The team tried several approaches without significant improvement:

  • Adding aggressive explicit instructions in system prompts: "You MUST call submit_slide. Do not output the template as text."
  • Injecting few-shot examples showing exact expected tool-call patterns.
  • Reducing iteration limits to force faster convergence.
  • Stripping down and simplifying tool schemas.
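To make the first and last mitigations concrete, here is roughly what an explicit instruction plus a stripped-down tool schema might look like. The schema shape is a generic JSON-Schema-style dict rather than any provider's exact format, and the single html field, the description text, and the missing_fields helper are assumptions for illustration. Only the quoted instruction comes from the source.

```python
# Instruction quoted in the source, appended to the system prompt.
SYSTEM_PROMPT_SUFFIX = (
    "You MUST call submit_slide. Do not output the template as text."
)

# Hypothetical simplified schema: a single required string field.
SUBMIT_SLIDE_TOOL = {
    "name": "submit_slide",
    "description": "Submit the finished 1920x1080 slide template.",
    "parameters": {
        "type": "object",
        "properties": {"html": {"type": "string"}},
        "required": ["html"],
    },
}


def missing_fields(tool: dict, args: dict) -> list:
    """List required parameters that a tool call left out."""
    return [key for key in tool["parameters"]["required"] if key not in args]
```

As the source notes, changes of this kind did not significantly reduce the failure rate, so the sketch illustrates what was tried rather than a fix.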

Despite these issues, Gemini 3.1 Pro remains live in their system due to its superior design capabilities when it functions correctly.

📖 Read the full source: r/LocalLLaMA
