Gemini 3.1 Pro in Multi-Agent Systems: High Design Quality, 20% Tool-Call Failure Rate

Architecture and Testing Context
The team behind Bobr, an AI presentation generator, tested Gemini 3.1 Pro within a two-level agent system. The architecture consists of:
- Orchestrator Agent: Handles conversation, understands user intent, plans structure, and dispatches work via tool calls.
- Creative Agent (Gemini 3.1 Pro in this test): Receives slide descriptions, generates images, builds templates (1920x1080), and returns results via a
submit_slidetool call.
The creative agent has tools including generate_image, search_images, and submit_slide. The submit_slide call is critical—it returns a 'submit' signal, terminates the agent loop, and extracts slide data. Both agents run through the same loop with streaming, parallel tool execution, and iteration limits.
Strengths: Design and Aesthetic Output
When Gemini 3.1 Pro works correctly, it produces superior design output compared to other models tested (Claude Sonnet 4.6 and GPT-5.2). Specific strengths include:
- Aesthetic intuition: Better color theory and visual hierarchy.
- Layout creativity: Experiments with asymmetric compositions, overlapping elements, and modern UI styles like dark-mode/glassmorphism.
- Vibe interpretation: Effectively handles vague prompts like "make it feel premium" or "tech startup vibes."
- Code quality: Generates modern, structural HTML/CSS.
Critical Problems in Production
The team encountered two major reliability issues with Gemini 3.1 Pro in their agentic pipeline:
1. ~20% Tool-Call Failure Rate
In approximately 20% of requests, Gemini 3.1 Pro fails to call the required submit_slide tool. Instead, it exhibits several failure patterns:
- Outputs raw HTML template as plain text, describing what it "would" create rather than triggering the tool.
- Generates images correctly but stops without submitting, hitting iteration limits.
- Calls image generation tools but writes natural language summaries ("Here is your beautiful slide...") instead of the final tool call.
- Enters loops refining design descriptions in text without committing to action.
Since submit_slide is the hard exit path, failures result in no data returned to the orchestrator and failed user generations.
2. Garbled/Corrupted Output
The model frequently returns corrupted text in responses—random character sequences, broken Unicode, half-encoded strings. This corruption sometimes bleeds into slide content (variable values, template markup), meaning even successful submissions might display gibberish text in presentations.
Comparison with Other Models
- Claude Sonnet 4.6: Near-zero failure rate on
submit_slidecalls in the same creative agent role, described as "boringly reliable" with no garbled output. - GPT-5.2: Moderate tool reliability between Gemini and Claude, but doesn't suffer from encoding/gibberish issues.
Attempted Mitigations
The team tried several approaches without significant improvement:
- Adding aggressive explicit instructions in system prompts: "You MUST call submit_slide. Do not output the template as text."
- Injecting few-shot examples showing exact expected tool-call patterns.
- Reducing iteration limits to force faster convergence.
- Stripping down and simplifying tool schemas.
Despite these issues, Gemini 3.1 Pro remains live in their system due to its superior design capabilities when it functions correctly.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Distillery: A Claude Code Plugin for Persistent Team Context
Distillery is a plugin for Claude Code that provides teams with shared, persistent context across sessions and people. Version 0.2.0 adds hybrid search, auth audit logging, and uv support.

Claude-Control: Mobile Remote Control for Claude Code Sessions
Claude-control is an open-source tool that lets you manage Claude Code sessions from your phone via HTTPS and WebSocket. It runs Claude Code in a real PTY inside tmux, detects permission prompts, and sends push notifications with Allow/Deny buttons.

MOOSE-Star: A 7B Model and 108K-Paper Dataset for Scientific Hypothesis Discovery – ICML 2026
MiroMind releases MOOSE-Star on Hugging Face: a 7B model (DeepSeek-R1-Distill-Qwen-7B fine-tune) for scientific hypothesis discovery, alongside the 108K-paper TOMATO-Star dataset. Benchmark shows MS-7B achieves 54.34% inspiration retrieval accuracy, beating GPT-5.4 and approaching Gemini-3 Pro.

Self-updating translation system for OpenClaw maintains domain glossaries automatically
A Python script wraps the Kimi2.5 API to translate .srt files while preserving block indices, timestamps, and segmentation. The system uses project profiles with glossary.json, style.md, and memory.jsonl files, and includes a cron job that scans official sources every 6 hours to update terminology.