Running 6-Agent Coaching Pipeline on Qwen3 235B with vLLM

Multi-agent behavioral coaching system

A developer has implemented a 6-agent cognitive pipeline for behavioral coaching that runs entirely on self-hosted Qwen3 models via vLLM. The system uses Claude Code instances as agents calling a vLLM endpoint, with four specialist agents firing simultaneously on each user message.

Hardware and setup

Development: Qwen3 30B on 2x RTX 4090s
Production: Qwen3 235B on RunPod A40 pods
All 6 agents are Claude Code instances calling the vLLM endpoint

Pipeline architecture

Each user message triggers 6 agents in sequence:

Shadow - Runs first, writes cross-session behavioral patterns to a shared blackboard (stated goals vs revealed priorities, follow-through prediction, pattern classification)
Persona - OCEAN scoring, recurring goal detection, follow-through prediction percentages, growth edge identification
Plasticity - Personality-informed coaching strategy, maps OCEAN scores to communication preferences
Stability - Risk framework with severity/detectability/reversibility ratings, identifies blocked moves the coach should not suggest
Coach - Fires early for an immediate response while the other agents process (~seconds)
Synth (Pineal) - Merges all worker outputs, applies voice calibration, delivers the full response

Performance characteristics

The user sees an immediate Coach response, then the full synthesis appends approximately 40 seconds later on 2x RTX 4090s. On the A40 configuration, this takes about 108 seconds - counterintuitively slower due to different memory architecture.

Key implementation insights

What worked:

Parallel dispatch is the key unlock for performance
Shadow must write first because synthesis needs the blackboard content to aggregate correctly
The sequencing logic to guarantee Shadow completes before Synth picks up adds meaningful complexity but is non-negotiable
Context management at 235B scale is expensive - each agent gets a full context brief plus session history
Aggressive compaction between sessions and tight per-agent context budgets have been the main reliability levers

What is hard:

Getting agents to write structured output reliably enough for synthesis to aggregate without hallucinating merge artifacts
Main failure mode: Synth seeing conflicting signals from Persona and Stability on the same session

The developer is seeking input from others running multi-agent systems on self-hosted inference, particularly regarding parallelism strategies at 235B scale.

📖 Read the full source: r/LocalLLaMA