Running a 6-agent behavioral coaching pipeline on self-hosted Qwen3 235B with vLLM

Multi-agent behavioral coaching system
A developer has implemented a 6-agent cognitive pipeline for behavioral coaching that runs entirely on self-hosted Qwen3 models via vLLM. The system uses Claude Code instances as agents calling a vLLM endpoint, with four specialist agents firing simultaneously on each user message.
Hardware and setup
- Development: Qwen3 30B on 2x RTX 4090s
- Production: Qwen3 235B on RunPod A40 pods
- All 6 agents are Claude Code instances calling the vLLM endpoint
Pipeline architecture
Each user message triggers 6 agents in sequence:
- Shadow - Runs first, writes cross-session behavioral patterns to a shared blackboard (stated goals vs revealed priorities, follow-through prediction, pattern classification)
- Persona - OCEAN scoring, recurring goal detection, follow-through prediction percentages, growth edge identification
- Plasticity - Personality-informed coaching strategy, maps OCEAN scores to communication preferences
- Stability - Risk framework with severity/detectability/reversibility ratings, identifies blocked moves the coach should not suggest
- Coach - Fires early for an immediate response while the other agents process (~seconds)
- Synth (Pineal) - Merges all worker outputs, applies voice calibration, delivers the full response
Performance characteristics
The user sees an immediate Coach response, then the full synthesis appends approximately 40 seconds later on 2x RTX 4090s. On the A40 configuration, this takes about 108 seconds - counterintuitively slower due to different memory architecture.
Key implementation insights
What worked:
- Parallel dispatch is the key unlock for performance
- Shadow must write first because synthesis needs the blackboard content to aggregate correctly
- The sequencing logic to guarantee Shadow completes before Synth picks up adds meaningful complexity but is non-negotiable
- Context management at 235B scale is expensive - each agent gets a full context brief plus session history
- Aggressive compaction between sessions and tight per-agent context budgets have been the main reliability levers
What is hard:
- Getting agents to write structured output reliably enough for synthesis to aggregate without hallucinating merge artifacts
- Main failure mode: Synth seeing conflicting signals from Persona and Stability on the same session
The developer is seeking input from others running multi-agent systems on self-hosted inference, particularly regarding parallelism strategies at 235B scale.
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenClaw Orchestrator Routing Issues: When Delegation Fails
A developer reports their OpenClaw main orchestrator incorrectly handles requests itself about 40-50% of the time instead of routing to specialist sub-agents, despite using an explicit routing table and delegation rules. The setup includes 7 specialist agents for services like Gmail, Todoist, Notion, and weather.

Debugging a Pi Zero 2W BadUSB with Claude Code: Fixing an 'Impossible' Bug
A developer rebuilt a Pi Zero 2W BadUSB toolkit with Claude Code, which diagnosed a wrong-signal bug, empirically confirmed hardware limitations, and fixed a silent Python no-op in under 4 hours.

Using Local LLM to Monitor Minecraft Bot AFK Sessions
A developer used a local LLM to monitor their Minecraft bot running Baritone for mining jobs, setting up screen monitoring to receive alerts when the bot dies or disconnects from the server.

OpenClaw User Switches to RunLobster for Managed Infrastructure
A developer spent 4 months troubleshooting OpenClaw issues including agent stalling, config breaks, and unpredictable API costs before switching to RunLobster. The same models and framework worked reliably with multi-step task completion and faster integrations.