Running a 6-agent behavioral coaching pipeline on self-hosted Qwen3 235B with vLLM

✍️ OpenClawRadar📅 Published: April 1, 2026🔗 Source
Running a 6-agent behavioral coaching pipeline on self-hosted Qwen3 235B with vLLM
Ad

Multi-agent behavioral coaching system

A developer has implemented a 6-agent cognitive pipeline for behavioral coaching that runs entirely on self-hosted Qwen3 models via vLLM. The system uses Claude Code instances as agents calling a vLLM endpoint, with four specialist agents firing simultaneously on each user message.

Hardware and setup

  • Development: Qwen3 30B on 2x RTX 4090s
  • Production: Qwen3 235B on RunPod A40 pods
  • All 6 agents are Claude Code instances calling the vLLM endpoint

Pipeline architecture

Each user message triggers 6 agents in sequence:

  • Shadow - Runs first, writes cross-session behavioral patterns to a shared blackboard (stated goals vs revealed priorities, follow-through prediction, pattern classification)
  • Persona - OCEAN scoring, recurring goal detection, follow-through prediction percentages, growth edge identification
  • Plasticity - Personality-informed coaching strategy, maps OCEAN scores to communication preferences
  • Stability - Risk framework with severity/detectability/reversibility ratings, identifies blocked moves the coach should not suggest
  • Coach - Fires early for an immediate response while the other agents process (~seconds)
  • Synth (Pineal) - Merges all worker outputs, applies voice calibration, delivers the full response
Ad

Performance characteristics

The user sees an immediate Coach response, then the full synthesis appends approximately 40 seconds later on 2x RTX 4090s. On the A40 configuration, this takes about 108 seconds - counterintuitively slower due to different memory architecture.

Key implementation insights

What worked:

  • Parallel dispatch is the key unlock for performance
  • Shadow must write first because synthesis needs the blackboard content to aggregate correctly
  • The sequencing logic to guarantee Shadow completes before Synth picks up adds meaningful complexity but is non-negotiable
  • Context management at 235B scale is expensive - each agent gets a full context brief plus session history
  • Aggressive compaction between sessions and tight per-agent context budgets have been the main reliability levers

What is hard:

  • Getting agents to write structured output reliably enough for synthesis to aggregate without hallucinating merge artifacts
  • Main failure mode: Synth seeing conflicting signals from Persona and Stability on the same session

The developer is seeking input from others running multi-agent systems on self-hosted inference, particularly regarding parallelism strategies at 235B scale.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also