Qwen 3.6 27B at 52.8 tps TG on AMD MI50s: Full Precision, No MTP, No Quant

A Reddit user has published benchmark results for running Qwen3.6-27B (full precision, no quantization) on eight AMD MI50s (2018 GPUs) using a custom vllm fork. The system achieves 52.8 tokens per second (tps) for text generation and 1569 tps for prompt processing with TP8, no MTP, and no flash attention optimizations that might slow down large prompts.
Key Details
- Hardware: 8x AMD MI50s, PCIe (no PCIe switch used yet)
- Engine: vllm fork v0.20.1 with ROCm 7.2.1 – github.com/ai-infos/vllm-gfx906-mobydick
- Model:
Qwen/Qwen3.6-27B(HuggingFace full precision FP16) - Quantization: None – full FP16 precision
- MTP: Disabled (slower for large prompts)
- Flash attention: Not used (triton-based AMD flash attention also slower for big prompts)
- Prompt: Single inference with 1K and 15K token prompts (bench used 10K input, 1K output)
Benchmark Results
Successful requests: 4 Total input tokens: 40000 Total generated tokens: 4000 Output token throughput (tok/s): 32.91 Peak output token throughput (tok/s): 56.00 Total token throughput (tok/s): 362.03 Mean TTFT (ms): 32874.56 Mean TPOT (ms): 88.66 Mean ITL (ms): 88.66
Note: The user reports 52.8 tps TG for single inference with 15K prompt; the benchmark shows aggregate results over 4 requests at 10K input each. With TP2, the model also fits and runs at ~34 tps TG.
Setup Commands (Docker + vllm serve)
docker run -it --name vllm-gfx906-mobydick \
-v /llm:/llm --network host \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add $(getent group render | cut -d: -f3) \
--ipc=host \
aiinfos/vllm-gfx906-mobydick:v0.20.1rc0.x-rocm7.2.1-pytorch2.11.0 \
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm serve \
/llm/models/Qwen3.6-27B \
--served-model-name Qwen3.6-27B \
--dtype float16 \
--max-model-len auto \
--max-num-batched-tokens 8192 \
--block-size 64 \
--gpu-memory-utilization 0.98 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--mm-processor-cache-gb 1 \
--limit-mm-per-prompt.image 1 --limit-mm-per-prompt.video 1 \
--skip-mm-profiling \
--default-chat-template-kwargs '{"min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \
--tensor-parallel-size 8 \
--host 0.0.0.0 --port 8000 2>&1 | tee log.txt
Who It's For
Developers running agentic coding tools (e.g., Claude Code, Hermes) on AMD hardware, especially with large prompts and full-precision requirements.
The user notes that further improvements are possible with PCIe switches (lower latency), more optimized flash attention/MTP for ROCm/gfx906, and updated software stacks.
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenAI Training Costs Projected to Exceed Anthropic's by 4-5 Times Annually
According to confidential financials reported by the Wall Street Journal, OpenAI expects to spend 4-5 times more on training than Anthropic each year for the next five years. The expense scale is described as mind-boggling.

Local LLM Benchmark: Backend Generation by Function Calling – GLM, Qwen, DeepSeek Compared
A rigorous benchmark of local and frontier LLMs for backend code generation via function calling, with scoring rubric. Key findings: qwen3.5-35b-a3b matches gpt-5.4 on DB/API design, and dense Qwen 27B beats 397B MoE. Frontier models dropped due to cost.

Adaptive Inference Routing Proposal for AI Query Efficiency
A proposal submitted to Anthropic in April 2026 outlines a five-step system for routing queries to appropriate AI models based on complexity scoring, using simple signals like character count and sentence count before any model inference occurs.

Claude outperforms Gemini, ChatGPT, and Grok in real-time Python coding challenge
A developer tested Claude, Gemini, ChatGPT, and Grok in a real-time Python coding tournament where AI-generated bots competed to find words on a 15×15 letter grid. Claude won decisively.