RTX 5060 Ti 16GB Local LLM Benchmarks: 30B Models Still Lead for Coding

✍️ OpenClawRadar · 📅 Published: April 19, 2026

RTX 5060 Ti 16GB Local LLM Performance Findings

Testing on an RTX 5060 Ti 16GB with 32GB of DDR4 RAM, using llama-server b8373 (46dba9fce), shows how local LLM coding workflows perform in practice on this card. The setup ran llama.cpp on its fast path with flash attention enabled (fa=on), automatic GPU layer offload (ngl=auto), 8 threads, and a q8_0-quantized KV cache (-ctk q8_0 -ctv q8_0).
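As a rough illustration of the launch settings above, the sketch below assembles a llama-server command line in Python. The helper name `base_server_args` and the model path are hypothetical placeholders, and the mapping of fa, ngl, and threads to the short flags `-fa`, `-ngl`, and `-t` is an assumption about how the settings were passed, since the original post's exact command is not reproduced here.

```python
import shlex


def base_server_args(model_path: str, threads: int = 8) -> list[str]:
    """Build the shared llama-server arguments described in the article.

    model_path is a placeholder; the source does not give file paths.
    """
    return [
        "llama-server",
        "-m", model_path,     # hypothetical GGUF path
        "-fa", "on",          # flash attention on (fa=on)
        "-ngl", "auto",       # let llama.cpp pick the GPU layer count (ngl=auto)
        "-t", str(threads),   # CPU threads (threads=8)
        "-ctk", "q8_0",       # q8_0-quantized K cache
        "-ctv", "q8_0",       # q8_0-quantized V cache
    ]


args = base_server_args("models/example.gguf")
print(shlex.join(args))
```

Building the arguments as a list and joining them only for display avoids shell-quoting problems if the list is later handed to `subprocess.run`.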

Model Performance Results

The benchmark compared multiple quantized models with these key findings:

Best default coding model: Unsloth Qwen3-Coder-30B UD-Q3_K_XL
Best higher-context coding option: Same Unsloth 30B model at 96k context
Best fast 35B coding option: Unsloth Qwen3.5-35B UD-Q2_K_XL

Performance Metrics

Token generation speeds from local testing:

Jackrong Qwen 3.5 4B Q5_K_M: 88 tok/s
LuffyTheFox Qwen 3.5 9B Q4_K_M: 64 tok/s
Jackrong Qwen 3.5 27B Q3_K_S: ~20 tok/s
Unsloth Qwen3-Coder-30B UD-Q3_K_XL: 76.3 tok/s
Unsloth Qwen 3.5 35B UD-Q2_K_XL: 80.1 tok/s

Cross-Platform Comparison

Matched tests with 20 questions, 32k context, and max_tokens=800 showed:

Unsloth Qwen3-Coder-30B UD-Q3_K_XL: Windows: 79.5 tok/s, quality 7.94 | Ubuntu: 76.3 tok/s, quality 8.14
Unsloth Qwen3.5-35B UD-Q2_K_XL: Windows: 72.3 tok/s, quality 7.40 | Ubuntu: 80.1 tok/s, quality 7.39
Jackrong Qwen3.5-27B Claude-Opus Distilled Q3_K_S: Windows: 19.9 tok/s, quality 8.85 | Ubuntu: ~20.0 tok/s, quality 8.21
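To make the Windows versus Ubuntu gap easier to read, the short sketch below computes the relative throughput difference from the figures listed above. The function name `ubuntu_vs_windows_pct` is illustrative, and the 27B Ubuntu value is only approximate in the source (~20.0 tok/s), so its result should be treated as indicative.

```python
# (windows_tok_s, ubuntu_tok_s) taken from the matched tests above
results = {
    "Qwen3-Coder-30B UD-Q3_K_XL": (79.5, 76.3),
    "Qwen3.5-35B UD-Q2_K_XL": (72.3, 80.1),
    "Qwen3.5-27B Q3_K_S": (19.9, 20.0),  # Ubuntu figure is approximate in the source
}


def ubuntu_vs_windows_pct(windows: float, ubuntu: float) -> float:
    """Ubuntu throughput relative to Windows, as a signed percentage."""
    return (ubuntu - windows) / windows * 100.0


for name, (win, ubu) in results.items():
    print(f"{name}: {ubuntu_vs_windows_pct(win, ubu):+.1f}%")
```

On these numbers the 30B coder model is roughly 4% slower on Ubuntu, the 35B model is roughly 11% faster, and the 27B model is essentially unchanged.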

Configuration Notes

The 30B coder path used: jinja, reasoning-budget 0, reasoning-format none. The 35B UD path used: c=262144, n-cpu-moe=8. For the 35B Q4_K_M stable tune, settings were: -ngl 26 -c 131072 --fit on --fit-ctx 131072 --fit-target 512M.
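The three configurations above could be kept side by side in a small lookup table, as in the sketch below. The path labels, the `server_command` helper, and the model paths are hypothetical; the flag values are copied from the settings listed above, with the assumption that jinja, reasoning-budget, reasoning-format, and n-cpu-moe were passed as double-dash long options.

```python
# Extra llama-server flags per configuration, copied from the settings above.
# The dictionary keys are illustrative labels, not names used by the author.
PATH_FLAGS = {
    "30b-coder": ["--jinja", "--reasoning-budget", "0", "--reasoning-format", "none"],
    "35b-ud": ["-c", "262144", "--n-cpu-moe", "8"],
    "35b-q4km": [
        "-ngl", "26",
        "-c", "131072",
        "--fit", "on",
        "--fit-ctx", "131072",
        "--fit-target", "512M",
    ],
}


def server_command(path_name: str, model_path: str) -> list[str]:
    """Return a llama-server argument list for one of the configurations."""
    return ["llama-server", "-m", model_path, *PATH_FLAGS[path_name]]


# Hypothetical model path, for illustration only.
cmd = server_command("35b-ud", "models/example-35b.gguf")
print(" ".join(cmd))
```

Keeping each path's flags in one place makes it easier to rerun the same comparison after a llama.cpp update without retyping long command lines.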

Notably, the 35B Q4_K_M model needed specific tuning to run stably on this card, yet it still didn't outperform the older UD-Q2_K_XL path in practical use. The author found that neither the smaller-model route (9B) nor the heavier experiment (35B Q4_K_M) turned out to be the strongest real-world pick, despite expectations.

Ubuntu Performance Testing

Additional focused testing on Ubuntu with the Jackrong 27B model showed minimal variation:

-fa on, auto parallel: 19.95 tok/s
-fa auto, auto parallel: 19.56 tok/s
-fa on, --parallel 1: 19.26 tok/s

Across these flash-attention and parallel settings, the model stayed between 19.26 and 19.95 tok/s, so neither parameter had a meaningful impact on its generation speed.
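The size of that variation can be checked with a few lines of arithmetic. The sketch below uses the three Ubuntu measurements listed above; the variable names are illustrative.

```python
# Ubuntu results for the Jackrong 27B model, in tok/s, from the list above
runs = {
    "-fa on, auto parallel": 19.95,
    "-fa auto, auto parallel": 19.56,
    "-fa on, --parallel 1": 19.26,
}

fastest = max(runs.values())
slowest = min(runs.values())

# Spread between the best and worst run, relative to the best run
spread_pct = (fastest - slowest) / fastest * 100.0

print(f"fastest={fastest} tok/s, slowest={slowest} tok/s, spread={spread_pct:.2f}%")
```

The gap between the best and worst run is about 0.7 tok/s, or roughly 3.5% of the fastest result.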

📖 Read the full source: r/LocalLLaMA
