RTX 5060 Ti 16GB Local LLM Benchmarks: 30B Models Still Lead for Coding

RTX 5060 Ti 16GB Local LLM Performance Findings
Testing on an RTX 5060 Ti 16GB with 32GB DDR4 RAM using llama-server b8373 (46dba9fce) reveals practical performance characteristics for local LLM coding workflows. The setup used llama.cpp with specific launch settings: fast path with fa=on, ngl=auto, threads=8, and KV settings -ctk q8_0 -ctv q8_0.
Model Performance Results
The benchmark compared multiple quantized models with these key findings:
- Best default coding model: Unsloth Qwen3-Coder-30B UD-Q3_K_XL
- Best higher-context coding option: Same Unsloth 30B model at 96k context
- Best fast 35B coding option: Unsloth Qwen3.5-35B UD-Q2_K_XL
Performance Metrics
Token generation speeds from local testing:
- Jackrong Qwen 3.5 4B Q5_K_M: 88 tok/s
- LuffyTheFox Qwen 3.5 9B Q4_K_M: 64 tok/s
- Jackrong Qwen 3.5 27B Q3_K_S: ~20 tok/s
- Unsloth Qwen 3.0 30B UD-Q3_K_XL: 76.3 tok/s
- Unsloth Qwen 3.5 35B UD-Q2_K_XL: 80.1 tok/s
Cross-Platform Comparison
Matched tests with 20 questions, 32k context, and max_tokens=800 showed:
- Unsloth Qwen3-Coder-30B UD-Q3_K_XL: Windows: 79.5 tok/s, quality 7.94 | Ubuntu: 76.3 tok/s, quality 8.14
- Unsloth Qwen3.5-35B UD-Q2_K_XL: Windows: 72.3 tok/s, quality 7.40 | Ubuntu: 80.1 tok/s, quality 7.39
- Jackrong Qwen3.5-27B Claude-Opus Distilled Q3_K_S: Windows: 19.9 tok/s, quality 8.85 | Ubuntu: ~20.0 tok/s, quality 8.21
Configuration Notes
The 30B coder path used: jinja, reasoning-budget 0, reasoning-format none. The 35B UD path used: c=262144, n-cpu-moe=8. For the 35B Q4_K_M stable tune, settings were: -ngl 26 -c 131072 --fit on --fit-ctx 131072 --fit-target 512M.
Notably, the 35B Q4_K_M model required specific tuning to run stable on this card but still didn't outperform the older UD-Q2_K_XL path in practical use. The author found that smaller models (9B route) and heavier experiments (35B Q4_K_M) weren't the strongest real-world picks despite expectations.
Ubuntu Performance Testing
Additional focused testing on Ubuntu with the Jackrong 27B model showed minimal variation:
-fa on, auto parallel: 19.95 tok/s-fa auto, auto parallel: 19.56 tok/s-fa on,--parallel 1: 19.26 tok/s
Flash-attention settings and parallel processing parameters had negligible impact on this particular model's performance.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Detecting Silent Tool Failures in AI Coding Agents with Vibeyard
Vibeyard is a tool that detects when AI coding agents experience silent tool failures—where agents fall back to alternative strategies without alerting developers—and surfaces these inefficiencies during sessions. It can suggest fixes to prevent repeated inefficient workflows.

Octopoda: Open Source Memory Layer for Local AI Agents
Octopoda is an open source memory layer that gives local AI agents persistent memory between sessions, semantic search, loop detection, and crash recovery. It runs fully offline with a 33MB embedding model and integrates with LangChain, CrewAI, AutoGen, and OpenAI Agents SDK.

singularity-claude: A Self-Evolving Skill Engine for Claude Code
singularity-claude is an open-source Claude Code plugin that adds a recursive evolution loop to prevent skill rot. It scores skill executions, auto-repairs low-scoring skills, crystallizes high-performing versions, and detects capability gaps.

LLMSpend: Open-source cost tracker for Anthropic and OpenAI SDKs
LLMSpend is a Python library that adds cost tracking to Anthropic and OpenAI SDK calls with two lines of code. It provides local SQLite storage, CLI reporting, and a web dashboard without sending data externally.