When RLVR Helps Small Fine-Tuned Models: A 12-Dataset Analysis

A recent controlled experiment tested whether adding a reinforcement learning with verifiable rewards (RLVR) stage on top of supervised fine-tuning (SFT) measurably improves small language models (1.7B parameters). The team ran it across 12 datasets to determine when the extra stage helps and when it doesn't.
Key Findings
The results split cleanly by task type:
- Text generation tasks (QA, documentation, PII redaction): +2.0 percentage points on average. Every dataset in this category improved.
- Structured tasks (classification, function calling): -0.7 percentage points on average. Two datasets in this category regressed.
Why This Pattern Emerges
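The researchers explain that once a fine-tuned model already gets most structured outputs correct, GRPO (Group Relative Policy Optimization) produces near-zero gradients. GRPO scores each sampled completion relative to the others in its group, so when nearly every sample earns the same reward, the advantages are close to zero and the reinforcement learning stage has no learning signal left to work with.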
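The near-zero gradient argument can be illustrated with a minimal sketch of a group-relative advantage calculation. This is not the study's code: the function name, the `eps` value, and the example rewards are assumptions, and the normalization shown (reward minus group mean, divided by group standard deviation) is the commonly described GRPO form.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `eps` only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Assumed example: a saturated structured task where every sample is correct.
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Assumed example: a task where samples still disagree in quality.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

print(saturated)  # all advantages are zero, so the policy update vanishes
print(mixed)      # positive for rewarded samples, negative for the rest
```

When every completion in a group receives the same reward, each advantage is zero and the policy-gradient term contributes nothing, which matches the explanation given for structured tasks.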
For generative tasks, the output space is large enough that RL keeps finding improvements that SFT misses, particularly when the reward measures semantic correctness rather than exact string matching.
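To see why the reward choice matters, the sketch below contrasts an exact-match reward with a softer one. The source does not say how semantic correctness was scored, so token-overlap F1 is used here purely as a stand-in; both function names and the example strings are assumptions.

```python
from collections import Counter

def exact_match_reward(prediction: str, reference: str) -> float:
    """1.0 only when the strings are identical after trimming whitespace."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def token_f1_reward(prediction: str, reference: str) -> float:
    """Token-overlap F1: a crude stand-in for a semantic reward (assumption)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Assumed example pair: the answer is nearly right but not string-identical.
reference = "the cache is cleared on restart"
prediction = "cache is cleared on restart"

strict = exact_match_reward(prediction, reference)
graded = token_f1_reward(prediction, reference)
print(strict, round(graded, 3))
```

A graded reward separates a nearly correct answer from a wrong one, so samples within a group receive different scores and the RL stage has something to optimize. Exact matching would score both as zero.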
Practical Decision Rule
The study provides a simple guideline for developers:
- Classification or strict function calling → Use SFT only
- QA, documentation, extraction tasks → Add RLVR on top of SFT
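The guideline above can be written as a small lookup. This is only an illustrative sketch: the task-type labels, the return strings, and the function name are hypothetical, and task types the guideline does not mention are rejected rather than guessed.

```python
# Hypothetical labels; the groupings follow the guideline above.
SFT_ONLY_TASKS = {"classification", "function_calling"}
SFT_PLUS_RLVR_TASKS = {"qa", "documentation", "extraction"}

def recommend_training(task_type: str) -> str:
    """Return "sft" or "sft+rlvr" for a known task type."""
    key = task_type.strip().lower()
    if key in SFT_ONLY_TASKS:
        return "sft"
    if key in SFT_PLUS_RLVR_TASKS:
        return "sft+rlvr"
    # The guideline does not cover other task types, so do not guess.
    raise ValueError(f"no guideline for task type: {task_type!r}")

print(recommend_training("classification"))
print(recommend_training("documentation"))
```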
The methodology, all 12 datasets tested, and raw numbers are available in the full analysis.
📖 Read the full source: r/LocalLLaMA