When RLVR Helps Small Fine-Tuned Models: A 12-Dataset Analysis

A recent experiment tested whether adding a reinforcement learning stage with verifiable rewards (RLVR) on top of supervised fine-tuning (SFT) gives small language models (1.7B parameters) a measurable benefit. The team ran a controlled comparison across 12 datasets to determine when the extra stage helps and when it does not.
Key Findings
The results split cleanly by task type:
- Text generation tasks (QA, documentation, PII redaction): +2.0 percentage points average improvement. Every single dataset in this category showed improvement.
- Structured tasks (classification, function calling): -0.7 percentage points average. Two datasets in this category actually regressed.
Why This Pattern Emerges
The researchers explain that once a fine-tuned model already gets most structured outputs correct, GRPO (Group Relative Policy Optimization) produces near-zero gradients. GRPO scores each sampled output relative to the others in its group, so when almost every sample earns the same reward there is no learning signal left for the reinforcement learning stage to work with.
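The vanishing signal can be illustrated with a minimal sketch. It assumes the commonly described GRPO formulation, in which each sampled completion's reward is normalized by the mean and standard deviation of its group; the source does not state the exact variant used, and the function name and `eps` value here are illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (assumed GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]

# Structured task after SFT: every sampled output is already correct.
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Generative task: samples still differ in quality, so the signal survives.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

print(saturated)
print(mixed)
```

In the saturated group every advantage is zero, so the policy gradient built from those advantages is zero as well. The mixed group still yields positive and negative advantages for the update to use.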
For generative tasks, the output space is large enough that RL continues to find improvements that SFT misses, particularly when the reward measures semantic correctness rather than exact string matching.
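The difference between the two reward styles can be sketched as follows. The source does not say which reward function the experiment used; token-overlap F1 appears here only as a crude, hypothetical stand-in for a semantic reward, and both function names are illustrative.

```python
from collections import Counter

def exact_match_reward(pred: str, ref: str) -> float:
    """Reward 1.0 only when the strings are identical after trimming whitespace."""
    return 1.0 if pred.strip() == ref.strip() else 0.0

def token_f1_reward(pred: str, ref: str) -> float:
    """Token-overlap F1: a rough proxy for semantic agreement (assumption, not the study's reward)."""
    p = pred.lower().split()
    r = ref.lower().split()
    if not p or not r:
        return 1.0 if p == r else 0.0
    # Multiset intersection counts tokens shared by both answers
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)

pred = "Paris is the capital"
ref = "The capital is Paris"
print(exact_match_reward(pred, ref), token_f1_reward(pred, ref))
```

A reworded but correct answer earns nothing under exact matching, while the graded reward still separates it from a wrong answer. That spread in rewards is what gives the RL stage something to optimize.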
Practical Decision Rule
The study provides a simple guideline for developers:
- Classification or strict function calling → Use SFT only
- QA, documentation, extraction tasks → Add RLVR on top of SFT
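As a sketch, the guideline above can be written as a small lookup. The task-type labels, set names, and function name are hypothetical; the source gives the categories but no code.

```python
# Hypothetical task-type labels based on the categories named in the guideline.
SFT_ONLY_TASKS = {"classification", "function_calling"}
SFT_PLUS_RLVR_TASKS = {"qa", "documentation", "extraction"}

def recommend_pipeline(task_type: str) -> str:
    """Return the training recipe the guideline suggests for a task type."""
    key = task_type.strip().lower()
    if key in SFT_ONLY_TASKS:
        return "sft"
    if key in SFT_PLUS_RLVR_TASKS:
        return "sft+rlvr"
    # The guideline covers only the categories above; anything else is left undecided.
    raise ValueError(f"no guideline for task type: {task_type!r}")

print(recommend_pipeline("classification"))
print(recommend_pipeline("qa"))
```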
The methodology, all 12 datasets tested, and raw numbers are available in the full analysis.
📖 Read the full source: r/LocalLLaMA