Self-Supervised Fine-Tuning on Own Mistakes Boosts Small Models to 80% on HumanEval

A developer on r/LocalLLaMA implemented a self-supervised training loop in which a small language model generates its own coding problems, attempts solutions, and is fine-tuned on the pairs the Python interpreter has verified. The loop applies the key insight of the DeepSeek-R1 paper, that models can improve through verifiable rewards, without any human-labeled data.
Method
The base model (starting with Qwen 2.5 7B) was prompted to invent a coding problem and a few small tests. It then solved the same problem multiple times. The Python interpreter acted as the sole judge: pairs of (broken attempt, working attempt) were saved. Fine-tuning was performed on these self-mined corrections. No human-written code was used in training.
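The interpreter-as-judge step can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function names `passes_tests` and `mine_pairs` are hypothetical, the model's attempts are assumed to arrive as plain strings, and the rule of pairing every failing attempt with every passing one is an assumption (the post does not say how pairs were matched).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution: str, tests: str, timeout: float = 10.0) -> bool:
    """Run solution + tests in a fresh interpreter; exit code 0 counts as a pass.

    Assumption: tests are plain `assert` statements appended after the solution.
    Model-written code is untrusted, so a real setup should sandbox this call.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hanging attempt is treated as a failure.
            return False
    return result.returncode == 0


def mine_pairs(attempts: list[str], tests: str) -> list[tuple[str, str]]:
    """Return (broken attempt, working attempt) pairs for one problem."""
    verdicts = [(attempt, passes_tests(attempt, tests)) for attempt in attempts]
    working = [a for a, ok in verdicts if ok]
    broken = [a for a, ok in verdicts if not ok]
    # Hypothetical pairing rule: every broken attempt with every working one.
    return [(bad, good) for bad in broken for good in working]
```

A problem only yields training data when the attempts disagree: if every attempt passes or every attempt fails, `mine_pairs` returns an empty list.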
Results
- Qwen 2.5 7B base: 25 → 112 on HumanEval (+87 problems) after fixing a grader bug that truncated function outputs.
- Qwen 2.5 14B: Mined 100 pairs, trained in 95 minutes on an H100 ($3.50 in credits). Scored within 4 points of the RLHF-tuned version released by the same company.
- Llama 3.2 3B: Mined 32 pairs; HumanEval 39 → 43. Confirms transfer across architectures.
- Qwen 2.5 Coder 7B: Already code-specialized, yet still improved: HumanEval 83 → 87, MBPP 122 → 124.
- Qwen 3 4B: HumanEval 79 → 106 (+27), MBPP 135 → 148.
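For scale, HumanEval contains 164 problems, so the raw counts can be converted to pass rates. The snippet below does this only for the Qwen 2.5 7B figures, which the 25/164 in the control experiment confirms are problem counts; whether the smaller numbers in the list are counts or percentages is not stated, so they are left out. The helper name `pass_rate` is illustrative.

```python
HUMANEVAL_PROBLEMS = 164  # size of the HumanEval benchmark


def pass_rate(solved: int, total: int = HUMANEVAL_PROBLEMS) -> float:
    """Percentage of benchmark problems solved."""
    return 100.0 * solved / total


for label, solved in [("before", 25), ("after", 112)]:
    print(f"Qwen 2.5 7B base {label}: {solved}/{HUMANEVAL_PROBLEMS} = {pass_rate(solved):.1f}%")
# Qwen 2.5 7B base before: 25/164 = 15.2%
# Qwen 2.5 7B base after: 112/164 = 68.3%
```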
Control Experiment
To verify the signal wasn't from generic training, the author built fake pairs with random garbage code that didn't pass any tests. Training on those produced zero lift (25/164, same as base). The improvement is specifically from learning on self-generated mistakes and corrections.
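One way such a control set could be built is sketched below. Everything here is an assumption for illustration: the token vocabulary, the helper names, and the in-process check are hypothetical, and the post does not describe how the garbage code was generated. The vocabulary deliberately contains no loop keywords, so executing a garbage string cannot hang.

```python
import random

# Hypothetical token vocabulary; no loop keywords, so garbage cannot run forever.
VOCAB = ["x", "y", "0", "1", "+", "-", "=", "(", ")", ":", "return", "if", "def", "foo"]


def random_garbage(rng: random.Random, length: int = 12) -> str:
    """Join random tokens into a single line of (almost always invalid) code."""
    return " ".join(rng.choice(VOCAB) for _ in range(length))


def passes_in_process(code: str, tests: str) -> bool:
    """Execute code, then the assert-style tests; any exception counts as a failure."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(tests, namespace)
    except Exception:
        return False
    return True


def make_control_pairs(tests: str, n_pairs: int, seed: int = 0) -> list[tuple[str, str]]:
    """Build fake (attempt, "fix") pairs in which neither side passes the tests."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        first, second = random_garbage(rng), random_garbage(rng)
        if not passes_in_process(first, tests) and not passes_in_process(second, tests):
            pairs.append((first, second))
    return pairs
```

Fine-tuning on pairs like these keeps the training procedure the same while removing the verified-correction signal, which is the comparison the control is meant to isolate.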
Practical Details
The initial attempt failed because the grader stopped early, cutting model outputs in half. Fixing the grader was critical. The entire setup ran on a 24GB MacBook and a RunPod account. The code and training scripts are presumably shared in the Reddit post.
Who It's For
Developers and researchers working with small language models who want to bootstrap code reasoning without human annotations.
📖 Read the full source: r/LocalLLaMA