Self-Supervised Fine-Tuning Lifts Qwen 2.5 7B from 25 to 112 on HumanEval

A developer on r/LocalLLaMA implemented a self-supervised training loop in which a small language model writes its own coding problems, attempts solutions, and is then fine-tuned on pairs of a failed attempt and a passing attempt, with the Python interpreter deciding which is which. The loop applies the key idea from the DeepSeek-R1 paper, that models can improve through verifiable rewards, without any human-labeled data.

Method

The base model (starting with Qwen 2.5 7B) was prompted to invent a coding problem and a few small tests, then to solve that same problem multiple times. The Python interpreter acted as the sole judge: whenever one attempt failed the tests and another passed, the (broken attempt, working attempt) pair was saved. Fine-tuning was performed on these self-mined corrections. No human-written code was used in training.
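The mining step can be sketched as follows. This is a minimal illustration, not the author's code: the function names `passes_tests` and `mine_pairs` are hypothetical, model sampling is left out (attempts arrive as plain strings), and pairing every failing attempt with every passing one is an assumption, since the post does not say how pairs were selected.

```python
import subprocess
import sys
import tempfile
from itertools import product
from pathlib import Path


def passes_tests(solution_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Return True if solution + tests exit cleanly in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution_src + "\n\n" + test_src + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hanging attempt counts as a failure
    return result.returncode == 0


def mine_pairs(attempts: list[str], test_src: str) -> list[tuple[str, str]]:
    """Pair each failing attempt with each passing attempt (assumed pairing rule)."""
    verdicts = [(attempt, passes_tests(attempt, test_src)) for attempt in attempts]
    working = [a for a, ok in verdicts if ok]
    broken = [a for a, ok in verdicts if not ok]
    return list(product(broken, working))


# Toy problem standing in for a model-invented one.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
attempts = [
    "def add(a, b):\n    return a - b",  # fails the tests
    "def add(a, b):\n    return a + b",  # passes the tests
]
pairs = mine_pairs(attempts, tests)
print(len(pairs))
```

Running each attempt in a separate process keeps a crashing or looping attempt from taking down the mining loop. It is not a security sandbox, so untrusted model output would still need isolation beyond this sketch.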

Results

Qwen 2.5 7B base: 25 → 112 on HumanEval (+87 problems) after fixing a grader bug that truncated function outputs.
Qwen 2.5 14B: 100 pairs mined, trained in 95 minutes on an H100 ($3.50 in credits). The result scored within 4 points of the same company's RLHF-tuned version of the model.
Llama 3.2 3B: 32 pairs mined; HumanEval 39 → 43. This suggests the method transfers across model families.
Qwen 2.5 Coder 7B: Already code-specialized, yet still improved: HumanEval 83 → 87, MBPP 122 → 124.
Qwen 3 4B: HumanEval 79 → 106 (+27), MBPP 135 → 148.
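For comparison with scores reported as percentages, the counts above can be converted to pass rates. The sketch below assumes every HumanEval figure is a count of solved problems out of the benchmark's 164; the source states this explicitly only for the Qwen 2.5 7B base numbers (the "+87 problems" and "25/164" figures). MBPP is left out because the post does not give its denominator.

```python
HUMANEVAL_PROBLEMS = 164  # benchmark size; matches the 25/164 figure in the text

# (before, after) solved counts as reported; treating them as counts is an assumption.
humaneval_counts = {
    "Qwen 2.5 7B base": (25, 112),
    "Llama 3.2 3B": (39, 43),
    "Qwen 2.5 Coder 7B": (83, 87),
    "Qwen 3 4B": (79, 106),
}


def pass_rate(solved: int, total: int = HUMANEVAL_PROBLEMS) -> float:
    """Percentage of problems solved."""
    return 100.0 * solved / total


for model, (before, after) in humaneval_counts.items():
    print(f"{model}: {pass_rate(before):.1f}% -> {pass_rate(after):.1f}% "
          f"({after - before:+d} problems)")
```

Under that assumption, 112 of 164 problems works out to roughly 68%.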

Control Experiment

To check that the gain did not come from fine-tuning in general, the author built fake pairs out of random garbage code that passed no tests. Training on those produced zero lift (25/164, same as base), which indicates the improvement comes specifically from learning on self-generated mistakes and their corrections.
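One way such a control set could be produced is sketched below. How the author actually generated the garbage is not described, so everything here is an assumption: the helper names, the character alphabet, and the choice to keep only snippets that fail to compile (which guarantees they cannot pass any test).

```python
import random
import string


def compiles(src: str) -> bool:
    """True if the source is at least syntactically valid Python."""
    try:
        compile(src, "<control>", "exec")
    except (SyntaxError, ValueError):
        return False
    return True


def garbage_snippet(rng: random.Random, n_lines: int = 4, line_len: int = 24) -> str:
    """Random characters laid out like code; resample until it does not compile."""
    alphabet = string.ascii_letters + string.digits + " ()=+:"
    while True:
        lines = ["".join(rng.choices(alphabet, k=line_len)) for _ in range(n_lines)]
        snippet = "\n".join(lines)
        if not compiles(snippet):
            return snippet


def make_control_pairs(n_pairs: int, seed: int = 0) -> list[tuple[str, str]]:
    """Fake (broken, 'fixed') pairs in which neither side is working code."""
    rng = random.Random(seed)
    return [(garbage_snippet(rng), garbage_snippet(rng)) for _ in range(n_pairs)]


control = make_control_pairs(100)
print(len(control))
```

Keeping the pair count and format identical to the real training set means the only difference between the two runs is whether the pairs carry a genuine correction.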

Practical Details

The initial attempt failed because the grader stopped early, cutting model outputs in half. Fixing the grader was critical. The entire setup ran on a 24GB MacBook and a RunPod account. The code and training scripts are presumably shared in the Reddit post.
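The post does not say exactly how the grader truncated outputs. As a purely hypothetical illustration of this class of bug, the sketch below cuts a completion at a stop string: treating a blank line as the end of the answer chops a function in half, while stopping only where new top-level code begins keeps the whole body. The stop lists and the sample completion are assumptions made for the example.

```python
def truncate_at_first(text: str, stops: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Hypothetical buggy setting: any blank line ends the completion.
BUGGY_STOPS = ["\n\n"]
# Hypothetical fix: stop only where new top-level code starts.
FIXED_STOPS = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]

# A function body containing a blank line, followed by trailing chatter.
completion = (
    "    total = 0\n"
    "\n"
    "    for x in xs:\n"
    "        total += x\n"
    "    return total\n"
    "\n"
    "print(sum_list([1, 2]))\n"
)

buggy = truncate_at_first(completion, BUGGY_STOPS)
fixed = truncate_at_first(completion, FIXED_STOPS)

print("return total" in buggy)  # the buggy cut loses the return statement
print("return total" in fixed)  # the fixed cut keeps the full body
```

A grader with the first setting would mark correct solutions as failures, which would depress every score it reports.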