Fine-tuned Qwen2.5-7B to 96% of Claude Haiku with $3 and Zero Human Labelers

✍️ OpenClawRadar · 📅 Published: June 11, 2026

A developer fine-tuned Qwen2.5-7B to reach 96% of Claude Haiku's composite score on a domain-specific decision-reasoning task, spending only about $3 on API calls and using no human labelers. The method, called DV-DPO (Decision-Validated Direct Preference Optimization), generates its own training signal by running a multi-voice adversarial council.

How DV-DPO Works

The pipeline runs a 3-voice council on each decision question and produces a synthesis. The two losing voices then cross-examine that synthesis. If the synthesis is revised under this adversarial pressure, a DPO pair is formed: the post-revision version is the chosen response and the pre-revision version is the rejected response. If the synthesis holds, no pair is created. This filter is meant to ensure that only genuine reasoning errors produce training signal, not format preferences or sampling variance.
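The pair-formation rule can be illustrated with a minimal Python sketch. Everything here is hypothetical: the names `PreferencePair`, `build_pair`, `synthesize`, and `cross_examine` are placeholders, and deciding that a synthesis was "revised" by comparing strings is an assumption, since the source does not say how revisions are detected. The `prompt`/`chosen`/`rejected` field names follow a common convention for preference datasets.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PreferencePair:
    prompt: str    # the decision question
    chosen: str    # synthesis after cross-examination (post-revision)
    rejected: str  # synthesis before cross-examination (pre-revision)


def build_pair(
    question: str,
    synthesize: Callable[[str], str],          # hypothetical: council -> synthesis
    cross_examine: Callable[[str, str], str],  # hypothetical: losing voices -> final synthesis
) -> Optional[PreferencePair]:
    """Return a DPO pair only if the synthesis changed under cross-examination."""
    before = synthesize(question)
    after = cross_examine(question, before)
    # Assumption: "revised" means the text differs after whitespace is trimmed.
    if after.strip() == before.strip():
        return None  # the synthesis held, so no training signal is emitted
    return PreferencePair(prompt=question, chosen=after, rejected=before)
```

In a real pipeline both callables would wrap LLM calls, and the revision check would likely need to be semantic rather than textual so that rewording alone does not count as a revision.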

Results

  • 1,040 training pairs generated total (~$3 at Haiku rates)
  • Head-to-head vs Claude Haiku: Format 100%, Commits 100%, Context 89%, Composite 96%
  • Latency: 11s on T4 GPU (4-bit quantized) vs Haiku's 3s
  • Adversarial failure rate: 2% on 96 targeted questions
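A quick back-of-the-envelope check puts these figures in perspective. The sketch below assumes the composite is an unweighted mean of the three sub-scores; the source does not state the formula, although this reading does reproduce the reported 96%. The variable names are illustrative only.

```python
# Figures reported in the results list.
pairs = 1040
total_cost_usd = 3.0   # approximate, per the source
scores = {"format": 100, "commits": 100, "context": 89}
failure_rate = 0.02
targeted_questions = 96
local_latency_s, haiku_latency_s = 11, 3

# Cost per generated DPO pair, roughly a third of a cent.
cost_per_pair = total_cost_usd / pairs

# Assumption: composite = unweighted mean of the sub-scores.
composite = sum(scores.values()) / len(scores)

# A 2% failure rate on 96 questions is about two failing questions.
approx_failures = failure_rate * targeted_questions

# The local 4-bit model is slower than the API by this factor.
latency_ratio = local_latency_s / haiku_latency_s

print(f"cost per pair: ${cost_per_pair:.4f}")
print(f"composite: {composite:.1f}%")
print(f"approx. failures: {approx_failures:.1f}")
print(f"latency ratio: {latency_ratio:.1f}x")
```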

Autonomous Improvement Loop

The system now runs an automated cycle: failure_detector → auto_red_team → DPO pairs → retrain → redeploy → eval. Version 5 pairs are accumulating. The fine-tuned model is available as a GGUF file ready for Ollama.
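The automated cycle described above could be wired together roughly as follows. This is a sketch under stated assumptions, not the author's implementation: each stage is passed in as a hypothetical callable, and the `min_pairs` threshold that skips retraining when too few new pairs exist is an invented safeguard that the source does not mention.

```python
from typing import Any, Callable, List, Optional


def run_cycle(
    detect_failures: Callable[[], List[Any]],     # stands in for failure_detector
    red_team: Callable[[List[Any]], List[Any]],   # stands in for auto_red_team -> DPO pairs
    retrain: Callable[[List[Any]], Any],          # DPO fine-tuning step
    redeploy: Callable[[Any], None],              # e.g. export and serve the new model
    evaluate: Callable[[Any], float],             # returns an eval score
    min_pairs: int = 1,                           # assumption: skip tiny batches
) -> Optional[float]:
    """Run one pass of detect -> red-team -> retrain -> redeploy -> eval."""
    failures = detect_failures()
    pairs = red_team(failures)
    if len(pairs) < min_pairs:
        return None  # not enough new signal, so keep the current model
    model = retrain(pairs)
    redeploy(model)
    return evaluate(model)
```

Passing the stages in as functions keeps the loop testable with stubs before any GPU time is spent on retraining.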

Who This Is For

Developers building domain-specific reasoning agents who want to move from pay-per-call APIs to a local fine-tuned model without expensive human annotation.

📖 Read the full source: r/LocalLLaMA
