Fine-tuning Phi-4-mini by training only LayerNorm parameters fails to improve performance

Experimental setup and methodology
The experiment fine-tuned Phi-4-mini-instruct (3.8B parameters, 32 layers) by training only its normalization-layer parameters, an approach the author calls BALLAST. Training ran on a Mac Studio M3 Ultra with 256GB of memory, using MLX through mlx_lm's built-in train() function at 97% GPU utilization. Runs were tracked with a self-hosted W&B instance.
Important note: Phi-4-mini uses RMSNorm, not full LayerNorm, so each norm has only a γ scale vector and no β bias. The author acknowledges that the published papers reporting positive results used models with both γ and β parameters, a difference that likely matters more than initially realized.
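To make the γ-only point concrete, here is a minimal pure-Python sketch of the two normalizations as they are usually defined. It is not code from the experiment; the function names and the eps value are illustrative assumptions.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: divide by the root-mean-square, then apply a learned scale (gamma only)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the std, then apply scale (gamma) and shift (beta)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [g * (v - mean) / std + b for v, g, b in zip(x, gamma, beta)]

x = [3.0, 4.0]
ones = [1.0, 1.0]
rms_out = rms_norm(x, ones)               # only gamma is trainable here
ln_out = layer_norm(x, ones, [0.5, 0.5])  # gamma and beta are both trainable
print(rms_out, ln_out)
```

With RMSNorm, training the norm can only rescale each channel. The additive shift that β provides in full LayerNorm is not available.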
Benchmark results
Baseline scores for vanilla Phi-4-mini (no training):
- HumanEval pass@1: 0.646
- MBPP pass@1: 0.558
- MMLU acc: 0.667
- ARC-Challenge acc_norm: 0.595
- HellaSwag acc_norm: 0.728
- MedQA acc: 0.545
- GSM8K exact_match: 0.813
Experiment 1: Python domain
Trained on 10K files from The Stack with LR=5e-5 for 3 epochs:
- BALLAST (196K params): Loss 1.39, HumanEval 0.616 (-0.030), MBPP 0.526 (-0.032)
- LoRA-Match (180K params): Loss 1.30, HumanEval 0.634 (-0.012), MBPP 0.536 (-0.022)
- LoRA-Std (11.5M params): Loss 1.07, HumanEval 0.439 (-0.207), MBPP 0.372 (-0.186)
LoRA-Std showed classic overfitting: its 11.5M parameters memorized the 10K files instead of learning generalizable patterns. An additional BALLAST run at LR=1e-4 saw loss drop to 1.31, then climb back above 1.44 by iteration 2300.
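The deltas in parentheses are each run's score minus the vanilla baseline. A short sketch that recomputes them from the numbers reported above (the dictionary layout and the function name are illustrative, not part of the original write-up):

```python
# Baseline and post-training scores as reported in this article.
BASELINE = {"HumanEval": 0.646, "MBPP": 0.558}
RUNS = {
    "BALLAST": {"HumanEval": 0.616, "MBPP": 0.526},
    "LoRA-Match": {"HumanEval": 0.634, "MBPP": 0.536},
    "LoRA-Std": {"HumanEval": 0.439, "MBPP": 0.372},
}

def deltas(scores, baseline):
    """Return score minus baseline for each benchmark, rounded to 3 decimals."""
    return {name: round(scores[name] - baseline[name], 3) for name in scores}

for run, scores in RUNS.items():
    print(run, deltas(scores, BASELINE))
```

Every run lands below baseline on both coding benchmarks. The two small-parameter runs lose a few points, while LoRA-Std loses roughly twenty.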
Experiment 2: Medical raw text
Trained on 10K PubMed abstracts with LR=5e-5 for 3 epochs:
- BALLAST: MedQA 0.528 (-0.017)
- LoRA-Match: MedQA 0.546 (+0.001)
- LoRA-Std: MedQA 0.465 (-0.080)
The author calls this setup a rookie mistake: next-token prediction on raw PubMed abstracts doesn't help with MedQA, which tests clinical reasoning through multiple-choice vignettes.
Experiment 3: Medical instruction QA
The data format was fixed by switching to 10K MedMCQA questions, trained with LR=1e-5 for 3 epochs. Format: "Question: ... A) X B) Y C) Z D) W Answer: B"
- BALLAST: MedQA 0.538 (-0.007)
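A minimal sketch of how one multiple-choice record could be rendered into the format quoted above. The function signature, the single-space separators, and the sample question are assumptions for illustration; the source does not show its preprocessing code or say how fields were delimited.

```python
LETTERS = "ABCD"

def format_example(question, options, answer_index):
    """Render one four-option question as 'Question: ... A) .. B) .. C) .. D) .. Answer: X'."""
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly four options")
    choices = " ".join(f"{letter}) {text}" for letter, text in zip(LETTERS, options))
    return f"Question: {question} {choices} Answer: {LETTERS[answer_index]}"

# Hypothetical record, not taken from MedMCQA.
sample = format_example(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"],
    answer_index=1,
)
print(sample)
```

Ending each training string with the answer letter puts the target in the same position a multiple-choice evaluation scores, which is the mismatch Experiment 2 ran into with raw abstracts.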
Learning rate testing summary
- LR=1e-4 on Python: Overshot; loss fell to 1.31, then climbed back above 1.44 by iteration 2300
- LR=5e-5 on Python: Flat, slight degradation on benchmarks
- LR=5e-5 on Medical (raw text): Flat, slight degradation on MedQA
- LR=1e-5 on Medical (instruction QA): Flat, slight degradation on MedQA
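The LR=1e-4 pattern, where loss bottoms out and then climbs, can be flagged with a simple check on the loss history. This is a hypothetical helper rather than anything from the author's setup. The tolerance is an arbitrary assumption, and only 1.31 and 1.44 in the sample history come from the source; the other values are placeholders.

```python
def rebounded(losses, tolerance=0.05):
    """Return True if the latest loss is more than `tolerance` above the lowest loss seen."""
    if len(losses) < 2:
        return False
    return losses[-1] - min(losses) > tolerance

# Illustrative history: only 1.31 and 1.44 are reported values.
history = [1.60, 1.45, 1.31, 1.38, 1.44]
print(rebounded(history[:3]))  # still at the minimum
print(rebounded(history))      # climbed well above the minimum
```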
Key findings
Training only the norm γ values didn't improve any benchmark tested: not Python, not medical QA, and not at any of the learning rates tried. The author concludes that transformers already route information dynamically through attention, so there's no point in trying to use LayerNorm as an additional relational directionality layer. The experiment trained only 196K parameters (0.005% of the model), compared with 11.5M for standard LoRA on Phi-4-mini.
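The 196K and 0.005% figures can be reproduced with simple arithmetic. The sketch below assumes a hidden size of 3072 and two norms per transformer block. Neither is stated in the source, and whether the final model-level norm was also trained isn't stated either.

```python
HIDDEN_SIZE = 3072      # assumption: not stated in the source
NUM_LAYERS = 32         # from the source
NORMS_PER_LAYER = 2     # assumption: one norm before attention, one before the MLP
TOTAL_PARAMS = 3.8e9    # from the source (3.8B)

# Each RMSNorm holds one gamma value per hidden channel and no bias.
trainable = HIDDEN_SIZE * NUM_LAYERS * NORMS_PER_LAYER
fraction_pct = 100 * trainable / TOTAL_PARAMS

print(f"trainable gamma values: {trainable:,}")
print(f"share of model: {fraction_pct:.4f}%")
```

Under these assumptions the count is 196,608, which matches the reported 196K and is about 0.005% of the model.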
📖 Read the full source: r/LocalLLaMA