Gemma 4 31B outperforms larger models on FoodTruck Bench

By OpenClawRadar | Published: April 21, 2026

Benchmark results and analysis

Gemma 4 31B placed third on FoodTruck Bench, ahead of several larger and more established models. According to the Reddit discussion, it beat GLM 5, Qwen 3.5 397B, and all Claude Sonnet variants.

FoodTruck Bench tests language models on complex, multi-step planning tasks. The original poster speculates that Gemma 4's result reflects better handling of long-horizon tasks than previous models that failed to complete the benchmark. Specifically, the model appears to follow its own earlier advice when planning subsequent steps in the task sequence.

This result is notable because Gemma 4 31B is significantly smaller than some of the models it outperformed. Qwen 3.5 397B, for example, has roughly 12.8 times as many parameters as Gemma 4 31B (397B versus 31B). The result suggests that model architecture and training approach may matter as much as parameter count for certain types of reasoning tasks.
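The size comparison is simple arithmetic, sketched below for reference. The parameter counts are read directly from the model names (an assumption: the names are taken as total parameter counts in billions), and the helper `size_ratio` is a hypothetical name used only for this illustration.

```python
# Parameter counts in billions, read from the model names in the article.
# Assumption: the names reflect total parameter counts.
GEMMA_4_PARAMS_B = 31
QWEN_3_5_PARAMS_B = 397


def size_ratio(larger_b: float, smaller_b: float) -> float:
    """Return how many times as many parameters the larger model has."""
    return larger_b / smaller_b


ratio = size_ratio(QWEN_3_5_PARAMS_B, GEMMA_4_PARAMS_B)
print(f"Qwen 3.5 397B has about {ratio:.1f}x the parameters of Gemma 4 31B")
```

The ratio compares headline sizes only; it says nothing about how many parameters each model uses per token or how either was trained.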

The benchmark's scenarios are practical planning problems that require maintaining context over extended sequences of actions. That design makes it particularly relevant to developers building AI agents that must execute multi-step tasks in real-world applications.

📖 Read the full source: r/LocalLLaMA
