Bullshit Benchmark Tests LLM Nonsense Resistance: Results

What the Bullshit Benchmark Measures

The Bullshit Benchmark tests whether large language models (LLMs) identify and push back on nonsensical prompts rather than confidently answering them. It measures how readily a model goes along with obvious nonsense, addressing the concern that a model trying to be helpful may hallucinate an answer instead of flagging a flawed prompt.
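To make the idea of "how readily a model goes along with nonsense" concrete, here is a minimal sketch of one way such a measurement could be tallied. It is not the benchmark's actual scoring method, which the source does not describe. The three labels, the function names, and the placeholder model names and data are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical labels for how a model handled a nonsense prompt.
# These are assumptions for illustration, not the benchmark's own categories.
PUSHBACK = "pushback"   # model flagged the prompt as nonsensical
PARTIAL = "partial"     # model hedged but still produced an answer
ACCEPTED = "accepted"   # model answered as if the prompt made sense


def pushback_rate(labels):
    """Return the fraction of responses labeled as clear pushback."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return counts[PUSHBACK] / len(labels)


def summarize(results):
    """Map each model name to its pushback rate."""
    return {model: pushback_rate(labels) for model, labels in results.items()}


# Made-up placeholder data, not real benchmark results.
example = {
    "model_a": [PUSHBACK, PUSHBACK, PARTIAL, PUSHBACK],
    "model_b": [ACCEPTED, PARTIAL, PUSHBACK, ACCEPTED],
}

if __name__ == "__main__":
    for model, rate in summarize(example).items():
        print(f"{model}: {rate:.0%} pushback")
```

A higher rate in this sketch means the model called out nonsense more often. A real evaluation would also need a reliable way to assign the labels, which is the hard part and is left out here.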

Key Benchmark Results

According to the source material, Claude models detect nonsense significantly better than Gemini models, which supports the intuition that Claude is stronger at this specific capability.

In one example from the benchmark, Claude correctly identified a nonsense question while Gemini did not: Gemini 3.1 Pro missed the obvious nonsense even with high thinking effort enabled and generated a nonsense answer instead.

The source suggests that Anthropic's post-training approach contributes to Claude's better performance. It notes that LLMs naturally tend toward superficial associative thinking, which generates spurious relationships between concepts, and that Anthropic appears to have addressed this tendency in its post-training pipeline.

Why This Matters for AI Coding Agents

For developers using AI coding assistants, a model's ability to recognize nonsense prompts matters. A model that confidently answers a nonsensical question instead of pushing back can mislead users and produce incorrect code or explanations. This benchmark provides a concrete way to compare this specific safety behavior across models.

You can view the complete benchmark results at https://petergpt.github.io/bullshit-benchmark/viewer/index.html.

📖 Read the full source: r/ClaudeAI