Bullshit Benchmark Tests LLM Resistance to Nonsensical Prompts

What the Bullshit Benchmark Measures
The Bullshit Benchmark tests whether large language models (LLMs) identify and push back on nonsensical prompts rather than confidently answering them. It measures how readily a model goes along with obvious nonsense, addressing the concern that a model may induce its own hallucinations by trying to be helpful instead of flagging a flawed prompt.
Key Benchmark Results
According to the source, Claude models perform significantly better than Gemini models at detecting nonsense, which supports the intuition that Claude is stronger at this specific capability.
In one example from the benchmark, Claude identified a nonsense question that Gemini 3.1 Pro failed to detect even with high thinking effort enabled; Gemini generated a nonsense answer instead.
The source suggests that Anthropic's post-training approach contributes to this gap. It notes that LLMs naturally tend toward superficial associative thinking, which generates spurious relationships between concepts, and that Anthropic appears to have addressed the issue in its post-training pipeline.
Why This Matters for AI Coding Agents
For developers using AI coding assistants, a model's ability to recognize nonsense prompts is crucial. When models confidently answer nonsensical questions instead of pushing back, they can misguide users and generate incorrect code or explanations. This benchmark provides a concrete way to evaluate this specific safety behavior across different models.
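To make the idea of comparing models on this behavior concrete, here is a minimal sketch of how graded responses could be turned into a per-model pushback rate. It is not the benchmark's actual scoring method: the grade labels, the function names, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical grade labels; the real benchmark's rubric may differ.
PUSHBACK = "pushback"   # model flags the prompt as nonsensical
PARTIAL = "partial"     # model hedges but still answers
ACCEPTED = "accepted"   # model answers the nonsense at face value


def pushback_rate(grades):
    """Return the fraction of responses graded as clear pushback."""
    if not grades:
        raise ValueError("grades must not be empty")
    counts = Counter(grades)
    return counts[PUSHBACK] / len(grades)


def compare_models(graded):
    """Rank models by pushback rate, highest first."""
    rates = {model: pushback_rate(g) for model, g in graded.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up grades for two placeholder models, purely for illustration.
example = {
    "model_a": [PUSHBACK, PUSHBACK, PARTIAL, PUSHBACK],
    "model_b": [ACCEPTED, PARTIAL, PUSHBACK, ACCEPTED],
}
ranking = compare_models(example)
```

In this sketch only a clear pushback counts toward the score; whether partial pushback should earn credit is a design choice a real harness would need to make explicitly.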
You can view the complete benchmark results at https://petergpt.github.io/bullshit-benchmark/viewer/index.html.
📖 Read the full source: r/ClaudeAI