EsoLang-Bench: A Coding Benchmark Using Esoteric Languages to Test LLM Reasoning

EsoLang-Bench is a new coding benchmark designed to test whether large language models can genuinely reason through problems or are simply pattern-matching against training data. To separate the two, it poses its problems in esoteric programming languages that are barely represented in training data.
Benchmark Design
The benchmark uses five esoteric programming languages: Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare. These languages were chosen because they appear very rarely in typical pretraining corpora. The problems are the same algorithmic problems as HumanEval, spanning the same difficulty range, but solutions must be written in these esoteric languages.
Testing Methodology
Researchers tested five models: GPT-5.2, o4-mini, Gemini 3 Pro, Qwen3-235B, and Kimi K2. They used five prompting strategies, including:
- Self-scaffolding
- Coder-critic pairs
- ReAct pipeline
Results
The best single result was 11.2% on Befunge-98 with self-scaffolding. Medium, Hard, and Extra-Hard difficulty problems stayed at 0% across all models, languages, and strategies. Few-shot prompting gave only +0.8 percentage points on average, which researchers describe as statistically indistinguishable from noise.
Agentic systems like Claude Code and Codex performed 2-3x better than non-agentic approaches, but the improvement came primarily from sharper feedback loops and context management, not from actual reasoning transfer.
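To make "feedback loop" concrete, here is a minimal sketch of the generate, run, retry cycle that agentic tools typically wrap around a model. It is an illustration under stated assumptions, not the harness used in the benchmark: `feedback_loop`, `generate`, and `evaluate` are hypothetical names, and passing only the most recent error message back to the generator stands in for a very crude form of context management.

```python
from typing import Callable, Optional, Tuple


def feedback_loop(
    generate: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for a program, run it, and feed the error back until it passes.

    `generate` (hypothetical) takes the latest feedback, or None on the
    first round, and returns a candidate program.
    `evaluate` (hypothetical) returns (passed, feedback_message).
    Returns (passing_program_or_None, rounds_used).
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)
        passed, feedback = evaluate(candidate)
        if passed:
            return candidate, round_no
    # No candidate passed within the round budget.
    return None, max_rounds
```

In a real setup `generate` would call a model and `evaluate` would run the candidate in an interpreter against test cases. The point of the sketch is that a loop like this can raise pass rates through retries alone, which is why the gain is not by itself evidence of reasoning transfer.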
Error Analysis
The error breakdown reveals interesting patterns:
- On Brainfuck (which has some online presence), models could produce valid syntax but failed on logic
- On Whitespace (which has almost no training data), models couldn't produce valid programs at all
The failure mode therefore tracks data availability: with some pretraining exposure, models get the syntax right but not the logic; with essentially none, they fail at the syntax itself.
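The distinction between a syntax failure and a logic failure can be sketched with a small Brainfuck checker. This is an illustrative sketch, not the benchmark's grading code: the category names (`syntax_error`, `runtime_error`, `logic_error`, `pass`), the 30,000-cell tape, and the step limit are all assumptions made for the example.

```python
def brackets_balanced(program: str) -> bool:
    """Brainfuck's only syntax rule: every '[' needs a matching ']'."""
    depth = 0
    for ch in program:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def run_brainfuck(program: str, stdin: str = "", max_steps: int = 100_000) -> str:
    """Interpret a bracket-balanced program and return its output."""
    jumps, stack = {}, []
    for i, ch in enumerate(program):  # precompute matching bracket positions
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape = [0] * 30_000  # assumed tape size
    ptr = pc = steps = 0
    inp = iter(stdin)
    out = []
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise TimeoutError("step limit exceeded")
        ch = program[pc]
        if ch in "<>":
            ptr += 1 if ch == ">" else -1
            if not 0 <= ptr < len(tape):
                raise IndexError("data pointer left the tape")
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256  # EOF reads as 0
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1
    return "".join(out)


def classify(program: str, stdin: str, expected: str) -> str:
    """Label one submission with a hypothetical error category."""
    if not brackets_balanced(program):
        return "syntax_error"
    try:
        actual = run_brainfuck(program, stdin)
    except (TimeoutError, IndexError):
        return "runtime_error"
    return "pass" if actual == expected else "logic_error"
```

For example, `classify("+" * 66 + ".", "", "A")` yields `logic_error`: the program is valid and runs, but it prints "B". That is the kind of failure the Brainfuck results describe, whereas the Whitespace failures would be caught at the syntax stage of an equivalent checker.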
Purpose and Availability
The benchmark aims to create evaluations where high scores are actually hard to fake, rather than simply posing harder problems in mainstream languages like Python. The researchers suggest that with this approach the economic incentive to game the benchmark doesn't exist, and the only route to good performance is genuinely learning to generalize.
EsoLang-Bench is available as a template for others to build upon, whether through new languages, new problem types, or entirely different out-of-distribution domains.
📖 Read the full source: r/LocalLLaMA