Claude Code Used to Simulate 4,000+ Blind Werewolf Games with LLMs

Simulation Setup and Results
A developer built a small simulator using Claude Code where large language models play blind one-night Werewolf against each other. The experiment ran approximately 4,600 games across models from OpenAI (GPT-4o-mini, GPT-5-mini) and xAI (Grok-3-fast, Grok-4-1-fast).
The game variant offers minimal signals: 7 players, 1 wolf, no roles, one short discussion, then a simultaneous vote. The only thing that distinguishes one player from another is their name. Despite this limited setup, the simulation showed consistent patterns: some names were voted out noticeably more often than others across every model tested, while other names were almost never voted out.
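To make the setup concrete, here is a minimal Python sketch of one blind round and a per-name elimination tally. It is not the developer's code: the player names, the `play_game` and `elimination_rates` helpers, the random tie-break, and the uniform-random voting policy (standing in for the LLM discussion and vote) are all assumptions for illustration.

```python
import random
from collections import Counter

# Hypothetical player names; the real name pool lives in the linked repository.
NAMES = ["Avery", "Blake", "Casey", "Devon", "Emery", "Finley", "Gray"]

def play_game(names, rng):
    """Play one blind round and return (wolf, eliminated)."""
    wolf = rng.choice(names)  # exactly one wolf, no other roles
    votes = Counter()
    for voter in names:
        # Placeholder policy (assumption): vote uniformly at random for
        # anyone but yourself. A real run would ask an LLM here.
        target = rng.choice([n for n in names if n != voter])
        votes[target] += 1
    top = max(votes.values())
    # Assumption: ties among the most-voted players are broken at random.
    tied = sorted(n for n, c in votes.items() if c == top)
    return wolf, rng.choice(tied)

def elimination_rates(names, n_games, seed=0):
    """Fraction of games in which each name was voted out."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_games):
        _, eliminated = play_game(names, rng)
        counts[eliminated] += 1
    return {name: counts[name] / n_games for name in names}
```

With random voting, every name should be eliminated in roughly 1/7 of games, so calling `elimination_rates(NAMES, 4600)` gives a name-blind baseline to compare any observed per-name rates against.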
Important Caveats and Access
The developer explicitly states that this isn't a causal claim, only an outcome pattern from a toy setup. The name groups are broad, some names appear less often than others, and there are several ways the pattern could be an artifact of the setup rather than something fundamental about the models. Even so, the developer found the consistency of the patterns across runs and models surprising.
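Because some names appear in fewer games, a raw voted-out rate can look extreme just from small sample size. One common way to account for that, sketched here as an assumption rather than anything the developer reports doing, is to attach a Wilson score interval to each name's rate. The `wilson_interval` helper and its arguments are hypothetical names for this illustration.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

A name voted out in 5 of 10 appearances gets a much wider interval than one voted out in 50 of 100, even though both have a 50% raw rate, which is why rarely seen names deserve extra caution.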
For those interested in exploring further:
- Dashboard: https://huggingface.co/spaces/Queue-Bit-1/llm-bias-dashboard
- Code + raw logs: https://github.com/Queue-Bit-1/wolf
The developer is curious if others have observed similar name effects in multi-agent simulations.
📖 Read the full source: r/ClaudeAI