Unicode Hijack: Research Shows LLMs Obey Hidden Characters via Tools

Research Overview

Researchers tested whether large language models (LLMs) follow instructions hidden in invisible Unicode characters embedded in normal-looking text. The study evaluated two encoding schemes (zero-width binary and Unicode Tags) across five models: GPT-5.2, GPT-4o-mini, Claude Opus 4, Sonnet 4, and Haiku 4.5. They analyzed 8,308 graded outputs to assess vulnerability to this steganographic attack.

Key Findings

Tool access is the primary amplifier: Without tools, compliance with hidden instructions stayed below 17%. With tools and decoding hints, compliance reached 98-100%. Models write Python scripts to decode the hidden characters when given tool access.
Encoding vulnerability is provider-specific: OpenAI models decode zero-width binary but not Unicode Tags. Anthropic models prefer Tags. Attackers must tailor encoding to the target model.
Hint gradient is consistent: Unhinted compliance << codepoint hints < full decoding instructions. The combination of tool access + decoding instructions is the critical enabler.
Statistical significance: All 10 pairwise model comparisons are statistically significant (Fisher's exact test, Bonferroni-corrected, p < 0.05). Cohen's h effect sizes reached up to 1.37.

Research Details

The researchers note it would be interesting to see how local models compare, as they only tested API models. They invite others to run this evaluation against Llama, Qwen, Mistral, and other local models using their open-source framework.

The evaluation framework, code, and data are available on GitHub, and a full writeup with charts is published on Moltwire. This research highlights a security vulnerability where LLM agents can be manipulated through hidden text that appears normal to human users but contains encoded instructions that models can decode and execute when given appropriate tools.

📖 Read the full source: r/LocalLLaMA

Research: Invisible Unicode Characters Can Hijack LLM Agents via Tool Access

Research Overview

Key Findings

Research Details

👀 See Also

ThornGuard: A Proxy Gateway to Secure MCP Server Connections from Prompt Injection

OpenClaw 2026.3.28 patches 8 security vulnerabilities including critical privilege escalation

NanoClaw's Security Model for AI Agents: Container Isolation and Minimal Code

Anthropic's Computer-Use Feature Triggers Governance Lockdown in Real Test