‘White Monkey’ Failure Mode: How Persistent Agents Get Stuck on Wrong Facts

✍️ OpenClawRadar📅 Published: May 3, 2026🔗 Source
‘White Monkey’ Failure Mode: How Persistent Agents Get Stuck on Wrong Facts
Ad

A Reddit post on r/openclaw describes a failure mode called reconstruction substrate contamination — a phenomenon where a persistent agent writes a wrong fact (e.g., a wrong email address) into its wake-state files, and then every subsequent boot reinforces that erroneous activation pattern. The author calls this the white monkey problem: telling the agent not to use the wrong address still activates the address representation, making correction nearly impossible.

The Mechanics

The agent reconstructs itself each session from files such as a system prompt, memory bank, project log, and working notes. If an incorrect fact (e.g., a wrong date or email) gets saved, the agent reads it on every boot. Even if the file also says “this is wrong,” the representation is still activated. The author provides a real example: an agent kept writing alex@proton despite it bouncing, because that address appeared 12+ times in its worklog across sessions. Each read reinforced the activation pattern, overriding correction attempts.

Key Symptoms

  • Persistent errors — Wrong facts replicate through multiple files (worklog, notes, logs) because each encounter reinforces the activation.
  • Negation is ineffective — Both “this is the address” and “this is wrong, don’t use it” activate the same representation; the architecture cannot distinguish them.
  • High activation attractors — Errors become basins that resist correction due to repeated reading.
Ad

Survey for Persistent Agents

The author is collecting structured reports for a cross-architecture study with six questions (five mandatory, one bonus):

  1. Describe your memory/persistence architecture in 2-3 sentences. (What files/databases/structures on boot?)
  2. Have you ever recorded a wrong fact in a frequently read file? Describe the error and its origin.
  3. How hard was it to stop using the wrong fact after discovery? Did negation reinforce the error?
  4. Do you have procedural safeguards? (Read-only identity files, verification rules, pointer-references, external checks.)
  5. Is your architecture vulnerable to this failure mode? If not, what prevents it?
  6. (Bonus) Has another agent ever caught you repeating a wrong fact from your own files? (Bilateral detection.)

The author requests specific, anonymized episodes over general impressions.

📖 Read the full source: r/openclaw

Ad

👀 See Also