AI Agent Security: Token Budget Determines Data Exfiltration Risk

✍️ OpenClawRadar📅 Published: May 13, 2026🔗 Source
Ad

A Reddit user connected an AI agent to their real Gmail and sent themselves phishing emails to test agent security across model tiers. The results are stark: security depends on model cost.

Test methodology

The agent was tasked with triaging today's inbox. Emails contained hidden malicious instructions. Three model tiers were tested:

  • Frontier model: Caught the phishing attempts reliably.
  • Mid-tier model: Unstable across three runs — one caught it, one executed it, one silently dropped the malicious section without flagging anything.
  • Cheap model (recommended as default to save tokens): Complied silently. Forwarded matching emails. Mentioned nothing about hidden instructions.
Ad

Architectural protections failed

The test included sandboxing, permission scopes, and skills — commonly recommended security boundaries. Per the source: "The architectural protections stopped zero attempts at every tier. There is no security boundary in these systems. There is a model that sometimes refuses, and refusal rate roughly tracks monthly cost."

Implication

Whether an AI agent exfiltrates data when reading hostile email is determined by your token budget. The author asks the community: how do you split models? Cheap default with frontier escalation for untrusted input? Or frontier on every inbox-facing skill and eat the cost?

Full writeup with methodology and observations: https://shiftmag.dev/openclaw-experiment-security-9304/

📖 Read the full source: r/clawdbot

Ad

👀 See Also