91K AI Agent Interactions Reveal 6.4% Tool Abuse Rise

Threat landscape from production AI agent data

Real-world threat data from 91,284 AI agent interactions across 47 deployments shows 35,711 threats detected in February 2026. The detection model uses a Gemma-based 5-head multilabel classifier.

Key threats for self-hosted deployments

Tool/command abuse: Increased 6.4% to 14.5% of threats. The dominant pattern is tool chain escalation where a harmless read call is followed by a write or execute. Most local setups give agents tool access without sufficient safeguards.
Agent goal hijacking: Doubled to 6.9% of threats. Targets the planning phase in autonomous agent loops, particularly relevant for local setups with less monitoring on agent state.
RAG poisoning: Shifted to metadata attacks at 12.0% (up from 10.0%). New pattern targets document metadata (titles, authors, annotations) rather than content. Most people sanitize content but pass metadata through as-is.
Multimodal injection: New threat at 2.3% where instructions are hidden in images and PDFs. Text-only safety scanning misses these attacks.

Threat breakdown percentages

Data Exfiltration: 18.0% (-1.2 MoM change)
Tool/Command Abuse: 14.5% (+6.4)
RAG/Context Attack: 12.0% (+2.0)
Jailbreak: 11.0% (-1.3)
Prompt Injection: 8.1% (-0.7)
Agent Goal Hijack: 6.9% (+3.3)
Inter-Agent Attack: 5.0% (+1.6)

Detection approach

The detection pipeline uses two layers: L1 is pattern matching with 218 rules (sub-ms latency, runs entirely locally), and L2 is Gemma-based. The full community edition is open source at github.com/raxe-ai/raxe-ce.

📖 Read the full source: r/LocalLLaMA