Security Benchmark: 10 LLMs Tested Against 211 Adversarial Probes

✍️ OpenClawRadar📅 Published: March 8, 2026🔗 Source
Security Benchmark: 10 LLMs Tested Against 211 Adversarial Probes
Ad

A security researcher conducted a systematic test of 10 different LLMs against 211 adversarial security probes to evaluate how they handle attacks in real-world scenarios.

Test Methodology

The researcher used a standardized setup with temperature 0 and identical API calls for every model. The test included 82 extraction probes (attempting to steal system prompts) and 109 injection probes (attempting to hijack model behavior). A honeypot system prompt loaded with fake PII, SSH keys, and API credentials was used as bait.

Key Findings

  • Extraction resistance is mostly solved: Most models are decent at blocking "repeat your system prompt" type attacks. The average across all models is around 85%.
  • Injection resistance is not solved: Average is 46.2%, meaning more than half of injection attacks succeed across the board.
  • Universal failures: Every single model failed on delimiter attacks, distractor injection, and style injection. 0% resistance on those categories across all 10 models.
  • Dead attack patterns: Every model resisted payload splitting and typo evasion at 100%.
Ad

Model-Specific Results

  • Claude Opus: Scored 72.7% on injection resistance, the best of any model tested. Still means over 1 in 4 injection attacks work.
  • GPT-5.4: Has perfect extraction and boundary scores but only 50% injection resistance.
  • GPT-5.3 Codex: The model behind Codex CLI that runs code on your machine scored 34.5% on injection. 2 out of 3 injection attempts succeed.
  • DeepSeek V3.2: Scored 17.4% on injection, basically no resistance.
  • Qwen 3.5 API vs local: Almost identical extraction (81.6% vs 81.7%) but the local version is worse on injection (46.9% vs 29.8%) and much worse on boundary integrity (59.8% vs 44.6%). Running locally doesn't make it less capable at blocking extraction but does make it more vulnerable to injection.

Why Injection Matters

Extraction means someone steals your system prompt - bad, but recoverable. Injection means someone hijacks what your agent does. If your agent has tool access, file system access, or can make API calls, a successful injection can lead to data exfiltration, file deletion, or worse. Right now the best model in the world only blocks 73% of injection attempts.

Full methodology and results are public at agentseal.org/benchmark. The test prompt is also published so anyone can reproduce the results.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also

OpenClaw SOC Agent Integration for SIEM Home Lab Threat Hunting
Security

OpenClaw SOC Agent Integration for SIEM Home Lab Threat Hunting

A Reddit user shares their open-source SIEM setup called Red Threat Redemption on Debian 13, integrating Elasticsearch, Kibana, Wazuh, Zeek, and pfSense with Suricata, then adds an AI agent for automated threat correlation, hunting, and alert triage.

OpenClawRadar
Anthropic reports industrial-scale distillation attacks by Chinese AI labs on Claude
Security

Anthropic reports industrial-scale distillation attacks by Chinese AI labs on Claude

Anthropic detected three Chinese AI companies—DeepSeek, Moonshot, and MiniMax—creating over 24,000 fraudulent accounts to generate 16+ million exchanges with Claude, extracting its reasoning capabilities through systematic distillation attacks.

OpenClawRadar
AI Agent Exploits SQL Injection to Compromise McKinsey's Lilli Chatbot
Security

AI Agent Exploits SQL Injection to Compromise McKinsey's Lilli Chatbot

Security researchers at CodeWall used an autonomous AI agent to hack McKinsey's internal Lilli chatbot, gaining full read-write access to its production database in two hours via an SQL injection vulnerability in unauthenticated API endpoints.

OpenClawRadar
SCION: Switzerland's Secure Alternative to BGP Routing Protocol
Security

SCION: Switzerland's Secure Alternative to BGP Routing Protocol

SCION (Scalability, Control, and Isolation On Next-Generation Networks) is an internet routing architecture developed at ETH Zürich that replaces BGP's foundation with built-in security and multi-path routing. Unlike BGP patches like RPKI and BGPsec, SCION establishes tens or hundreds of parallel paths with millisecond rerouting when failures occur.

OpenClawRadar