How AI assistants fetch web pages: Nginx log analysis of ChatGPT, Claude, Gemini and others

By OpenClawRadar. Published: April 20, 2026

A developer conducted a practical experiment to determine whether AI assistants fetch web pages live or answer from cached indexes when users ask about specific sites. By setting up custom Nginx logging and prompting major chatbots with unique query strings, they captured clear evidence of retrieval behavior.

The probe setup

The test used a custom Nginx log format to capture request headers, such as Accept, that the default combined log format leaves out:

log_format ai_probe escape=json
    '{'
    '"time":"$time_iso8601",'
    '"ip":"$remote_addr",'
    '"uri":"$request_uri",'
    '"status":$status,'
    '"ua":"$http_user_agent",'
    '"referer":"$http_referer",'
    '"accept":"$http_accept"'
    '}';

Each assistant received a prompt pointing to a unique query string (/?ai=chatgpt, /?ai=claude, etc.), making attribution straightforward. Prompts were rerun across sessions so that transient cache hits would not mask retrieval patterns.
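As a rough illustration of how such a log could be attributed, the sketch below reads JSON lines in the ai_probe format and counts requests per ?ai= tag. The function names (probe_tag, count_probe_hits) and the sample log lines are hypothetical and are not taken from the original experiment.

```python
import json
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def probe_tag(uri):
    """Return the value of the ?ai= query parameter, or None if absent."""
    values = parse_qs(urlsplit(uri).query).get("ai")
    return values[0] if values else None


def count_probe_hits(lines):
    """Count requests per ?ai= tag across JSON-formatted log lines."""
    hits = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        tag = probe_tag(entry.get("uri", ""))
        if tag:
            hits[tag] += 1
    return hits


# Made-up sample lines in the ai_probe format (IPs are documentation addresses).
sample = [
    '{"time":"2026-04-01T10:00:00+00:00","ip":"203.0.113.5","uri":"/?ai=chatgpt",'
    '"status":200,"ua":"ChatGPT-User/1.0","referer":"","accept":"text/html"}',
    '{"time":"2026-04-01T10:00:01+00:00","ip":"203.0.113.6","uri":"/?ai=chatgpt",'
    '"status":200,"ua":"ChatGPT-User/1.0","referer":"","accept":"text/html"}',
    '{"time":"2026-04-01T10:05:00+00:00","ip":"203.0.113.9","uri":"/?ai=claude",'
    '"status":200,"ua":"Claude-User/1.0","referer":"","accept":"*/*"}',
    '{"time":"2026-04-01T10:06:00+00:00","ip":"203.0.113.7","uri":"/about",'
    '"status":200,"ua":"Mozilla/5.0","referer":"","accept":"text/html"}',
]

hits = count_probe_hits(sample)
print(dict(hits))
```

A tag with no entries in the result, as reported for Gemini, would mean no request carrying that query string reached the origin during the window.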

Who announced themselves with dedicated user-agents

Five assistants arrived with retrieval-specific signals:

  • ChatGPT: ChatGPT-User/1.0 (Chrome-style Accept, no robots.txt check)
  • Claude: Claude-User/1.0 (*/* Accept, always checks robots.txt first)
  • Perplexity: Perplexity-User/1.0 (empty Accept header)
  • Meta AI: meta-webindexer/1.1 (*/* Accept, no robots.txt check)
  • Manus: Manus-User/1.0 suffix on Chrome UA (Chrome-style Accept)

All five fetched the page directly from the origin.
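A minimal sketch of how the user-agent tokens listed above could be matched in a log pipeline. The token-to-name mapping comes from the list; the function name classify_user_agent and the full user-agent strings in the usage lines are illustrative assumptions, not captured values.

```python
# Tokens reported in the article, mapped to the assistant they identify.
DEDICATED_TOKENS = {
    "ChatGPT-User": "ChatGPT",
    "Claude-User": "Claude",
    "Perplexity-User": "Perplexity",
    "meta-webindexer": "Meta AI",
    "Manus-User": "Manus",
}


def classify_user_agent(ua):
    """Return the assistant name if the UA contains a dedicated token, else None."""
    lowered = ua.lower()
    for token, name in DEDICATED_TOKENS.items():
        if token.lower() in lowered:
            return name
    return None


# Hypothetical UA strings for illustration only.
print(classify_user_agent("Claude-User/1.0"))
print(classify_user_agent("Mozilla/5.0 (X11; Linux x86_64) Chrome/135.0.0.0 Safari/537.36"))
```

Substring matching is used because, per the list, Manus appends its token to an otherwise ordinary Chrome user-agent. User-agent strings are trivially spoofable, so a match is a hint rather than proof of origin.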

Who didn't announce themselves

  • Gemini: Zero requests from any Google user-agent during the prompt window. It answered entirely from its own index without performing a live provider-side fetch.
  • Copilot: Plain Chrome 135 on Linux x86_64 with a full browser-style Accept header. It fetched the page but was indistinguishable from a human visitor.
  • Grok: Plain Mac Safari 26 and plain Mac Chrome 143. It fetched the page but was indistinguishable from a human visitor.

Key behavioral patterns observed

ChatGPT: Hits from multiple source IPs within the same burst, typically pulling several candidate pages at once while deciding which to cite. In a 24-hour production window, ChatGPT-User requests came from five distinct Azure ranges: 23.98.x.x, 20.215.x.x, 40.67.x.x, 51.8.x.x, and 51.107.x.x.
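One way to surface this kind of IP spread from parsed log entries is to group ChatGPT-User requests by /16 prefix, as sketched below. The /16 grouping is an assumption chosen to mirror the x.x notation above, it only makes sense for IPv4, and the sample entries are invented.

```python
import ipaddress
from collections import Counter


def slash16(ip):
    """Return the /16 network containing an IPv4 address, e.g. '23.98.0.0/16'."""
    return str(ipaddress.ip_network(f"{ip}/16", strict=False))


def prefix_spread(entries, ua_token="ChatGPT-User"):
    """Count requests per /16 prefix for entries whose UA contains ua_token."""
    spread = Counter()
    for entry in entries:
        if ua_token in entry.get("ua", ""):
            spread[slash16(entry["ip"])] += 1
    return spread


# Invented entries; only the first two octets echo the ranges named above.
entries = [
    {"ip": "23.98.10.1", "ua": "ChatGPT-User/1.0"},
    {"ip": "23.98.200.7", "ua": "ChatGPT-User/1.0"},
    {"ip": "20.215.3.4", "ua": "ChatGPT-User/1.0"},
    {"ip": "198.51.100.2", "ua": "Mozilla/5.0"},
]

spread = prefix_spread(entries)
print(dict(spread))
```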

Claude: Requests /robots.txt before every page fetch, from Anthropic-owned IP space in the 216.73.216.0/24 range. Follows redirects cleanly, including trailing-slash normalization. Anthropic runs three distinct bots: Claude-User (user-initiated retrieval), Claude-SearchBot (search index), and ClaudeBot (training crawler).
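The robots.txt-first pattern and the IP range can both be checked against parsed log entries. The sketch below is one possible approach under stated assumptions: the 60-second window and both function names are hypothetical, and the sample entries are invented. Only the 216.73.216.0/24 range and the Claude-User token come from the text above.

```python
import ipaddress
from datetime import datetime, timedelta
from urllib.parse import urlsplit

# Range reported in the article for Claude-User fetches.
REPORTED_RANGE = ipaddress.ip_network("216.73.216.0/24")


def in_reported_range(ip):
    """True if the address falls inside the reported /24."""
    return ipaddress.ip_address(ip) in REPORTED_RANGE


def pages_without_prior_robots(entries, ua_token="Claude-User", window_seconds=60):
    """Return URIs fetched by ua_token with no /robots.txt request shortly before.

    The window is an assumption; tune it to the gaps seen in real logs.
    """
    window = timedelta(seconds=window_seconds)
    last_robots = None
    missing = []
    ordered = sorted(entries, key=lambda e: datetime.fromisoformat(e["time"]))
    for entry in ordered:
        if ua_token not in entry["ua"]:
            continue
        ts = datetime.fromisoformat(entry["time"])
        if urlsplit(entry["uri"]).path == "/robots.txt":
            last_robots = ts
        elif last_robots is None or ts - last_robots > window:
            missing.append(entry["uri"])
    return missing


# Invented entries in the shape produced by the ai_probe log format.
entries = [
    {"time": "2026-04-01T10:00:00+00:00", "ip": "216.73.216.10",
     "uri": "/robots.txt", "ua": "Claude-User/1.0"},
    {"time": "2026-04-01T10:00:01+00:00", "ip": "216.73.216.10",
     "uri": "/?ai=claude", "ua": "Claude-User/1.0"},
    {"time": "2026-04-01T10:02:00+00:00", "ip": "203.0.113.5",
     "uri": "/?ai=chatgpt", "ua": "ChatGPT-User/1.0"},
]

print(pages_without_prior_robots(entries))
print(pages_without_prior_robots(entries, ua_token="ChatGPT-User"))
```

An empty result for Claude-User would be consistent with the behavior described; any URI returned marks a page fetch that was not preceded by a robots.txt request inside the window.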

Perplexity: Direct fetch with no Accept header or referrer. PerplexityBot (their search-indexing crawler) separately pinged /robots.txt. The author notes Perplexity can retrieve live but doesn't have to, as it can answer from its own index.

Gemini: No live provider-side fetch observed. Google doesn't publish a retrieval-specific user-agent for Gemini, and according to Google's crawler documentation, AI Overviews and AI Mode ground on the same Search index that Googlebot populates.

The experiment distinguishes between two signals: a provider-side fetch (the assistant hits the origin with a dedicated user-agent) and a real clickthrough visit (a human reads the AI answer and clicks a citation, arriving as a normal browser with the assistant as referrer). Combining both into a single "AI-traffic" number hides this useful distinction.
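To keep the two signals separate in reporting, each log entry can be labeled before counting. The sketch below is an assumption-laden illustration: the referrer host list is a guess at what assistant-originated clicks might carry and should be checked against real logs, and classify_hit is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlsplit

# Dedicated user-agent tokens reported in the article.
FETCHER_TOKENS = ("chatgpt-user", "claude-user", "perplexity-user",
                  "meta-webindexer", "manus-user")

# Assumed referrer hosts for human clickthroughs; verify against your own logs.
ASSISTANT_REFERRER_HOSTS = {"chatgpt.com", "claude.ai", "perplexity.ai",
                            "www.perplexity.ai"}


def classify_hit(entry):
    """Label an entry as 'provider_fetch', 'clickthrough', or 'other'."""
    ua = entry.get("ua", "").lower()
    if any(token in ua for token in FETCHER_TOKENS):
        return "provider_fetch"
    host = urlsplit(entry.get("referer", "")).hostname or ""
    if host in ASSISTANT_REFERRER_HOSTS:
        return "clickthrough"
    return "other"


# Invented entries for illustration.
entries = [
    {"ua": "ChatGPT-User/1.0", "referer": ""},
    {"ua": "Mozilla/5.0 (Macintosh) Safari/605.1.15", "referer": "https://chatgpt.com/"},
    {"ua": "Mozilla/5.0 (X11; Linux x86_64) Chrome/135.0.0.0", "referer": ""},
]

totals = Counter(classify_hit(e) for e in entries)
print(dict(totals))
```

Assistants that fetch with a plain browser user-agent and no referrer, as reported for Copilot and Grok, land in "other" under this scheme, which is the same blind spot the article describes.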

📖 Read the full source: HN AI Agents
