Local Qwen Models Achieve Browser Automation with Stepwise Planning and Compact DOM

Stepwise Planning Overcomes Upfront Planning Failures
The developer found that asking a model to write a full multi-step plan before it sees the real page state works on familiar sites but breaks as soon as the page contains an unexpected element. Stepwise planning worked better: the model replans from the current DOM snapshot at each step.
Example Flow on Ace Hardware
The tested flow used Qwen 8B as planner and 4B as executor on Ace Hardware, a site the model had not previously been given a task on, and completed a full cart flow without using a vision model at all. The stepwise approach looked like this:
- Step 1: see search box → TYPE "grass mower"
- Step 2: see results → CLICK Add to Cart
- Step 3: drawer appears → dismiss it
- Step 4: cart visible → CLICK View Cart
- Step 5: DONE
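The steps above can be sketched as a loop that takes a fresh snapshot, asks the planner for exactly one action, and executes it before looking again. This is a minimal illustration, not the developer's code: `Action`, `run_stepwise`, and the toy `snapshot_fn`, `plan_next`, and `execute` stand-ins are all hypothetical names, and the rule table replaces what would be a call to the planner model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str         # "TYPE", "CLICK", or "DONE"
    target: str = ""  # element label or id (hypothetical field)
    text: str = ""    # text to type, if any


def run_stepwise(
    goal: str,
    snapshot_fn: Callable[[], str],
    plan_next: Callable[[str, str, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Replan from a fresh snapshot at every step instead of planning upfront."""
    history: List[Action] = []
    for _ in range(max_steps):
        snapshot = snapshot_fn()                     # current page state
        action = plan_next(goal, snapshot, history)  # one action, not a full plan
        history.append(action)
        if action.kind == "DONE":
            break
        execute(action)
    return history


# Toy stand-ins for the browser and the planner model (assumptions for the demo).
PAGES = ["search box", "results", "drawer", "cart visible", "cart page"]
state = {"page": 0}


def snapshot_fn() -> str:
    return PAGES[state["page"]]


def execute(action: Action) -> None:
    state["page"] += 1  # every action advances the fake site by one page


def plan_next(goal: str, snapshot: str, history: List[Action]) -> Action:
    rules = {
        "search box": Action("TYPE", "search box", "grass mower"),
        "results": Action("CLICK", "Add to Cart"),
        "drawer": Action("CLICK", "dismiss"),
        "cart visible": Action("CLICK", "View Cart"),
    }
    return rules.get(snapshot, Action("DONE"))


history = run_stepwise("add a grass mower to the cart", snapshot_fn, plan_next, execute)
for step, action in enumerate(history, start=1):
    print(step, action.kind, action.target, action.text)
```

Because the planner only ever answers "what next, given this page", an unexpected drawer is just another snapshot to react to rather than a deviation from a plan written in advance.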
Compact DOM Representation Enables Small Models
The model never sees raw HTML or screenshots, only a compact semantic table:
id|role|text|importance|bg|clickable|nearby_text
665|button|Proceed to checkout|675|orange|1|
761|button|Add to cart|720|yellow|1|$299.99
1488|link|ThinkPad E16|478|none|1|Laptop 16"
This allows the 4B executor to pick an element ID from a short list. Vision approaches spend 2-3K tokens per screenshot, which easily adds up to 50-100K+ for a full flow, while compact snapshots use ~15K tokens in total for the same task.
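To show how small the executor's job becomes, here is a rough sketch that parses the pipe-delimited rows shown above and lists the clickable candidates. The function names are hypothetical, and sorting by `importance` assumes that a higher value means a more prominent element, which the source does not state.

```python
from typing import Dict, List

# The sample rows from the snapshot format described in the article.
SNAPSHOT = '''id|role|text|importance|bg|clickable|nearby_text
665|button|Proceed to checkout|675|orange|1|
761|button|Add to cart|720|yellow|1|$299.99
1488|link|ThinkPad E16|478|none|1|Laptop 16"'''


def parse_snapshot(raw: str) -> List[Dict[str, str]]:
    """Turn the pipe-delimited snapshot into a list of dicts keyed by header."""
    lines = raw.strip().splitlines()
    header = lines[0].split("|")
    rows = []
    for line in lines[1:]:
        # Plain split (not csv) so a stray quote such as 16" is kept literally.
        values = line.split("|", len(header) - 1)
        values += [""] * (len(header) - len(values))  # pad missing trailing fields
        rows.append(dict(zip(header, values)))
    return rows


def clickable_by_importance(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep clickable elements, highest importance first (ordering is an assumption)."""
    clickable = [r for r in rows if r["clickable"] == "1"]
    return sorted(clickable, key=lambda r: int(r["importance"]), reverse=True)


candidates = clickable_by_importance(parse_snapshot(SNAPSHOT))
for row in candidates:
    print(row["id"], row["role"], row["text"], row["nearby_text"])
```

The executor then only has to return one of the printed IDs, which is a much narrower task than locating a button in a screenshot.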
Modal Handling Critical for Success
After each click, if the DOM suddenly grows, the agent scans for dismiss patterns (close, ×, no thanks, etc.) before planning again. This fixed many failures that looked like "bad reasoning" but were actually caused by hidden overlays.
The developer asks whether others are also seeing stepwise planning beat upfront planning once sites become unfamiliar.
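A minimal sketch of this check, under stated assumptions: "suddenly grows" is modeled as the element count rising by at least 30% (the `GROWTH_THRESHOLD` value is invented for illustration), and only the three dismiss patterns named above are matched. The function and field names are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Dismiss patterns taken from the article; a real list would likely be longer.
DISMISS_RE = re.compile(r"\b(close|no thanks)\b|×", re.IGNORECASE)

GROWTH_THRESHOLD = 1.3  # assumed: 30% more elements counts as sudden growth


def find_dismiss_target(
    before: List[Dict[str, str]],
    after: List[Dict[str, str]],
    threshold: float = GROWTH_THRESHOLD,
) -> Optional[str]:
    """Return the id of a likely dismiss control if the DOM grew, else None."""
    if not before or len(after) < len(before) * threshold:
        return None  # no sudden growth, so assume no overlay appeared
    old_ids = {el["id"] for el in before}
    for el in after:
        # Only consider newly added, clickable elements.
        if el["id"] in old_ids or el.get("clickable") != "1":
            continue
        if DISMISS_RE.search(el["text"]):
            return el["id"]
    return None


before = [
    {"id": "1", "text": "Search", "clickable": "1"},
    {"id": "2", "text": "Add to cart", "clickable": "1"},
    {"id": "3", "text": "View Cart", "clickable": "1"},
]
after = before + [
    {"id": "900", "text": "Sign up for deals", "clickable": "0"},
    {"id": "901", "text": "No thanks", "clickable": "1"},
]

dismiss_id = find_dismiss_target(before, after)
print(dismiss_id)
```

Running this check before the next planning call keeps the planner from reasoning about a page whose real controls are covered by an overlay.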
📖 Read the full source: r/LocalLLaMA