Building a Coding Agent for 8k Context: Planner/Executor Split, Token Budgeting, and Parallel Execution

Most AI coding tools assume 200k-token models, but if you're running local LLMs via Ollama or LM Studio, or using free-tier APIs like Groq or OpenRouter, you're stuck with ~8k tokens. That is not enough for a whole project; it barely fits a single large file. One developer spent weeks building a CLI agent designed around this constraint and shared the practical lessons learned.
Core architecture: planner/executor split
The agent never shows the LLM the entire project. Instead, it splits work into three roles:
- Planner: sees only a lightweight project map (Markdown summaries of each folder, ~300-500 tokens total) plus the user request, and outputs a task list.
- Executor: sees exactly one file plus one task per call — never two files together.
- Orchestrator: pure code (no LLM) that builds a dependency graph from the task list and decides which tasks can run in parallel vs sequentially.
This turns multi-file refactors from a context-window problem into a scheduling problem. The planner doesn't need to see code, and the executor only sees a bounded amount of code at once.
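
To make the scheduling idea concrete, here is a minimal sketch of how an orchestrator could turn a planner's task list into parallel "waves". The Task fields, the schedule() helper, and the sample file names are assumptions for illustration; the original post does not describe the task format. It uses Python's standard graphlib module.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Task:
    """One planner-emitted task: a single file plus a single instruction (hypothetical shape)."""
    id: str
    file: str
    instruction: str
    depends_on: list[str] = field(default_factory=list)


def schedule(tasks: list[Task]) -> list[list[str]]:
    """Group task ids into waves; every task in a wave can run in parallel."""
    sorter = TopologicalSorter({t.id: set(t.depends_on) for t in tasks})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Illustrative task list; names and instructions are made up.
tasks = [
    Task("t1", "models/user.py", "rename field name to full_name"),
    Task("t2", "api/users.py", "use full_name", depends_on=["t1"]),
    Task("t3", "tests/test_users.py", "update assertions", depends_on=["t1"]),
    Task("t4", "README.md", "fix a typo"),
]
waves = schedule(tasks)
print(waves)  # [['t1', 't4'], ['t2', 't3']]
```

No LLM is involved in this step: the planner only has to emit ids and dependencies, and plain code decides what runs when.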
Token budgeting enforced in code
Every LLM call goes through a canFit() check that measures system prompt + reserved output tokens + memory + actual code. If code doesn't fit, the agent falls back to a per-file line index (generated once for files over ~150 lines) and pulls only the relevant section.
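
The post does not say what the line index looks like. As one possible shape, the sketch below assumes Python source files and maps each top-level function or class to its line range, so only that range needs to be sent to the model. build_line_index() and pull_section() are hypothetical names.

```python
import ast


def build_line_index(source: str) -> dict[str, tuple[int, int]]:
    """Map each top-level function/class name to its (start, end) line numbers (1-based)."""
    index = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            index[node.name] = (node.lineno, node.end_lineno)
    return index


def pull_section(source: str, index: dict[str, tuple[int, int]], name: str) -> str:
    """Return only the lines belonging to the named section."""
    start, end = index[name]
    return "\n".join(source.splitlines()[start - 1:end])


sample = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Cache:
    def get(self, key):
        return None
'''
index = build_line_index(sample)
section = pull_section(sample, index, "load")
print(index)  # {'load': (3, 5), 'Cache': (7, 9)}
```

An index like this costs a few tokens per entry, so it can be shown to the model in place of the file when the file itself will not fit.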
Budget math for 8192 tokens:
System prompt + instructions: ~1000
Reserved for response: ~2000
Short-term memory (4 entries): ~360
Available for actual code: ~4800 (about 140-190 lines)
When budget is tight, folder context is dropped first, then memory, before cutting actual code.
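
The budget math works out to 8192 - 1000 - 2000 - 360 = 4832 tokens for code. A rough sketch of the check and the drop order might look like the following. The 4-characters-per-token estimate, the fit_context() helper, and dropping the oldest memory entry first are assumptions; the original only names canFit() and the order folder context, then memory, then code.

```python
CONTEXT_WINDOW = 8192
SYSTEM_PROMPT = 1000     # system prompt + instructions
RESERVED_OUTPUT = 2000   # room left for the model's response


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def can_fit(code: str, memory: list[str], folder_context: str = "") -> bool:
    """True if prompt + reserved output + memory + folder context + code fit the window."""
    used = (SYSTEM_PROMPT + RESERVED_OUTPUT
            + sum(estimate_tokens(m) for m in memory)
            + estimate_tokens(folder_context)
            + estimate_tokens(code))
    return used <= CONTEXT_WINDOW


def fit_context(code: str, memory: list[str], folder_context: str):
    """Drop folder context first, then memory entries; return None if code alone is too big."""
    if can_fit(code, memory, folder_context):
        return code, memory, folder_context
    folder_context = ""
    memory = list(memory)
    while memory and not can_fit(code, memory, folder_context):
        memory.pop(0)  # assumption: oldest entry goes first
    if can_fit(code, memory, folder_context):
        return code, memory, folder_context
    return None  # caller would fall back to the line index


memory = ["m" * 360] * 4   # 4 entries of ~90 tokens each
folder = "f" * 1600        # ~400 tokens of folder context
fitted = fit_context("x" * 19200, memory, folder)  # ~4800 tokens of code
print(fitted[2] == "", len(fitted[1]))  # True 4
```

In this example the full context needs 8560 tokens, so the folder context is dropped and the four memory entries survive at 8160 tokens.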
Parallel execution as speed multiplier
Because each executor sees only one file, independent edits across files run simultaneously. A 5-file refactor completes in roughly the time of the longest single edit. The dependency graph (built in code from the planner's task list) decides ordering.
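
As an illustration of why the wall-clock time approaches that of the longest edit, the sketch below runs groups of independent tasks concurrently with asyncio. run_executor() is a hypothetical stand-in that sleeps instead of calling a model, and the task ids and durations are made up.

```python
import asyncio


async def run_executor(task_id: str, seconds: float) -> str:
    """Stand-in for one executor LLM call (hypothetical); sleeps instead of calling a model."""
    await asyncio.sleep(seconds)
    return f"{task_id}: done"


async def run_waves(waves: list[list[str]], durations: dict[str, float]) -> list[str]:
    """Run each wave concurrently; a wave starts only after the previous one finishes."""
    results = []
    for wave in waves:
        # Tasks in one wave touch different files, so their calls can overlap.
        finished = await asyncio.gather(*(run_executor(t, durations[t]) for t in wave))
        results.extend(finished)
    return results


# Two waves: t1 and t4 are independent; t2 and t3 wait for the first wave.
waves = [["t1", "t4"], ["t2", "t3"]]
durations = {"t1": 0.02, "t2": 0.01, "t3": 0.03, "t4": 0.01}
results = asyncio.run(run_waves(waves, durations))
print(results)  # ['t1: done', 't4: done', 't2: done', 't3: done']
```

Each wave takes about as long as its slowest task, so a refactor with no cross-file dependencies collapses into a single wave. Running wave by wave is a simplification; a scheduler could also start each task the moment its own dependencies finish.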
Pain points and fixes
- Question-style requests overwriting files: asking "how many lines does X have?" caused the executor to write the answer into the file. Fixed by adding an action_type: "query" field to the planner's output, routed through a code path that never touches disk.
- Stale project maps causing silent misroutes: if the user mentioned a renamed file not in the map, the planner would silently route to the closest match. Now the orchestrator validates that mentioned file paths exist on disk and throws a clear error if they don't.
- Markdown fences in executor output: smaller models wrap code in triple backticks even when told not to. Fix: strip them in post-processing instead of fighting the prompt.
- Memory token cost: persistent memory adds ~80-90 tokens per entry. Folder context is dropped first when budget is tight, then memory, before actual code gets cut.
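
Three of the fixes above are small enough to sketch in one place: stripping a wrapping fence, validating paths before any call, and routing queries away from disk. This is an illustration under assumptions, not the project's code. The task dictionary shape, the "edit" action value, and the function names are hypothetical; only the action_type: "query" field comes from the post.

```python
import re
import tempfile
from pathlib import Path

FENCE = "`" * 3  # a Markdown code fence
FENCE_RE = re.compile(rf"^\s*{FENCE}[^\n]*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def strip_fences(output: str) -> str:
    """Remove a single wrapping Markdown fence, if the model added one."""
    match = FENCE_RE.match(output)
    return match.group(1) if match else output


def validate_paths(paths: list[str], root: Path) -> None:
    """Fail loudly instead of letting the planner guess a closest match."""
    missing = [p for p in paths if not (root / p).is_file()]
    if missing:
        raise FileNotFoundError(f"Not found on disk (stale project map?): {', '.join(missing)}")


def handle_task(task: dict, executor_output: str, root: Path) -> str:
    """Route one executor result: queries are returned, edits are written to disk."""
    validate_paths([task["file"]], root)
    if task.get("action_type") == "query":
        return executor_output  # the answer goes to the user, never to disk
    (root / task["file"]).write_text(strip_fences(executor_output) + "\n")
    return f"wrote {task['file']}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app.py").write_text("x = 1\n")
    answer = handle_task({"file": "app.py", "action_type": "query"}, "app.py has 1 line", root)
    unchanged = (root / "app.py").read_text()
    handle_task({"file": "app.py", "action_type": "edit"}, FENCE + "python\nx = 2\n" + FENCE, root)
    edited = (root / "app.py").read_text()
    try:
        handle_task({"file": "old_name.py", "action_type": "edit"}, "x = 3", root)
        error = None
    except FileNotFoundError as exc:
        error = str(exc)
```

Validating strictly like this would also reject tasks that create new files, so a real orchestrator would likely need a separate action for file creation.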
Open questions
It is still unclear whether the planner/executor split scales to codebases over 50 files: the dependency graph stays manageable, but the project map starts costing real tokens. The agent currently drops folder context first, but deeper edits then lose context. The implementation is open-sourced if you want to dig in.
📖 Read the full source: r/LocalLLaMA