Build a Coding Agent for 8k Context: Planner, Token Budget, Parallel

Most AI coding tools assume models with 200k-token context windows. If you run local LLMs through Ollama or LM Studio, or use free-tier APIs such as Groq or OpenRouter, you may be working with roughly 8k tokens instead. That is not enough for a whole project, and it barely fits a single large file. One developer spent weeks building a CLI agent designed around this constraint and shared the practical lessons learned.

Core architecture: planner/executor split

The agent never shows the LLM the entire project. Instead, it splits work into three roles:

Planner: sees only a lightweight project map (Markdown summaries of each folder, ~300-500 tokens total) plus the user request, and outputs a task list.
Executor: sees exactly one file plus one task per call — never two files together.
Orchestrator: pure code (no LLM) that builds a dependency graph from the task list and decides which tasks can run in parallel vs sequentially.

This turns multi-file refactors from a context-window problem into a scheduling problem. The planner doesn't need to see code, and the executor only sees a bounded amount of code at once.
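To make the split concrete, here is a minimal sketch of what each role might receive. The `Task` fields, the JSON keys (`id`, `file`, `instruction`, `depends_on`), and the prompt wording are all assumptions made for illustration; the original post does not specify the planner's output schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One unit of work for the executor (hypothetical shape)."""
    task_id: str
    file_path: str
    instruction: str
    depends_on: tuple = ()


def planner_prompt(project_map: str, request: str) -> str:
    """The planner sees only the project map and the request, never code."""
    return (
        f"Project map:\n{project_map}\n\n"
        f"Request:\n{request}\n\n"
        "Reply with a JSON list of tasks."
    )


def parse_plan(planner_reply: str) -> list:
    """Turn the planner's JSON reply into Task objects (assumed schema)."""
    return [
        Task(
            task_id=item["id"],
            file_path=item["file"],
            instruction=item["instruction"],
            depends_on=tuple(item.get("depends_on", [])),
        )
        for item in json.loads(planner_reply)
    ]


def executor_prompt(task: Task, file_text: str) -> str:
    """The executor sees exactly one file and one task per call."""
    return f"File: {task.file_path}\n{file_text}\n\nTask:\n{task.instruction}"
```

The orchestrator would sit between `parse_plan` and `executor_prompt`, using `depends_on` to decide ordering without ever calling the model itself.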

Token budgeting enforced in code

Every LLM call goes through a canFit() check that adds up the system prompt, the reserved output tokens, memory, and the actual code, then compares the total against the context limit. If the code doesn't fit, the agent falls back to a per-file line index (generated once for files over ~150 lines) and pulls only the relevant section.
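The check and the fallback could look roughly like the sketch below. It is written in Python with `can_fit` standing in for the post's canFit(). The 4-characters-per-token estimate, the index format (top-level `def`/`class` names mapped to line ranges), and the helper names are assumptions; the actual agent may count tokens and index files differently.

```python
import math
import re

CONTEXT_LIMIT = 8192    # tokens, from the article
RESERVED_OUTPUT = 2000  # tokens held back for the model's reply


def estimate_tokens(text: str) -> int:
    """Assumed heuristic: roughly 4 characters per token."""
    return math.ceil(len(text) / 4)


def can_fit(system_prompt: str, memory: list, code: str,
            limit: int = CONTEXT_LIMIT, reserved: int = RESERVED_OUTPUT) -> bool:
    """True if prompt + reserved output + memory + code stays within the limit."""
    used = (
        estimate_tokens(system_prompt)
        + reserved
        + sum(estimate_tokens(entry) for entry in memory)
        + estimate_tokens(code)
    )
    return used <= limit


def build_line_index(source: str) -> dict:
    """Map each top-level def/class name to its (first, last) 1-based line range."""
    lines = source.splitlines()
    starts = []
    for number, line in enumerate(lines, start=1):
        match = re.match(r"(?:def|class)\s+(\w+)", line)
        if match:
            starts.append((match.group(1), number))
    index = {}
    for position, (name, first) in enumerate(starts):
        is_last = position + 1 == len(starts)
        last = len(lines) if is_last else starts[position + 1][1] - 1
        index[name] = (first, last)
    return index


def select_code(source: str, symbol: str, system_prompt: str, memory: list) -> str:
    """Send the whole file if it fits; otherwise only the indexed section."""
    if can_fit(system_prompt, memory, source):
        return source
    first, last = build_line_index(source)[symbol]
    return "\n".join(source.splitlines()[first - 1:last])
```

This sketch rebuilds the index on every call for brevity, whereas the post describes generating it once per file. It also does not handle a single section that is itself too large for the budget.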

Budget math for 8192 tokens:

System prompt + instructions: ~1000
Reserved for response: ~2000
Short-term memory (4 entries): ~360
Available for actual code: ~4800 (about 140-190 lines)

When budget is tight, folder context is dropped first, then memory, before cutting actual code.
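The arithmetic and the drop order above can be checked with a short sketch. It works directly on token counts. The `Budget` and `fit_context` names are hypothetical, and dropping memory one entry at a time, oldest first, is an assumption: the post only says memory goes after folder context and before code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    limit: int = 8192
    system_prompt: int = 1000
    reserved_output: int = 2000

    @property
    def room(self) -> int:
        """Tokens left for folder context, memory, and code."""
        return self.limit - self.system_prompt - self.reserved_output


def fit_context(code: int, memory: list, folder: int, budget: Budget = Budget()):
    """Return (keep_folder, kept_memory, code_allowance), all in tokens.

    Drop order: folder context first, then memory entries (oldest first,
    an assumption), and only then is the code itself truncated.
    """
    room = budget.room
    keep_folder = code + sum(memory) + folder <= room
    kept_memory = list(memory)
    if not keep_folder:
        while kept_memory and code + sum(kept_memory) > room:
            kept_memory.pop(0)
    code_allowance = min(code, room - sum(kept_memory))
    return keep_folder, kept_memory, code_allowance
```

With the article's numbers, `Budget().room` is 5192 tokens, and subtracting four memory entries of about 90 tokens each leaves 4832, which matches the ~4800 figure for code.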

Parallel execution as speed multiplier

Because each executor sees only one file, independent edits across files run simultaneously. A 5-file refactor whose edits don't depend on each other completes in roughly the time of the longest single edit, provided the backend can serve the calls concurrently. The dependency graph (built in code from the planner's task list) decides ordering.
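One way to schedule tasks from a dependency graph is sketched below, using Python's standard `graphlib.TopologicalSorter` and a thread pool. This is not necessarily how the original agent does it. The `run_tasks` name, the dict-of-sets graph format, and the `run_one` callback (which would wrap a single executor LLM call) are assumptions.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_tasks(dependencies: dict, run_one, max_workers: int = 4) -> list:
    """Run every task once all of its dependencies have finished.

    dependencies maps a task id to the set of task ids it must wait for.
    Returns task ids in completion order.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    finished = []
    pending = {}  # future -> task id
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit everything whose dependencies are already done.
            for task_id in sorter.get_ready():
                pending[pool.submit(run_one, task_id)] = task_id
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any executor failure
                task_id = pending.pop(future)
                sorter.done(task_id)  # may unlock dependent tasks
                finished.append(task_id)
    return finished
```

For example, `run_tasks({"a.py": set(), "b.py": set(), "c.py": {"a.py", "b.py"}}, edit_file)` would start the edits to `a.py` and `b.py` together and only then run `c.py`, where `edit_file` is a placeholder for the per-file executor call.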

Pain points and fixes

Question-style requests overwriting files: asking "how many lines does X have?" caused the executor to write the answer into the file. Fixed by adding an action_type: "query" field to the planner's output, routed through a code path that never touches disk.
Stale project maps causing silent misroutes: if the user mentioned a renamed file not in the map, the planner would silently route to the closest match. Now the orchestrator validates that mentioned file paths exist on disk and throws a clear error if they don't.
Markdown fences in executor output: smaller models wrap code in triple backticks even when told not to. Fix: strip them in post-processing instead of fighting the prompt.
Memory token cost: persistent memory adds ~80-90 tokens per entry, so it is the second thing dropped (after folder context) when the budget is tight, before any actual code gets cut.
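The first three fixes above are small enough to sketch together. The task dictionary shape (`action_type`, `file_path`), the function names, and the treatment of any non-"query" action as an edit are assumptions for illustration; only the `action_type: "query"` field comes from the post.

```python
import re
from pathlib import Path
from typing import Optional

# Matches output that is entirely wrapped in one triple-backtick fence.
FENCE = re.compile(r"^\s*```[^\n]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_fences(output: str) -> str:
    """Remove a wrapping Markdown fence if the model added one."""
    match = FENCE.match(output)
    return match.group(1) if match else output


def validate_paths(paths: list, root: Path) -> None:
    """Fail loudly instead of letting a stale map route to the wrong file."""
    missing = [path for path in paths if not (root / path).is_file()]
    if missing:
        raise FileNotFoundError(
            "Not found on disk (project map may be stale): " + ", ".join(missing)
        )


def apply_result(task: dict, llm_output: str, root: Path) -> Optional[str]:
    """Route a finished task: queries return text, edits write the file."""
    validate_paths([task["file_path"]], root)
    if task.get("action_type") == "query":
        return llm_output  # answer goes to the user; nothing touches disk
    (root / task["file_path"]).write_text(strip_fences(llm_output) + "\n")
    return None
```

Keeping the query branch ahead of the only `write_text` call means a question can never overwrite a file, whatever the model returns.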

Open questions

It is still unclear whether the planner/executor split scales to codebases over 50 files. The dependency graph stays manageable, but the project map starts costing real tokens. The current approach drops folder context first, which means deeper edits lose context. The implementation is open-sourced if you want to dig in.

📖 Read the full source: r/LocalLLaMA