Using a Local LLM as a Claude Code Subagent to Reduce Context Usage

Claude Code can orchestrate tasks by delegating to a local LLM running on your machine, similar to how it uses Claude subagents. This approach keeps file content out of Claude's context—only the local model's summary and insights are passed back.
How It Works
A small Python script (~120 lines, standard library only) runs an agent loop:
- You pass Claude a task description without file content
- The script sends it to LM Studio's
/v1/chat/completionsendpoint withread_fileandlist_dirtool definitions - The local model calls those tools itself to read the files it needs
- The loop continues until it produces a final answer
- Claude sees only the result
Example command:
python3 agent_lm.py --dir /path/to/project "summarize solar-system.html"
This results in:
- [turn 1] →
read_file({'path': 'solar-system.html'}) - [turn 2] → This HTML file creates an interactive animated solar system...
The file content goes into the local model's context (tested with Qwen's context), not Claude's.
Use Cases and Limitations
Based on testing with Qwen3.5 35B 4-bit via MLX on Apple Silicon, this approach is good for:
- Code summarization and explanation
- Bug finding
- Boilerplate / first-draft generation
- Text transformation and translation (tested with Hebrew)
- Logic tasks and reasoning (use
--thinkflag for harder problems)
It's not good for:
- Tasks that require Claude's full context
- Multi-file understanding where relationships matter
- Tasks needing the current conversation history
- Anything where accuracy is critical
Think of it as a Haiku-tier assistant, not a replacement for Claude.
Setup Requirements
- LM Studio running locally with the API server enabled
- One Python script for the agent loop, one for simple prompt-only queries
- Both wired into a global
~/.claude/CLAUDE.mdso Claude Code knows to offer delegation when relevant - No MCP server, no pip dependencies, no plugin infrastructure needed
Configuration tip: Add {%- set enable_thinking = false %} to the top of the Jinja template. For most tasks, you don't need the local model to reason, and this saves time and tokens while increasing speed with no real degradation in quality for such tasks.
📖 Read the full source: r/ClaudeAI
👀 See Also

Open source PR review agent PrixAI detects all 10/10 planted bugs at 6x lower cost than CodeRabbit
A Reddit user built PrixAI, an open source PR review agent that uses local/cheap inference models to match CodeRabbit's features at 6x less cost, detecting all 10 intentionally planted issues in a test PR.

Phantom: A Persistent AI Agent Built with Claude's Agent SDK
Phantom is an open-source Bun/TypeScript process that wraps Claude's Agent SDK (Opus 4.6) with persistent vector memory, a self-evolution engine, and an MCP server interface. It runs continuously on its own VM or Docker Compose and communicates via Slack.

SourceBridge: Open-source tool for codebase analysis using local LLMs
SourceBridge is an open-source tool that indexes Git repositories into symbol graphs and uses local LLMs to generate codebase summaries, architecture walkthroughs, and learning materials. It supports multiple local backends including Ollama, llama.cpp, vLLM, LM Studio, and SGLang via OpenAI-compatible APIs.

Kubeez MCP Server Connects Claude to 70+ AI Media Models
Kubeez has released an MCP server that connects Claude to over 70 AI models for image, video, music, and voice generation. The server supports OAuth authentication and provides async generation with Claude polling for status and returning CDN URLs.