Two Months with GitHub's Spec-Kit and Claude Code: What Works, What Doesn't

After two months of using GitHub's spec-kit for Spec-Driven Development (SDD) with Claude Code as the primary agent, a developer on r/LocalLLaMA reports on what works and what doesn't. The toolkit, available at github.com/github/spec-kit, enforces a five-phase workflow: Constitution, Specify, Plan, Tasks, Implement. The core idea: the spec, not the prompt, is the source of truth.
What's Actually Good
- Agent-agnostic: The same spec works with Claude Code, Cursor, Codex, Gemini CLI, and Copilot. The author generated code with Claude Code, then handed the same spec to Cursor for test refactoring without friction.
- Hard checkpoints between phases: The Plan phase shows the full proposed architecture before any code is written, so bad decisions get caught when they cost 5 minutes to fix rather than 5 hours.
- Constitution file as quality gate: You define inviolable rules up front: test coverage minimums, dependency allowlists, performance budgets, typing strictness. The agent fails its own validation if it tries to violate them.
- Improved determinism: Re-running the implement phase produces more consistent output than raw prompting, since the agent isn't filling in 30 implicit decisions.
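The hard-checkpoint idea can be pictured as a small state machine. This is only an illustrative sketch, not spec-kit's implementation: the `Phase` enum mirrors the five phases named above, while the `Workflow` class and its `approve`/`advance` methods are hypothetical names invented for this example.

```python
from enum import Enum


class Phase(Enum):
    CONSTITUTION = 1
    SPECIFY = 2
    PLAN = 3
    TASKS = 4
    IMPLEMENT = 5


class Workflow:
    """Hypothetical gate: the next phase is entered only after sign-off."""

    def __init__(self) -> None:
        self.current = Phase.CONSTITUTION
        self.approved = set()

    def approve(self) -> None:
        # A human reviews the current phase's output and signs off on it.
        self.approved.add(self.current)

    def advance(self) -> Phase:
        # Refuse to move on until the current phase has been approved.
        if self.current not in self.approved:
            raise RuntimeError(f"{self.current.name} has not been approved yet")
        if self.current is Phase.IMPLEMENT:
            raise RuntimeError("IMPLEMENT is the final phase")
        self.current = Phase(self.current.value + 1)
        return self.current
```

Calling `advance()` before `approve()` raises, which is the point of a checkpoint: no code gets written until the plan has been reviewed.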
What Annoys
- Drift is real: Editing code by hand without updating the spec desynchronizes the two quickly. spec-kit has tooling for this, but it is still early.
- Overhead for small changes: Bug fixes under 50 LOC or trivial features feel ceremonial. The author's rule: use full SDD only for new modules or features touching 200+ LOC.
- Legacy migration is painful: Retrofitting SDD onto a 30k-LOC codebase takes months.
- Quality depends on the agent: Claude Code (Sonnet/Opus 4.6+) handles it well; smaller models generate plans that compile but lack architectural reasoning.
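The author's sizing rule from the overhead point above is simple enough to write down. A minimal sketch, assuming the 200-LOC threshold applies to any change and that anything below it skips the full workflow; the function and constant names are made up for illustration.

```python
FULL_SDD_MIN_LOC = 200  # threshold taken from the author's rule of thumb


def needs_full_sdd(loc_touched: int, is_new_module: bool = False) -> bool:
    """Return True when a change warrants the full five-phase workflow.

    Assumption: new modules always qualify; everything else qualifies
    only once it touches 200 or more lines.
    """
    return is_new_module or loc_touched >= FULL_SDD_MIN_LOC
```

Under this reading, a 30-line bug fix returns `False` and a new module returns `True` regardless of size. The source does not say how to treat changes between 50 and 200 LOC, so that range is a judgment call.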
Practical Setup
- Install: uv tool install --from git+https://github.com/github/spec-kit.git specify-cli
  Only the official repo is safe; PyPI has typosquatters.
- Primary agent: Claude Code, with cross-validation on Cursor and Gemini CLI.
- Local persistence: SQLite (easy to spec/validate, no cloud dependency).
- Reusable constitution template: strict typing, pytest coverage >80%, explicit dependency allowlist, no cloud services unless required.
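To make the template concrete, here is a rough sketch of its rules as data plus a check. It does not reflect spec-kit's actual constitution format or validation. The `Constitution` class, the `check` function, and the placeholder allowlist are all assumptions, and the measurements (coverage, type errors) are expected to come from whatever tooling the project already runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constitution:
    min_coverage_pct: float = 80.0  # "pytest coverage >80%"
    allowed_dependencies: frozenset = frozenset({"pytest"})  # placeholder allowlist
    strict_typing: bool = True


def check(c, *, coverage_pct, dependencies, type_errors,
          uses_cloud, cloud_required=False):
    """Return a list of human-readable violations (empty means the gate passes)."""
    found = []
    # ">80%" is read as strictly greater than the minimum.
    if coverage_pct <= c.min_coverage_pct:
        found.append(f"coverage {coverage_pct:.1f}% is not above {c.min_coverage_pct:.0f}%")
    extra = sorted(set(dependencies) - c.allowed_dependencies)
    if extra:
        found.append("dependencies not on allowlist: " + ", ".join(extra))
    if c.strict_typing and type_errors > 0:
        found.append(f"{type_errors} type errors under strict typing")
    # "No cloud services unless required."
    if uses_cloud and not cloud_required:
        found.append("cloud service used without a stated requirement")
    return found
```

A run with 92% coverage, only allowlisted dependencies, no type errors, and no cloud usage yields an empty list; anything else returns one message per broken rule.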
Open Questions
- Can local models (Qwen, DeepSeek-Coder, GLM, Llama) handle Plan and Implement competently? The author found that small models follow the format but fail at architectural reasoning.
- Does multi-agent SDD work? Having one model write the spec, another implement it, and a third audit it is better in theory, but in practice it was not measurably better than a single agent.
📖 Read the full source: r/LocalLLaMA