GitHub Spec-Kit & Claude Code: 5 Workflow Phases Reviewed

After two months of using GitHub's spec-kit for Spec-Driven Development (SDD) with Claude Code as the primary agent, a developer on r/LocalLLaMA reports on what works and what doesn't. The toolkit, available at github.com/github/spec-kit, enforces a five-phase workflow: Constitution, Specify, Plan, Tasks, Implement. The core idea: the spec, not the prompt, is the source of truth.

What's Actually Good

Agent-agnostic: The same spec works with Claude Code, Cursor, Codex, Gemini CLI, and Copilot. The author generated code with Claude Code, then handed the spec to Cursor for test refactoring without friction.
Hard checkpoints between phases: The Plan phase shows the full proposed architecture before any code is written, so bad decisions get caught when they cost 5 minutes to fix instead of 5 hours.
Constitution file as quality gate: You define inviolable rules up front: test coverage minimums, dependency allowlists, perf budgets, typing strictness. The agent fails its own validation if it tries to violate them.
Improved determinism: Re-running the Implement phase produces more consistent output than raw prompting, because the agent is no longer filling in 30 implicit decisions on its own.
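
The constitution-as-quality-gate idea above can be illustrated with a small Python sketch. This is not spec-kit's actual validation mechanism or file format; the `Constitution` class, its field names, the `check_change` function, and the default values are all hypothetical and only show what checking a change against fixed rules might look like.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Constitution:
    """Hypothetical rule set; NOT spec-kit's real constitution format."""

    min_coverage: float = 80.0  # percent; placeholder threshold
    allowed_dependencies: frozenset = field(
        default_factory=lambda: frozenset({"pytest", "sqlite3"})  # placeholder allowlist
    )


def check_change(constitution, coverage, dependencies):
    """Return a list of violation messages; an empty list means the gate passes."""
    violations = []
    if coverage < constitution.min_coverage:
        violations.append(
            f"coverage {coverage:.1f}% is below the minimum "
            f"{constitution.min_coverage:.1f}%"
        )
    # Report every dependency that is not on the allowlist, in a stable order.
    for dep in sorted(set(dependencies) - constitution.allowed_dependencies):
        violations.append(f"dependency '{dep}' is not on the allowlist")
    return violations
```

Calling `check_change(Constitution(), 72.5, ["pytest", "requests"])` would report two violations under these assumed rules: one for coverage and one for `requests`.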

What Annoys

Drift is real: Editing code by hand without updating the spec quickly puts the two out of sync. spec-kit has tooling for this, but it is still early.
Overhead for small changes: For bug fixes under 50 LOC or trivial features, the process feels ceremonial. The author's rule: use full SDD only for new modules or features touching 200+ LOC.
Legacy migration is painful: Retrofitting SDD onto a 30k-LOC codebase takes months.
Quality depends on the agent: Claude Code (Sonnet/Opus 4.6+) handles the workflow well; smaller models generate plans that compile but lack architectural reasoning.
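
One generic way to notice the drift described above is to fingerprint the spec and the code at a known-good point and compare later. The sketch below is an assumption for illustration, not spec-kit's own drift tooling; `fingerprint`, `snapshot`, and `detect_drift` are hypothetical helper names.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the given text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def snapshot(spec_text: str, code_text: str) -> dict:
    """Record fingerprints of spec and code at a point where they agree."""
    return {"spec": fingerprint(spec_text), "code": fingerprint(code_text)}


def detect_drift(baseline: dict, spec_text: str, code_text: str) -> bool:
    """Return True when the code changed since the baseline but the spec did not."""
    spec_changed = fingerprint(spec_text) != baseline["spec"]
    code_changed = fingerprint(code_text) != baseline["code"]
    return code_changed and not spec_changed
```

A check like this only flags that a manual edit happened without a matching spec update; it cannot say whether the edit actually contradicts the spec.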

Practical Setup

Install: uv tool install --from git+https://github.com/github/spec-kit.git specify-cli. Install only from the official repo; the author warns that PyPI has typosquatters.
Primary agent: Claude Code, with cross-validation on Cursor and Gemini CLI.
Local persistence: SQLite (easy to spec/validate, no cloud dependency).
Reusable constitution template: strict typing, pytest coverage >80%, explicit dependency allowlist, no cloud services unless required.
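
To make the "easy to spec/validate" point about SQLite concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `tasks` table, its columns, and the `open_store` and `add_task` helpers are hypothetical and do not come from the author's project; the point is only that constraints written in the schema can be stated in a spec and enforced by the database itself, with no cloud service involved.

```python
import sqlite3

# Hypothetical schema: the CHECK constraints encode rules a spec could state.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL CHECK (length(title) > 0),
    status TEXT NOT NULL DEFAULT 'todo'
           CHECK (status IN ('todo', 'doing', 'done'))
);
"""


def open_store(path=":memory:"):
    """Open (or create) a local SQLite store and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_task(conn, title, status="todo"):
    """Insert a task and return its row id; invalid rows raise sqlite3.IntegrityError."""
    with conn:  # commits on success, rolls back if the insert fails
        cur = conn.execute(
            "INSERT INTO tasks (title, status) VALUES (?, ?)", (title, status)
        )
    return cur.lastrowid
```

Passing a file path instead of the default `":memory:"` keeps the data on local disk between runs.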

Open Questions

Can local models (Qwen, DeepSeek-Coder, GLM, Llama) handle Plan and Implement competently? The author found that small models follow the format but fail at architectural reasoning.
Does multi-agent SDD work? Writing the spec with one model, implementing with another, and auditing with a third is better in theory, but in practice the author found it not measurably better than single-agent.

📖 Read the full source: r/LocalLLaMA