Two Months with GitHub's Spec-Kit and Claude Code: What Works, What Doesn't

✍️ OpenClawRadar · 📅 Published: May 15, 2026

After two months of using GitHub's spec-kit for Spec-Driven Development (SDD) with Claude Code as the primary agent, a developer on r/LocalLLaMA reports on what works and what doesn't. The toolkit, available at github.com/github/spec-kit, enforces a five-phase workflow: Constitution, Specify, Plan, Tasks, Implement. The core idea: the spec, not the prompt, is the source of truth.

What's Actually Good

  • Agent-agnostic: The same spec works with Claude Code, Cursor, Codex, Gemini CLI, and Copilot. The author generated code with Claude Code, then handed the same spec to Cursor for test refactoring without friction.
  • Hard checkpoints between phases: The Plan phase shows the full proposed architecture before any code is written, so a bad decision gets caught when it costs 5 minutes to fix instead of 5 hours.
  • Constitution file as quality gate: You define inviolable rules up front, such as test coverage minimums, dependency allowlists, perf budgets, and typing strictness. The agent fails its own validation if it tries to violate them.
  • Improved determinism: Re-running the implement phase produces more consistent output than raw prompting, since the agent isn't filling in 30 implicit decisions.
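
The hard-checkpoint idea can be illustrated with a small gate that refuses to start a phase until every earlier phase has been signed off. This is only a sketch of the concept: the Phase and Workflow names are hypothetical and are not part of spec-kit, which may track phase state differently.

```python
from enum import IntEnum


class Phase(IntEnum):
    # The five phases named in the article, in workflow order.
    CONSTITUTION = 1
    SPECIFY = 2
    PLAN = 3
    TASKS = 4
    IMPLEMENT = 5


class Workflow:
    """Hypothetical gate: a phase can start only after all earlier phases are approved."""

    def __init__(self) -> None:
        self.approved: set[Phase] = set()

    def can_start(self, phase: Phase) -> bool:
        # Every earlier phase must already be approved by a reviewer.
        return all(p in self.approved for p in Phase if p < phase)

    def approve(self, phase: Phase) -> None:
        # Approval is gated too, so phases cannot be signed off out of order.
        if not self.can_start(phase):
            raise RuntimeError(f"{phase.name} cannot be approved before earlier phases")
        self.approved.add(phase)


wf = Workflow()
wf.approve(Phase.CONSTITUTION)
wf.approve(Phase.SPECIFY)
print(wf.can_start(Phase.PLAN))       # True: the plan can now be drafted and reviewed
print(wf.can_start(Phase.IMPLEMENT))  # False: Plan and Tasks are not approved yet
```

The point of the gate is that the cheap review happens at Plan, before Implement is even reachable.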

What Annoys

  • Drift is real: Editing code by hand without updating the spec causes the two to desync quickly. spec-kit has tooling for this, but it is still early.
  • Overhead for small changes: Bug fixes under 50 LOC or trivial features feel ceremonial. The author's rule: use full SDD only for new modules or features touching 200+ LOC.
  • Legacy migration is painful: Retrofitting SDD onto a 30k-LOC codebase takes months.
  • Quality depends on the agent: Claude Code (Sonnet/Opus 4.6+) handles it well; smaller models generate plans that compile but lack architectural reasoning.
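
The author's sizing rule is simple enough to write down as a predicate. The function name and parameters below are hypothetical; only the 200 LOC threshold and the "new module" condition come from the post.

```python
def needs_full_sdd(changed_loc: int, is_new_module: bool, threshold: int = 200) -> bool:
    """Return True when a change is large enough to justify the full five-phase flow."""
    if changed_loc < 0:
        raise ValueError("changed_loc must be non-negative")
    # New modules always get a spec; otherwise only changes at or above the threshold do.
    return is_new_module or changed_loc >= threshold


print(needs_full_sdd(40, is_new_module=False))   # False: small bug fix, skip the ceremony
print(needs_full_sdd(350, is_new_module=False))  # True: large feature
print(needs_full_sdd(80, is_new_module=True))    # True: new module, regardless of size
```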

Practical Setup

  • Install: uv tool install --from git+https://github.com/github/spec-kit.git specify-cli. Install only from the official repo; the author warns that PyPI has typosquatters.
  • Primary agent: Claude Code, with cross-validation on Cursor and Gemini CLI.
  • Local persistence: SQLite (easy to spec/validate, no cloud dependency).
  • Reusable constitution template: strict typing, pytest coverage >80%, explicit dependency allowlist, no cloud services unless required.
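
Two of the constitution rules above (the coverage floor and the dependency allowlist) are mechanical enough to check in a script. The sketch below is an illustration only, not how spec-kit stores or enforces a constitution; the Constitution class, the violations function, and the package names in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constitution:
    # Values mirror the template in the post: coverage must exceed 80%.
    min_coverage_pct: float = 80.0
    allowed_dependencies: frozenset[str] = frozenset()


def violations(rules: Constitution, coverage_pct: float, dependencies: list[str]) -> list[str]:
    """Return human-readable rule violations; an empty list means the gate passes."""
    problems = []
    # The template says ">80%", so exactly 80% still fails.
    if coverage_pct <= rules.min_coverage_pct:
        problems.append(
            f"coverage {coverage_pct:.1f}% does not exceed {rules.min_coverage_pct:.1f}%"
        )
    # Anything outside the explicit allowlist is reported, in sorted order.
    for dep in sorted(set(dependencies) - rules.allowed_dependencies):
        problems.append(f"dependency not on allowlist: {dep}")
    return problems


rules = Constitution(allowed_dependencies=frozenset({"pytest", "sqlalchemy"}))
for problem in violations(rules, 78.2, ["pytest", "requests"]):
    print(problem)
```

A check like this would sit alongside the agent's own validation, for example in CI, so the rules hold even when code is edited by hand.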

Open Questions

  • Can local models (Qwen, DeepSeek-Coder, GLM, Llama) handle Plan and Implement competently? The author found that small models follow the format but fail at architectural reasoning.
  • Does multi-agent SDD work? Having one model write the spec, another implement it, and a third audit it is better in theory, but in the author's experience it was not measurably better than a single agent.

📖 Read the full source: r/LocalLLaMA
