Needle: A 26M Parameter Tool-Calling Model Built Entirely Without FFNs
Needle is a 26M-parameter model designed specifically for single-shot function calling. It is built from cross-attention and gating layers with zero feed-forward networks (FFNs), based on the insight that tool calling is retrieval and assembly (match the query to a tool name, extract argument values, emit JSON) rather than reasoning. The model runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
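To make the retrieval-and-assembly framing concrete, here is a minimal hand-written sketch of the three steps named above. It is not Needle's code and involves no model: the tool names, keyword sets, and regex patterns are hypothetical, chosen only to show the shape of a single-shot call.

```python
import json
import re

# Hypothetical tool definitions; names, keywords, and patterns are illustrative only.
TOOLS = [
    {
        "name": "set_timer",
        "keywords": {"timer", "countdown"},
        "pattern": r"(?P<minutes>\d+)\s*minutes?",
    },
    {
        "name": "send_message",
        "keywords": {"message", "text"},
        "pattern": r"(?:message|text)\s+(?P<recipient>\w+)",
    },
]


def assemble_call(query: str) -> str:
    """Match the query to a tool, extract arguments, and emit a JSON call."""
    words = set(re.findall(r"\w+", query.lower()))
    # Step 1, retrieval: pick the tool whose keywords overlap the query most.
    # (On a tie, max() returns the first tool in the list.)
    tool = max(TOOLS, key=lambda t: len(t["keywords"] & words))
    # Step 2, extraction: copy argument values straight out of the query text.
    match = re.search(tool["pattern"], query, flags=re.IGNORECASE)
    args = match.groupdict() if match else {}
    # Step 3, assembly: emit the structured call as JSON.
    return json.dumps({"name": tool["name"], "arguments": args})


print(assemble_call("Set a timer for 10 minutes"))
print(assemble_call("Text Alice that I'm late"))
```

Each step here is lookup or copying rather than multi-step reasoning, which is the property the authors' argument rests on.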
Training Details
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Data synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
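A quick back-of-the-envelope calculation puts these figures in perspective. This sketch assumes the quoted times are pure training wall-clock time and that "16 TPU v6e" means 16 chips. The post-training hardware is not stated, so no per-chip figure is computed for it.

```python
# Figures quoted in the training details above.
PRETRAIN_TOKENS = 200e9
PRETRAIN_HOURS = 27
NUM_CHIPS = 16  # assumption: "16 TPU v6e" means 16 chips
POSTTRAIN_TOKENS = 2e9
POSTTRAIN_MINUTES = 45
MODEL_PARAMS = 26e6

# Aggregate and per-chip pretraining throughput in tokens per second.
pretrain_tok_per_s = PRETRAIN_TOKENS / (PRETRAIN_HOURS * 3600)
pretrain_tok_per_s_per_chip = pretrain_tok_per_s / NUM_CHIPS

# Post-training throughput (hardware unknown, so aggregate only).
posttrain_tok_per_s = POSTTRAIN_TOKENS / (POSTTRAIN_MINUTES * 60)

# Pretraining tokens seen per model parameter.
tokens_per_param = PRETRAIN_TOKENS / MODEL_PARAMS

print(f"pretraining:   {pretrain_tok_per_s:,.0f} tok/s total")
print(f"               {pretrain_tok_per_s_per_chip:,.0f} tok/s per chip")
print(f"post-training: {posttrain_tok_per_s:,.0f} tok/s total")
print(f"tokens per parameter: {tokens_per_param:,.0f}")
```

Under these assumptions, pretraining works out to roughly 2 million tokens per second in aggregate and several thousand tokens per parameter.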
Architecture: Simple Attention Networks
The entire model is just attention and gating, with no MLPs anywhere. The authors argue that FFN parameters are wasted at this scale for tool calling, and that the 'no FFN' finding generalizes to any task where the model has access to external structured knowledge, such as retrieval-augmented generation (RAG) and tool use. The model doesn't need to memorize facts in FFN weights if the facts are provided in the input.
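For a sense of how much parameter budget is at stake, the sketch below counts weights in one layer of a standard transformer. It assumes the common layout of four d×d attention projections (query, key, value, output) plus a two-matrix FFN with 4x expansion, ignoring biases and normalization. The width of 512 is a hypothetical value, not Needle's actual dimension, and Needle's own layers may be laid out differently.

```python
def layer_params(d_model: int, ffn_mult: int = 4) -> dict:
    """Count weights in one standard transformer layer (biases and norms ignored)."""
    # Attention: query, key, value, and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # FFN: an up-projection and a down-projection with hidden width ffn_mult * d_model.
    ffn = 2 * d_model * (ffn_mult * d_model)
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_fraction": ffn / (attention + ffn),
    }


# Hypothetical width, chosen only for illustration.
counts = layer_params(d_model=512)
print(counts)
```

Under these assumptions the FFN holds two thirds of each layer's weights regardless of width, which is the budget a no-FFN design can spend elsewhere or drop entirely.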
Benchmarks
Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, though those models have more capacity for conversational settings.
How to Use
# Test the model via the playground or finetune on your Mac/PC
git clone https://github.com/cactus-compute/needle
- GitHub: github.com/cactus-compute/needle
- Weights: huggingface.co/Cactus-Compute/needle
- Architecture writeup: Simple Attention Networks docs
- Inference engine for mobile/wearables (Cactus): github.com/cactus-compute/cactus
Everything is MIT licensed.
📖 Read the full source: r/LocalLLaMA