Needle: A 26M Parameter Function-Calling Model That Runs at 6000 tok/s on Mobile

✍️ OpenClawRadar · 📅 Published: May 12, 2026

Cactus has open-sourced Needle, a 26M parameter function-calling model designed to run on budget phones, watches, and glasses. It achieves 6000 tok/s prefill and 1200 tok/s decode on consumer devices using the company's custom inference engine, also named Cactus.
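To put those throughput figures in perspective, here is a rough back-of-the-envelope latency estimate. Only the two tok/s rates come from the announcement; the prompt and output token counts are hypothetical, and the estimate ignores model load time, tokenization, and any other overhead.

```python
# Rates quoted in the article (tokens per second).
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200


def estimate_latency_ms(prompt_tokens, output_tokens):
    """Naive latency estimate: prefill time plus decode time, in milliseconds."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return (prefill_s + decode_s) * 1000.0


# Hypothetical request: 300 tokens of tool schema plus query, 30 tokens of output.
latency = estimate_latency_ms(prompt_tokens=300, output_tokens=30)
print(f"{latency:.0f} ms")
```

Under these assumed sizes a single tool call would take well under a tenth of a second, which is the kind of budget that makes on-device function calling plausible.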

Architecture: Simple Attention Networks

Needle uses a Simple Attention Network: there are no MLPs anywhere, and the entire model consists of attention and gating layers. Key design: model dimension d=512, 8 query heads sharing 4 key/value heads (8H/4KV), and a BPE vocabulary of 8192 tokens, in an encoder-decoder structure (12 encoder layers, 8 decoder layers) using cross-attention, masked self-attention with RoPE, and tied embeddings.
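The 8H/4KV notation describes grouped-query attention, where several query heads share one key/value head. The sketch below works out what that implies for the shapes. It is not Needle's code: only the three constants come from the article, and the per-head width assumes the common convention head_dim = d / heads, which the article does not confirm.

```python
# Values from the article.
D_MODEL = 512
N_HEADS = 8
N_KV_HEADS = 4


def gqa_layout(d_model, n_heads, n_kv_heads):
    """Shape arithmetic for grouped-query attention (illustrative only)."""
    if d_model % n_heads != 0 or n_heads % n_kv_heads != 0:
        raise ValueError("dimensions must divide evenly")
    head_dim = d_model // n_heads        # assumed convention, not confirmed
    group_size = n_heads // n_kv_heads   # query heads per key/value head
    return {
        "head_dim": head_dim,
        "group_size": group_size,
        # Index of the key/value head that each query head reads from.
        "kv_head_for_query": [h // group_size for h in range(n_heads)],
        # Width of the K (or V) projection output per token.
        "kv_width": n_kv_heads * head_dim,
    }


layout = gqa_layout(D_MODEL, N_HEADS, N_KV_HEADS)
print(layout)
```

With these numbers each key/value head serves two query heads, so the K and V projections are half as wide as the query projection, which shrinks the cache a small device has to hold during decoding.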

Training Details

  • Pretrained on 200B tokens across 16 TPU v6e (27 hours)
  • Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
  • Data synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)
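The figures above imply some useful ratios, sketched below. All inputs are taken from the article. The hardware used for post-training is not stated, so only an aggregate rate is computed for that phase.

```python
# Figures quoted in the article.
PARAMS = 26e6
PRETRAIN_TOKENS = 200e9
PRETRAIN_HOURS = 27
PRETRAIN_CHIPS = 16
POSTTRAIN_TOKENS = 2e9
POSTTRAIN_MINUTES = 45


def tokens_per_second(tokens, seconds):
    """Average training throughput over the whole run."""
    return tokens / seconds


pretrain_rate = tokens_per_second(PRETRAIN_TOKENS, PRETRAIN_HOURS * 3600)
pretrain_rate_per_chip = pretrain_rate / PRETRAIN_CHIPS
posttrain_rate = tokens_per_second(POSTTRAIN_TOKENS, POSTTRAIN_MINUTES * 60)
tokens_per_param = PRETRAIN_TOKENS / PARAMS

print(f"pretraining:   {pretrain_rate:,.0f} tok/s total, "
      f"{pretrain_rate_per_chip:,.0f} tok/s per chip")
print(f"post-training: {posttrain_rate:,.0f} tok/s total")
print(f"pretraining tokens per parameter: {tokens_per_param:,.0f}")
```

The pretraining run averages roughly two million tokens per second across the 16 chips, and the model sees several thousand tokens per parameter.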

Benchmark Results

Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling. However, those models are larger, cover a broader range of tasks, and excel in conversational settings.

Quickstart

git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground

This opens a web UI at http://127.0.0.1:7860 for testing the model and fine-tuning it on your own tools.

Usage (Python)

from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model,
    params,
    tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)

[{"name":"get_weather","arguments":{"location":"San Francisco"}}]
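The output above is a JSON list of tool calls, so an application still has to route each call to real code. A minimal sketch of that step follows. It does not use the needle package. The `get_weather` stub, `TOOL_REGISTRY`, and `dispatch` are hypothetical names for illustration, and the output format is assumed to match the example shown.

```python
import json


def get_weather(location):
    # Hypothetical stand-in for a real weather lookup.
    return f"(stub) weather for {location}"


# Maps tool names the model may emit to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(model_output, registry):
    """Parse the model's JSON tool calls and run each one against the registry."""
    calls = json.loads(model_output)
    results = []
    for call in calls:
        name = call["name"]
        if name not in registry:
            raise KeyError(f"model requested unknown tool: {name}")
        results.append(registry[name](**call.get("arguments", {})))
    return results


# The example output string from the article.
output = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
print(dispatch(output, TOOL_REGISTRY))
```

Rejecting unknown tool names rather than guessing keeps a malformed or hallucinated call from silently doing the wrong thing. A production version would also validate the argument names and types against the tool schema.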

Fine-tuning Locally

# via playground (auto-generates data via Gemini)
needle playground

# or provide your own data
needle finetune data.jsonl

Availability

Weights are on Hugging Face: Cactus-Compute/needle. Everything is MIT licensed.

📖 Read the full source: HN AI Agents
