ANE Optimization Through Phone-Steered AI Experiments Shows Kernel Fusion Benefits

A developer conducted 55 optimization experiments on the autoresearch-ane fork, primarily steering the process from their phone on a Saturday. The work focused on Apple Neural Engine (ANE) performance improvements through kernel optimization and architectural changes.
Performance Improvements
The experiments yielded measurable gains across several metrics:
- Validation loss decreased from 3.75 (itself a regression from a previously optimized 3.2) to 2.49
- Step time improved from 176ms to 96ms
- ANE utilization increased from 3.6% to 6.5%
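To put the three figures on a common scale, the arithmetic can be sketched as below. The before/after numbers are the ones reported in the post; the helper name `relative_change` and the dictionary keys are illustrative, not taken from the repository.

```python
# Before/after figures as reported in the post; names here are illustrative.
def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a decrease)."""
    return (after - before) / before

metrics = {
    "validation_loss": (3.75, 2.49),
    "step_time_ms": (176.0, 96.0),
    "ane_utilization_pct": (3.6, 6.5),
}

# Relative change per metric, plus the step-time speedup as a ratio.
summary = {name: relative_change(b, a) for name, (b, a) in metrics.items()}
step_speedup = metrics["step_time_ms"][0] / metrics["step_time_ms"][1]

for name, change in summary.items():
    print(f"{name}: {change:+.1%}")
print(f"step-time speedup: {step_speedup:.2f}x")
```

This works out to roughly a 34% lower validation loss, a 45% shorter step (about 1.83x faster), and ANE utilization about 1.8x its starting value, though still in single digits in absolute terms.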
Key Technical Change
The most significant improvement came from kernel fusion: "Fusing 3 ANE kernels into 1 mega-kernel eliminated 12 IOSurface round-trips per step - that single change beat every hyperparameter tweak combined." This architectural optimization proved more impactful than parameter adjustments.
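Why fusion removes round-trips can be illustrated with a toy model. This is only a sketch under stated assumptions: `CountingSurface` is a hypothetical stand-in for a shared buffer, the three "kernels" are arbitrary element-wise functions, and nothing here calls an Apple API or reproduces the post's figure of 12. It shows only that chaining kernels through a shared buffer costs a read and a write per kernel, while a fused kernel pays that cost once.

```python
# Toy model: a "surface" is a shared buffer whose reads and writes are counted.
# Hypothetical names throughout; this does not use IOSurface or any ANE API.
class CountingSurface:
    def __init__(self, data):
        self.data = list(data)
        self.transfers = 0  # number of reads plus writes against the buffer

    def read(self):
        self.transfers += 1
        return list(self.data)

    def write(self, data):
        self.transfers += 1
        self.data = list(data)


# Three stand-in element-wise kernels.
def scale(xs):
    return [2.0 * x for x in xs]

def shift(xs):
    return [x + 1.0 for x in xs]

def relu(xs):
    return [max(0.0, x) for x in xs]


def run_unfused(surface):
    """Each kernel reads its input from the surface and writes its output back."""
    for kernel in (scale, shift, relu):
        surface.write(kernel(surface.read()))
    return surface.data


def run_fused(surface):
    """One read, all three kernels applied in place, one write."""
    xs = surface.read()
    surface.write(relu(shift(scale(xs))))
    return surface.data


inputs = [-2.0, -0.25, 0.5, 3.0]
unfused, fused = CountingSurface(inputs), CountingSurface(inputs)
unfused_out = run_unfused(unfused)
fused_out = run_fused(fused)

print(unfused_out == fused_out)            # same result either way
print(unfused.transfers, fused.transfers)  # buffer traffic: unfused vs fused
```

In this toy setup the outputs are identical while the buffer traffic drops from six transfers to two; the real saving depends on how many intermediate surfaces the original kernels exchanged per step.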
Workflow Details
The developer used an unconventional approach:
- Ran experiments remotely, steering from their phone in brief moments
- Used Claude for brainstorming and pulling insights from public sources listed in the repository README
- Approached the problem with "short attention and minimal token input" - speculating on directions rather than dictating precise steps
- Completed 55 experiments with "several cases of actual typing"
- Worked in non-destructive mode only due to permission constraints ("no rm -rf /* and such")
Main Learning
Beyond the technical improvements, the developer noted: "Main learning isn't the improvement itself. It's that short attention and minimal token input - brainstorming direction, not dictating steps - can produce real measurable gains on a hard systems problem."
The work ran on the developer's laptop. They also flag an unexplained mismatch in the reported acceptance-rate figures for the experiments: "55vs45 not quite mathing."
📖 Read the full source: r/LocalLLaMA