Orion: Bypassing CoreML to Run and Train LLMs Directly on Apple Neural Engine

Direct ANE Access for LLM Workloads
Orion is an end-to-end Objective-C system that bypasses CoreML entirely to run and train LLMs directly on the Apple Neural Engine (ANE). Until now, CoreML has treated the ANE as a black box behind its own scheduler, leaving developers with no direct control over the hardware and no way to train on it. Orion gives that control back.
Technical Implementation and Constraints
The project builds on reverse-engineering work that mapped the private ANEClient and ANECompiler APIs. The ANE presents what the developer calls a "hardware impedance mismatch" with 17 total programming constraints, 11 of which were completely undocumented. Key constraints include:
- The concat operation causes an immediate, silent compiler failure
- BLOBFILE weights require a 64-byte offset from the chunk header; any other offset produces silent numerical corruption
- The ANE maintains internal state that caps compilations at ~119 per process, after which compilation fails silently
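Because all three of these failures are silent, one natural response is a pre-flight check that fails loudly before anything reaches the compiler. The sketch below is illustrative only and is not taken from Orion: `WeightBlob`, `check_graph`, and the op names are hypothetical, and it assumes the 64-byte rule means the weight data starts exactly 64 bytes after the start of the chunk header.

```python
from dataclasses import dataclass

# Values quoted in the constraint list above.
WEIGHT_OFFSET_BYTES = 64         # required offset of weights from the chunk header
MAX_COMPILES_PER_PROCESS = 119   # approximate per-process compilation cap
UNSUPPORTED_OPS = {"concat"}     # ops reported to fail silently in the compiler


@dataclass
class WeightBlob:
    """Hypothetical description of one weight chunk in a blob file."""
    name: str
    header_offset: int  # byte offset of the chunk header
    data_offset: int    # byte offset where the weight data starts


def check_graph(op_names, blobs, compiles_so_far):
    """Return a list of human-readable violations (empty if none were found)."""
    problems = []
    for op in op_names:
        if op in UNSUPPORTED_OPS:
            problems.append(f"op '{op}' is known to fail silently")
    for blob in blobs:
        gap = blob.data_offset - blob.header_offset
        if gap != WEIGHT_OFFSET_BYTES:
            problems.append(
                f"blob '{blob.name}' starts {gap} bytes after its header, "
                f"expected {WEIGHT_OFFSET_BYTES}"
            )
    if compiles_so_far >= MAX_COMPILES_PER_PROCESS:
        problems.append("compilation budget for this process is exhausted")
    return problems


if __name__ == "__main__":
    issues = check_graph(["matmul", "concat"], [WeightBlob("w0", 0, 32)], 119)
    for issue in issues:
        print("constraint violation:", issue)
```

Returning a list instead of raising on the first problem lets a caller report every violation in one pass, which is useful when the underlying failure gives no diagnostics of its own.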
Solutions to Training Challenges
Previous attempts at ANE training hit NaN divergence after a single step. Orion solves this by:
- Wiring up a deferred compilation pipeline
- Implementing strict activation clamping, to the finite fp16 range of -65504 to +65504, to stop the fp16 overflow cascade
- Using an exec() process restart loop after every training step to bypass the 119-compilation limit
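The clamping and restart steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Orion's code (which is Objective-C): the function names, the `--step` argument, and the JSON checkpoint are all hypothetical, and a real trainer would also need to persist weights and optimizer state before restarting.

```python
import json
import os
import sys

FP16_MAX = 65504.0  # largest finite fp16 value, the clamp bound quoted above


def clamp_fp16(values):
    """Clamp each activation into [-FP16_MAX, FP16_MAX]."""
    return [max(-FP16_MAX, min(FP16_MAX, float(v))) for v in values]


def save_checkpoint(path, state):
    """Write training state to disk so a fresh process can resume from it."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)


def load_checkpoint(path):
    """Read back the state written by save_checkpoint."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_restart_argv(script_path, next_step):
    """Build the argv for the process that will run the next training step."""
    return [sys.executable, script_path, "--step", str(next_step)]


def restart_for_next_step(script_path, checkpoint_path, state):
    """Save state, then replace the current process with a fresh one.

    A new process starts with a fresh compilation count, which is the
    point of the restart loop. os.execv does not return on success.
    """
    save_checkpoint(checkpoint_path, state)
    argv = build_restart_argv(script_path, state["step"] + 1)
    os.execv(argv[0], argv)
```

In this sketch a training script would run one step, call `restart_for_next_step`, and on startup call `load_checkpoint` to pick up where the previous process stopped. Nothing here calls `restart_for_next_step` at import time, so the helpers can be exercised without replacing the running process.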
Performance Results
The compiler lowers a 27-operation graph IR through five optimization passes down to ANE-native MIL. Current performance includes:
- 170+ tokens/s for GPT-2 124M decode
- Mechanically stable multi-step training on a 110M parameter transformer (the "coherence ceiling" of the hardware)
- Over 1,000 steps, loss dropped from 12.3 to 6.2 with zero NaNs
Current Limitations
The ANE bakes weights in at compile time, so every training update incurs a ~4.2s recompilation penalty. The ANE delivers ~19 TFLOPS in fp16, but the fundamental obstacle to using it has not been compute; it has been the complete lack of a native orchestration layer.
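To put the recompilation penalty in context, the short calculation below multiplies it across the 1,000-step run mentioned earlier. It assumes exactly one recompilation per step at a constant ~4.2s, which the source does not confirm in detail, so treat the result as a rough estimate.

```python
RECOMPILE_SECONDS = 4.2  # approximate per-update recompilation cost quoted above
TRAINING_STEPS = 1000    # length of the training run quoted above


def recompile_overhead_seconds(steps, seconds_per_recompile=RECOMPILE_SECONDS):
    """Total time spent recompiling, assuming one recompilation per step."""
    return steps * seconds_per_recompile


total = recompile_overhead_seconds(TRAINING_STEPS)
print(f"{total:.0f} s of recompilation, about {total / 60:.0f} minutes")
```

Under those assumptions, recompilation alone accounts for roughly 4,200 seconds, or about 70 minutes, of a 1,000-step run, before any compute is counted.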
📖 Read the full source: r/LocalLLaMA