Reverse Engineering Apple Neural Engine for Training MicroGPT Models

Direct Access to Apple's Neural Engine
A developer has bypassed Apple's CoreML framework to access the Apple Neural Engine (ANE) directly on an M4 Mac mini, building a custom training pipeline for small language models. The project involved reverse engineering the ANE's private APIs with the help of Claude, then running benchmarks and implementing training without going through CoreML, the interface Apple recommends.
Technical Specifications and Performance
The ANE on the M4 chip is rated at a claimed 38 trillion operations per second (TOPS) of INT8 compute, though the developer notes it is actually an FP16 processor, which puts effective compute at roughly half that amount, about 19 TFLOPS. Peak compute on the ANE draws only 2.8 W, which the developer reports as 6.6 TFLOPS/watt. For comparison, the developer cites approximately 1 TFLOPS/watt for the Metal GPU and 1.4 TFLOPS/watt for NVIDIA's H100.
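The efficiency figure follows from simple arithmetic on the numbers quoted above. The following is a minimal sketch, assuming the FP16 rate is exactly half the claimed figure of 38; the variable and function names are illustrative and do not come from the project.

```python
# Back-of-the-envelope check of the efficiency figures quoted in the text.
# Assumption: effective FP16 throughput is exactly half the claimed INT8 figure.
CLAIMED_INT8_RATE = 38.0   # claimed figure for the M4 ANE (trillions of ops/s)
PEAK_POWER_WATTS = 2.8     # reported ANE power draw at peak compute


def tflops_per_watt(tflops: float, watts: float) -> float:
    """Throughput per watt for a given throughput and power draw."""
    return tflops / watts


effective_fp16_tflops = CLAIMED_INT8_RATE / 2
ane_efficiency = tflops_per_watt(effective_fp16_tflops, PEAK_POWER_WATTS)

# Comparison figures exactly as quoted in the text (TFLOPS/watt).
QUOTED = {"Metal GPU": 1.0, "NVIDIA H100": 1.4}

print(f"Effective FP16 compute: {effective_fp16_tflops:.1f} TFLOPS")
print(f"ANE efficiency: {ane_efficiency:.2f} TFLOPS/watt")
for name, efficiency in QUOTED.items():
    print(f"ANE vs {name}: {ane_efficiency / efficiency:.1f}x")
```

Halving 38 exactly gives about 6.8 TFLOPS/watt at 2.8 W, slightly above the reported 6.6. The reported value would correspond to about 18.5 TFLOPS at the same power draw, so it may be based on a slightly lower throughput figure than a clean halving.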
Training Implementation
The developer created a bespoke training pipeline that successfully trained a 110M-parameter MicroGPT model on the ANE. While a single chip can't practically train larger models, the developer suggests that a cluster of ANE devices could theoretically do so. Even on a single device, LoRA training for 3B- or 7B-parameter models should be feasible.
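One reason LoRA is a plausible fit for a single device is that it freezes the base weights and trains only small low-rank adapter matrices. The sketch below counts trainable adapter parameters for a hypothetical 7B-class transformer. The hidden size, layer count, rank, and choice of adapted projections are all assumptions for illustration, not details from the project.

```python
# Rough count of trainable parameters under LoRA.
# Every shape below is hypothetical, chosen to resemble a 7B-class transformer.


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def total_lora_params(hidden: int, layers: int, adapted_per_layer: int, rank: int) -> int:
    """Total adapter parameters when square (hidden x hidden) projections are adapted."""
    return layers * adapted_per_layer * lora_params(hidden, hidden, rank)


HIDDEN = 4096              # hypothetical hidden size
LAYERS = 32                # hypothetical number of transformer blocks
ADAPTED_PER_LAYER = 2      # hypothetical: two attention projections per block
RANK = 8                   # hypothetical LoRA rank
BASE_PARAMS = 7_000_000_000

trainable = total_lora_params(HIDDEN, LAYERS, ADAPTED_PER_LAYER, RANK)
print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Fraction of a 7B base model: {trainable / BASE_PARAMS:.4%}")
```

Under these assumptions the adapters amount to a few million parameters, far fewer than the 110M trained in full here. This counts trainable parameters only: the frozen base weights still have to be stored and run through the forward and backward passes, so memory and throughput remain the practical limits.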
Why Train on NPUs?
The primary motivation is power efficiency. At a reported 6.6 TFLOPS/watt, the ANE delivers several times the compute per watt of the GPU figures cited above (about 1 TFLOPS/watt for the Metal GPU and 1.4 for the H100), which is particularly valuable for edge computing and energy-conscious development.
Available Resources
- Reverse Engineering documentation
- Benchmark results
- Training implementation (Work in Progress)
- GitHub repository with code
The project demonstrates that Apple's Neural Engine, typically treated as a black box, can be accessed directly for custom AI training workflows, offering developers an alternative to GPU-based training with superior power efficiency.
📖 Read the full source: r/LocalLLaMA