How to Optimize the Apple Neural Engine: 55 Experiments Show Kernel Fusion Beats Hyperparameter Tuning

A developer conducted 55 optimization experiments on the autoresearch-ane fork, primarily steering the process from their phone on a Saturday. The work focused on Apple Neural Engine (ANE) performance improvements through kernel optimization and architectural changes.

Performance Improvements

The experiments yielded measurable gains across several metrics:

Validation loss decreased from 3.75 (itself a regression from an earlier optimized 3.2) to 2.49
Step time improved from 176 ms to 96 ms (roughly 45% less time per step)
ANE utilization increased from 3.6% to 6.5%
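
To put the figures above on a common scale, the short calculation below converts each before/after pair into a ratio or a relative change. It uses only the numbers quoted in this summary; the variable and function names are illustrative, and the source does not say how ANE utilization was measured.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before


# Figures quoted in the summary above.
loss_before, loss_after = 3.75, 2.49
step_before_ms, step_after_ms = 176.0, 96.0
util_before_pct, util_after_pct = 3.6, 6.5

loss_drop = relative_reduction(loss_before, loss_after)
step_drop = relative_reduction(step_before_ms, step_after_ms)
step_speedup = step_before_ms / step_after_ms
util_ratio = util_after_pct / util_before_pct

print(f"validation loss reduction: {loss_drop:.1%}")
print(f"step time reduction:       {step_drop:.1%}")
print(f"step speedup:              {step_speedup:.2f}x")
print(f"utilization ratio:         {util_ratio:.2f}x")
```

The step-time speedup and the utilization ratio come out close to each other (both about 1.8x), which would be consistent with similar work per step finishing in less time, though that reading is an inference rather than something the source states.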

Key Technical Change

The most significant improvement came from kernel fusion: "Fusing 3 ANE kernels into 1 mega-kernel eliminated 12 IOSurface round-trips per step - that single change beat every hyperparameter tweak combined." In other words, restructuring how work is dispatched to the ANE mattered more than any parameter adjustment.
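
The idea behind fusion can be illustrated with a toy model. The sketch below is not ANE or IOSurface code: `FakeSurface`, `scale`, `shift`, and `relu` are hypothetical stand-ins, and the model simply assumes that every separate kernel dispatch costs one buffer round-trip. It shows that chaining three kernels pays that cost three times, while a single fused kernel computes the same result for one round-trip.

```python
class FakeSurface:
    """Hypothetical stand-in for a shared buffer; counts round-trips only."""

    def __init__(self) -> None:
        self.round_trips = 0

    def dispatch(self, kernel, data):
        # Assumption: one dispatch = write inputs, run kernel, read outputs.
        self.round_trips += 1
        return kernel(data)


# Three small "kernels" (plain Python functions for illustration).
def scale(xs):
    return [2.0 * x for x in xs]


def shift(xs):
    return [x + 1.0 for x in xs]


def relu(xs):
    return [max(0.0, x) for x in xs]


def fused(xs):
    # Same math as scale -> shift -> relu, but intermediates never
    # leave the kernel, so no extra buffer traffic is needed.
    return relu(shift(scale(xs)))


def run_unfused(surface, xs):
    for kernel in (scale, shift, relu):
        xs = surface.dispatch(kernel, xs)
    return xs


def run_fused(surface, xs):
    return surface.dispatch(fused, xs)


data = [-1.5, 0.0, 2.0]
unfused_surface, fused_surface = FakeSurface(), FakeSurface()
unfused_out = run_unfused(unfused_surface, data)
fused_out = run_fused(fused_surface, data)

print(unfused_out == fused_out)       # same numerical result
print(unfused_surface.round_trips)    # round-trips for three dispatches
print(fused_surface.round_trips)      # round-trips for one fused dispatch
```

In this toy model each fused call saves two round-trips. How the 12 eliminated round-trips per step break down in the real project is not described in the source, so the sketch makes no attempt to reproduce that number.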

Workflow Details

The developer used an unconventional approach:

Ran experiments remotely, steering from their phone in brief moments
Used Claude for brainstorming and pulling insights from public sources listed in the repository README
Approached the problem with "short attention and minimal token input" - speculating on directions rather than dictating precise steps
Completed 55 experiments with only "several cases of actual typing"
Worked in non-destructive mode only due to permission constraints ("no rm -rf /* and such")

Main Learning

Beyond the technical improvements, the developer noted: "Main learning isn't the improvement itself. It's that short attention and minimal token input - brainstorming direction, not dictating steps - can produce real measurable gains on a hard systems problem."

The work ran on the developer's laptop. The developer also flags an unexplained discrepancy in the acceptance rate of the experiments, describing it as "55vs45 not quite mathing."

📖 Read the full source: r/LocalLLaMA