TRELLIS.2 Image-to-3D Ported to Run Natively on Apple Silicon

✍️ OpenClawRadar · 📅 Published: April 20, 2026

What This Is

A port of Microsoft's TRELLIS.2 image-to-3D model that runs natively on Apple Silicon via PyTorch MPS, replacing CUDA-only dependencies with pure-PyTorch alternatives.

Key Details

The original TRELLIS.2 requires CUDA along with flash_attn, nvdiffrast, and custom sparse convolution kernels, none of which run on a Mac. This port replaces them with:

  • A gather-scatter sparse 3D convolution implementation (backends/conv_none.py)
  • SDPA attention for sparse transformers using PyTorch's scaled_dot_product_attention
  • Python-based mesh extraction replacing CUDA hashmap operations (backends/mesh_extract.py)

Total changes are a few hundred lines across 9 files. All hardcoded .cuda() calls were patched to use the active device instead.
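As a rough illustration of the device patching and the SDPA swap described above, the sketch below picks whichever backend is available and runs PyTorch's scaled_dot_product_attention on it. The pick_device helper and the tensor shapes are assumptions made for this example, not code taken from the port.

```python
import torch
import torch.nn.functional as F


def pick_device() -> torch.device:
    """Hypothetical helper: prefer CUDA, then Apple's MPS backend, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()

# Instead of tensor.cuda(), create or move tensors on the active device.
# Shapes are (batch, heads, tokens, head_dim) and are purely illustrative.
q = torch.randn(1, 4, 16, 32, device=device)
k = torch.randn(1, 4, 16, 32, device=device)
v = torch.randn(1, 4, 16, 32, device=device)

# Built-in attention that needs no flash_attn install.
out = F.scaled_dot_product_attention(q, k, v)
print(device, tuple(out.shape))
```

The same code path then runs unchanged on CUDA, MPS, or CPU, which is the point of replacing hardcoded .cuda() calls.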

Performance & Requirements

On an M4 Pro (24GB), the port generates meshes with roughly 400K vertices from a single photo in about 3.5 minutes. Memory usage peaks at around 18GB of unified memory during generation.

Requirements:

  • macOS on Apple Silicon (M1 or later)
  • Python 3.11+
  • 24GB+ unified memory recommended
  • ~15GB disk space for model weights
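A quick preflight check along these lines can confirm most of the requirements before downloading weights. This is a minimal sketch using only the standard library; the check_requirements name and its thresholds are assumptions for illustration, and it does not check unified memory.

```python
import platform
import shutil
import sys


def check_requirements(path=".", min_free_gb=15.0):
    """Return a dict mapping each requirement to True/False for this machine."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        # A native arm64 Python on macOS reports "Darwin" and "arm64";
        # an x86_64 Python running under Rosetta will fail this check.
        "apple_silicon": platform.system() == "Darwin"
        and platform.machine() == "arm64",
        "python_3_11_plus": sys.version_info >= (3, 11),
        "disk_space": free_gb >= min_free_gb,
    }


for name, ok in check_requirements().items():
    print(f"{name}: {'ok' if ok else 'not met'}")
```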

Setup & Usage

Quick start:

git clone https://github.com/shivampkumar/trellis-mac.git
cd trellis-mac
hf auth login
bash setup.sh
source .venv/bin/activate
python generate.py path/to/image.png

You need to request access to two gated models on Hugging Face: facebook/dinov3-vitl16-pretrain-lvd1689m and briaai/RMBG-2.0.

Basic usage:

python generate.py photo.png
python generate.py photo.png --seed 123 --output my_model --pipeline-type 512
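For processing a folder of photos, the commands above could be wrapped in a small script. The sketch below only uses the flags shown in this article (--seed, --output, --pipeline-type); the build_command and run_batch helpers are hypothetical, and it assumes --output accepts a plain name such as my_model.

```python
import subprocess
import sys
from pathlib import Path


def build_command(image, seed=None, output=None, pipeline_type=None):
    """Build the generate.py argument list; optional flags are added only if set."""
    cmd = [sys.executable, "generate.py", str(image)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if output is not None:
        cmd += ["--output", str(output)]
    if pipeline_type is not None:
        cmd += ["--pipeline-type", str(pipeline_type)]
    return cmd


def run_batch(folder, seed=123, pipeline_type=512):
    """Run generate.py once per PNG in folder (call from inside the repo)."""
    for image in sorted(Path(folder).glob("*.png")):
        # Assumption: the image's base name is a valid --output value.
        cmd = build_command(image, seed, image.stem, pipeline_type)
        subprocess.run(cmd, check=True)


print(build_command("photo.png", seed=123, output="my_model", pipeline_type=512)[1:])
```

Calling run_batch("photos") with the virtual environment active would then generate one model per image, sequentially.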

Limitations

  • No texture export (meshes export with vertex colors only)
  • Hole filling disabled (meshes may have small holes)
  • Slower than CUDA (~10x slower for sparse convolution)
  • Inference only, no training support

Technical Implementation

The sparse 3D convolution builds a spatial hash of active voxels, gathers neighbor features for each kernel position, applies weights via matrix multiplication, and scatter-adds results back. Mesh extraction reimplements flexible_dual_grid_to_mesh using Python dictionaries instead of CUDA hashmap operations.
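The gather-scatter steps described above can be sketched in plain Python. This is a toy illustration rather than the code in backends/conv_none.py: the sparse_conv3d name, the list-based data layout, and the choice to produce outputs only at active voxels are all assumptions made for the example.

```python
from itertools import product


def sparse_conv3d(coords, feats, weights, kernel_size=3):
    """Toy gather-scatter sparse 3D convolution.

    coords:  list of (x, y, z) integer tuples for active voxels
    feats:   one feature vector (length C_in) per active voxel
    weights: dict mapping a kernel offset (dx, dy, dz) to a C_in x C_out matrix
    """
    index = {c: i for i, c in enumerate(coords)}  # spatial hash of active voxels
    c_in = len(feats[0])
    c_out = len(next(iter(weights.values()))[0])
    out = [[0.0] * c_out for _ in coords]
    r = kernel_size // 2

    for offset in product(range(-r, r + 1), repeat=3):
        w = weights.get(offset)
        if w is None:
            continue
        dx, dy, dz = offset
        # Find (input voxel, output voxel) pairs linked by this kernel offset.
        pairs = []
        for out_i, (x, y, z) in enumerate(coords):
            in_i = index.get((x + dx, y + dy, z + dz))
            if in_i is not None:
                pairs.append((in_i, out_i))
        for in_i, out_i in pairs:
            f = feats[in_i]  # gather the neighbor's features
            for o in range(c_out):
                # multiply by this offset's weights, then scatter-add
                out[out_i][o] += sum(f[c] * w[c][o] for c in range(c_in))
    return out


coords = [(0, 0, 0), (1, 0, 0)]
feats = [[1.0], [2.0]]
weights = {(0, 0, 0): [[1.0]], (1, 0, 0): [[10.0]]}
print(sparse_conv3d(coords, feats, weights))  # [[21.0], [2.0]]
```

A tensor-based version would batch each offset's pairs into a single gather, one matrix multiplication, and one scatter-add, which is where the remaining cost relative to fused CUDA kernels comes from.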

Benchmarks on M4 Pro (24GB), pipeline type 512:

  • Model loading: ~45s
  • Image preprocessing: ~5s
  • Sparse structure sampling: ~15s
  • Shape SLat sampling: ~90s
  • Texture SLat sampling: ~50s
  • Mesh decoding: ~30s
  • Total: ~3.5 min

📖 Read the full source: HN LLM Tools
