TRELLIS.2 Ported to Apple Silicon: 3D from One Photo in 3.5 Min

What This Is

A port of Microsoft's TRELLIS.2 image-to-3D model that runs natively on Apple Silicon via PyTorch MPS, replacing CUDA-only dependencies with pure-PyTorch alternatives.

Key Details

The original TRELLIS.2 requires CUDA, along with flash_attn, nvdiffrast, and custom sparse convolution kernels, none of which run on a Mac. This port replaces them with:

A gather-scatter sparse 3D convolution implementation (backends/conv_none.py)
Scaled dot-product attention (SDPA) for sparse transformers, via PyTorch's scaled_dot_product_attention
Python-based mesh extraction replacing CUDA hashmap operations (backends/mesh_extract.py)
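
For reference, the quantity that scaled_dot_product_attention computes with its default arguments (no mask, no dropout, scale 1/sqrt(d)) can be written out in plain Python. This is an illustrative sketch rather than code from the port: the name sdpa_reference is hypothetical, and it does not show how the port packs variable-length sparse token sequences before calling PyTorch.

```python
import math


def sdpa_reference(q, k, v):
    """Compute softmax(q @ k^T / sqrt(d)) @ v on lists of row vectors.

    Hypothetical helper for illustration only; not part of the port.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q_row in q:
        # Scaled similarity of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Attention output: weighted sum of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out
```

On tensors, torch.nn.functional.scaled_dot_product_attention(q, k, v) computes the same quantity when called with its defaults.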

The changes total a few hundred lines across 9 files. All hardcoded .cuda() calls were patched to use the active device instead.
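
The device patch can be pictured as follows. This is a minimal sketch of the general pattern, assuming a simple CUDA, then MPS, then CPU preference order; the helper names pick_device_name, resolve_device, and demo are hypothetical and are not taken from the port.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then CPU (assumed order)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def resolve_device():
    """Ask PyTorch which backends exist; torch is imported lazily."""
    import torch

    return torch.device(
        pick_device_name(
            torch.cuda.is_available(),
            torch.backends.mps.is_available(),
        )
    )


def demo():
    """Device-agnostic replacements for hardcoded .cuda() calls."""
    import torch

    device = resolve_device()
    x = torch.zeros(4, 3).to(device)       # was: torch.zeros(4, 3).cuda()
    y = torch.ones(4, 3, device=x.device)  # new tensors follow an existing one
    return (x + y).sum().item()
```

Creating new tensors with device=x.device, rather than naming a backend, keeps the same code path working on CUDA, MPS, and CPU.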

Performance & Requirements

On an M4 Pro (24GB), the port generates meshes of roughly 400K vertices from a single photo in about 3.5 minutes. Unified memory usage peaks at around 18GB during generation.

Requirements:

macOS on Apple Silicon (M1 or later)
Python 3.11+
24GB+ unified memory recommended
~15GB disk space for model weights

Setup & Usage

Quick start:

git clone https://github.com/shivampkumar/trellis-mac.git
cd trellis-mac
hf auth login
bash setup.sh
source .venv/bin/activate
python generate.py path/to/image.png

The pipeline uses two gated models on Hugging Face, so you must request access to both: facebook/dinov3-vitl16-pretrain-lvd1689m and briaai/RMBG-2.0.

Basic usage:

python generate.py photo.png
python generate.py photo.png --seed 123 --output my_model --pipeline-type 512
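
To process a folder of images, the command line above can be wrapped in a short script. The sketch below uses only the flags shown in this section; the helper names build_command and run_batch are hypothetical, and using each image's file stem as the --output value is an assumption about how that flag is interpreted.

```python
import subprocess
import sys
from pathlib import Path


def build_command(image, seed=None, output=None, pipeline_type=None):
    """Assemble a generate.py invocation from the flags shown above."""
    cmd = [sys.executable, "generate.py", str(image)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if output is not None:
        cmd += ["--output", str(output)]
    if pipeline_type is not None:
        cmd += ["--pipeline-type", str(pipeline_type)]
    return cmd


def run_batch(image_dir, seed=123, pipeline_type=512):
    """Run generate.py once per PNG in image_dir, stopping on the first failure."""
    for image in sorted(Path(image_dir).glob("*.png")):
        # Assumption: the image's stem is a reasonable value for --output.
        cmd = build_command(image, seed, image.stem, pipeline_type)
        subprocess.run(cmd, check=True)
```

Run it from the repository root with the virtual environment active, for example run_batch("photos").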

Limitations

No texture export (meshes export with vertex colors only)
Hole filling disabled (meshes may have small holes)
Slower than CUDA (sparse convolution is ~10x slower)
Inference only, no training support

Technical Implementation

The sparse 3D convolution builds a spatial hash of active voxels, gathers neighbor features for each kernel position, applies weights via matrix multiplication, and scatter-adds results back. Mesh extraction reimplements flexible_dual_grid_to_mesh using Python dictionaries instead of CUDA hashmap operations.
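
The four convolution steps described above can be sketched in plain Python. This is an illustration of the gather-scatter idea, not the code in backends/conv_none.py: the function name sparse_conv3d, the dict-of-offsets weight layout, and the choice to produce outputs only at already-active voxels are all assumptions made for the example.

```python
def sparse_conv3d(coords, feats, weights, c_out):
    """Gather-scatter sparse 3D convolution over active voxels only.

    coords:  list of (x, y, z) integer tuples, one per active voxel
    feats:   list of input feature vectors; feats[i] belongs to coords[i]
    weights: dict mapping a kernel offset (dx, dy, dz) to a c_in x c_out matrix
    c_out:   number of output channels
    """
    # Step 1: spatial hash from voxel coordinate to row index.
    index_of = {c: i for i, c in enumerate(coords)}
    out = [[0.0] * c_out for _ in coords]

    for (dx, dy, dz), w in weights.items():
        # Step 2: gather (output row, neighbor row) pairs for this kernel
        # position, skipping neighbors that are not active.
        pairs = []
        for i, (x, y, z) in enumerate(coords):
            j = index_of.get((x + dx, y + dy, z + dz))
            if j is not None:
                pairs.append((i, j))

        # Steps 3 and 4: multiply each gathered feature vector by this
        # offset's weight matrix and scatter-add into the output row.
        for i, j in pairs:
            f = feats[j]
            for o in range(c_out):
                out[i][o] += sum(f[c] * w[c][o] for c in range(len(f)))
    return out
```

In a tensor-based version, steps 3 and 4 would typically become one matrix multiplication over the gathered rows followed by an index_add_ into the output tensor, which is where the per-kernel-position cost accumulates.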

Benchmarks on M4 Pro (24GB), pipeline type 512: