Trellis 2 on AMD RX 9070 XT with ROCm 7.11: Setup Guide

Getting Trellis 2 Running on AMD Hardware

A developer has successfully run Trellis 2 on an AMD RX 9070 XT GPU using ROCm 7.11 on Linux Mint 22.3. The setup works around the geometry cutoff, preview failures, and other errors that users commonly hit when running Trellis 2 on AMD hardware.

Key Issues and Solutions

The developer identified two main problems that were causing most failures:

1. ROCm Instability with Large-N Tensors

On ROCm, the linear layer becomes unstable when the feature tensor has a very large number of rows (N), producing overflows or NaN values. The original forward in linear.py in the sparse folder was:

def forward(self, input: VarLenTensor) -> VarLenTensor:
    return input.replace(super().forward(input.feats))

The fix processes the rows in chunks of at most 524,288, so that no single F.linear call sees a large N:

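import torch
import torch.nn.functional as F

# Largest row count passed to a single F.linear call (value from the fix)
ROCM_SAFE_CHUNK = 524_288

def rocm_safe_linear(feats: torch.Tensor, weight: torch.Tensor, bias=None) -> torch.Tensor:
    """F.linear with ROCm large-N chunking workaround."""
    N = feats.shape[0]
    if N <= ROCM_SAFE_CHUNK:
        return F.linear(feats, weight, bias)
    # Preallocate the (N, out_features) output, then fill it chunk by chunk.
    # This assumes feats is 2D with shape (N, in_features).
    out = torch.empty(N, weight.shape[0], device=feats.device, dtype=feats.dtype)
    for s in range(0, N, ROCM_SAFE_CHUNK):
        e = min(s + ROCM_SAFE_CHUNK, N)
        out[s:e] = F.linear(feats[s:e], weight, bias)
    return out

# Replaces the forward method of the sparse linear class (class body omitted)
def forward(self, input):
    # Accept either a VarLenTensor (has .feats) or a plain torch.Tensor
    feats = input.feats if hasattr(input, 'feats') else input
    out = rocm_safe_linear(feats, self.weight, self.bias)
    if hasattr(input, 'replace'):
        return input.replace(out)
    return out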
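
Chunking is safe here because a linear layer computes each output row from its own input row only, so splitting along N cannot change the result. The sketch below illustrates that with plain Python lists instead of torch tensors, so it runs without a GPU. The helpers chunk_bounds, linear_rows, and chunked_linear_rows are hypothetical names introduced for illustration and are not part of Trellis 2.

```python
from typing import Iterator, List, Tuple

ROCM_SAFE_CHUNK = 524_288  # same constant as in the fix

Matrix = List[List[float]]


def chunk_bounds(n: int, chunk: int = ROCM_SAFE_CHUNK) -> Iterator[Tuple[int, int]]:
    """Yield the (start, end) row ranges visited by the loop in rocm_safe_linear."""
    for s in range(0, n, chunk):
        yield s, min(s + chunk, n)


def linear_rows(rows: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Pure-Python y = x @ W^T + b; every output row depends on one input row."""
    return [
        [sum(x * w for x, w in zip(row, w_row)) + b for w_row, b in zip(weight, bias)]
        for row in rows
    ]


def chunked_linear_rows(rows: Matrix, weight: Matrix, bias: List[float], chunk: int) -> Matrix:
    """Same computation, filled in chunk by chunk like the ROCm workaround."""
    out: Matrix = [[] for _ in rows]
    for s, e in chunk_bounds(len(rows), chunk):
        out[s:e] = linear_rows(rows[s:e], weight, bias)
    return out


# Row ranges for a hypothetical N of 1,200,000: two full chunks and one partial.
bounds = list(chunk_bounds(1_200_000))

# Small made-up example: 10 rows, 2 input features, 3 output features, chunk of 4.
rows = [[float(i), float(i) * 0.5] for i in range(10)]
weight = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
bias = [0.5, -0.5, 0.0]
full = linear_rows(rows, weight, bias)
chunked = chunked_linear_rows(rows, weight, bias, chunk=4)
```

The last range in bounds is shorter than the others, which is why the loop clamps the end index with min(s + chunk, N).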

2. Broken hipMemcpy2D in CuMesh

CuMesh copies its input with cudaMemcpy2D, which becomes hipMemcpy2D when the code is built for ROCm. On this setup the 2D copy caused vertices and faces to drop off or become corrupted. The original CuMesh initialization was:

void CuMesh::init(const torch::Tensor& vertices, const torch::Tensor& faces) {
    size_t num_vertices = vertices.size(0);
    size_t num_faces = faces.size(0);
    this->vertices.resize(num_vertices);
    this->faces.resize(num_faces);
    CUDA_CHECK(cudaMemcpy2D(
        this->vertices.ptr,
        sizeof(float3),
        vertices.data_ptr(),
        sizeof(float) * 3,
        sizeof(float) * 3,
        num_vertices,
        cudaMemcpyDeviceToDevice
    ));
    ...
}

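The fix replaces the 2D copy with a plain 1D copy. The original call uses the same value for the source pitch and the row width (sizeof(float) * 3), and the destination pitch is sizeof(float3). Provided sizeof(float3) also equals sizeof(float) * 3 (12 bytes), which the 1D copy relies on, the vertex data is contiguous on both sides and a single copy moves the same bytes: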

CUDA_CHECK(cudaMemcpy(
    this->vertices.ptr,
    vertices.data_ptr(),
    num_vertices * sizeof(float3),
    cudaMemcpyDeviceToDevice
));
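
To see why the 1D copy is equivalent, the following host-side Python model mimics the row-by-row semantics of a 2D copy on ordinary byte buffers. It is only an illustration of the pitch arithmetic, not HIP or CUDA code. The memcpy2d function and the FLOAT3_BYTES constant are hypothetical, and the 12-byte size is an assumption about sizeof(float3).

```python
import struct


def memcpy2d(dst: bytearray, dpitch: int, src: bytes, spitch: int,
             width: int, height: int) -> None:
    """Model of a 2D copy: `height` rows of `width` bytes each, with rows
    starting every `spitch` bytes in src and every `dpitch` bytes in dst."""
    for row in range(height):
        d = row * dpitch
        s = row * spitch
        dst[d:d + width] = src[s:s + width]


FLOAT3_BYTES = 12  # three 32-bit floats; assumed to match sizeof(float3)

num_vertices = 5
values = [float(i) for i in range(3 * num_vertices)]
src = struct.pack(f"{3 * num_vertices}f", *values)

# 2D copy with pitch == width on both sides, as in the original init()
dst_2d = bytearray(len(src))
memcpy2d(dst_2d, FLOAT3_BYTES, src, FLOAT3_BYTES, FLOAT3_BYTES, num_vertices)

# 1D copy of num_vertices * sizeof(float3) bytes, as in the fix
dst_1d = bytearray(src[:num_vertices * FLOAT3_BYTES])

same = bytes(dst_2d) == bytes(dst_1d)
```

When the pitches match the row width there is no padding between rows, so the two copies produce identical buffers. A 2D copy is only required when one side has padded rows.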

Results and Performance

With these fixes, the developer successfully got the image-to-3D pipeline working, including preview rendering (without normals) and final GLB export. On a test image with 21,204 tokens, the process took approximately 280 seconds from start to preview generation. The run used 1024 resolution with all samplers set to 20 steps.

📖 Read the full source: r/LocalLLaMA