Trellis 2 Successfully Running on ROCm 7.11 with AMD RX 9070 XT

Getting Trellis 2 Running on AMD Hardware
A developer has successfully run Trellis 2 on an AMD RX 9070 XT GPU using ROCm 7.11 on Linux Mint 22.3. This addresses common issues where users encountered geometry cutoff, preview failures, and other errors when attempting to run Trellis 2 on AMD hardware.
Key Issues and Solutions
The developer identified two main problems that were causing most failures:
1. ROCm Instability with High N Tensors
ROCm operations become unstable on tensors with a very large first dimension N (the number of rows), producing overflows or NaN values. The original code in linear.py in the sparse folder used:

    def forward(self, input: VarLenTensor) -> VarLenTensor:
        return input.replace(super().forward(input.feats))

The fix implements chunked processing to avoid ROCm issues:

    import torch
    import torch.nn.functional as F

    # Maximum number of rows passed to a single F.linear call.
    ROCM_SAFE_CHUNK = 524_288

    def rocm_safe_linear(feats: torch.Tensor, weight: torch.Tensor, bias=None) -> torch.Tensor:
        """F.linear with ROCm large-N chunking workaround."""
        N = feats.shape[0]
        if N <= ROCM_SAFE_CHUNK:
            return F.linear(feats, weight, bias)
        # Preallocate the output, then fill it one row chunk at a time.
        out = torch.empty(N, weight.shape[0], device=feats.device, dtype=feats.dtype)
        for s in range(0, N, ROCM_SAFE_CHUNK):
            e = min(s + ROCM_SAFE_CHUNK, N)
            out[s:e] = F.linear(feats[s:e], weight, bias)
        return out

    # Replacement for the forward() method shown above.
    def forward(self, input):
        # Accept either a sparse tensor wrapper (with .feats) or a plain tensor.
        feats = input.feats if hasattr(input, 'feats') else input
        out = rocm_safe_linear(feats, self.weight, self.bias)
        if hasattr(input, 'replace'):
            return input.replace(out)
        return out

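Chunking along N should not change the result, because a linear layer transforms each row independently of the others. The sketch below illustrates that property in plain Python with integer lists rather than PyTorch tensors. The names chunk_bounds, linear, and chunked_linear are hypothetical helpers written for this illustration and are not part of Trellis 2 or the developer's patch.

```python
from typing import List, Optional, Tuple

Matrix = List[List[int]]


def chunk_bounds(n: int, chunk: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) row ranges that cover 0..n in order."""
    return [(s, min(s + chunk, n)) for s in range(0, n, chunk)]


def linear(rows: Matrix, weight: Matrix, bias: Optional[List[int]] = None) -> Matrix:
    """Row-wise y = x @ weight.T + bias, with weight shaped [out, in]."""
    out = []
    for x in rows:
        y = [sum(xi * wi for xi, wi in zip(x, w)) for w in weight]
        if bias is not None:
            y = [yi + bi for yi, bi in zip(y, bias)]
        out.append(y)
    return out


def chunked_linear(rows: Matrix, weight: Matrix,
                   bias: Optional[List[int]] = None, chunk: int = 4) -> Matrix:
    """Apply linear() to one slice of rows at a time and concatenate the results."""
    out: Matrix = []
    for s, e in chunk_bounds(len(rows), chunk):
        out.extend(linear(rows[s:e], weight, bias))
    return out


if __name__ == "__main__":
    rows = [[i, i + 1] for i in range(10)]   # 10 rows, 2 input features
    weight = [[1, 0], [0, 1], [1, 1]]        # 3 output features
    bias = [1, 2, 3]
    print(chunk_bounds(len(rows), 4))
    print(chunked_linear(rows, weight, bias, chunk=4) == linear(rows, weight, bias))
```

The last chunk is allowed to be shorter than the others, which mirrors the min(s + ROCM_SAFE_CHUNK, N) bound in the patch. Integers are used here so the comparison is exact.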
2. Broken hipMemcpy2D in CuMesh
The hipMemcpy2D function in CuMesh was causing vertices and faces to drop off or become corrupted. The original CuMesh initialization used:

    void CuMesh::init(const torch::Tensor& vertices, const torch::Tensor& faces) {
        size_t num_vertices = vertices.size(0);
        size_t num_faces = faces.size(0);
        this->vertices.resize(num_vertices);
        this->faces.resize(num_faces);
        CUDA_CHECK(cudaMemcpy2D(
            this->vertices.ptr,
            sizeof(float3),
            vertices.data_ptr(),
            sizeof(float) * 3,
            sizeof(float) * 3,
            num_vertices,
            cudaMemcpyDeviceToDevice
        ));
        ...
    }

Assuming sizeof(float3) equals sizeof(float) * 3, the destination pitch, source pitch, and row width in this call are all the same, so the copy covers one contiguous block of memory and does not need the 2D API. The fix replaces the 2D copy with a 1D version:

    CUDA_CHECK(cudaMemcpy(
        this->vertices.ptr,
        vertices.data_ptr(),
        num_vertices * sizeof(float3),
        cudaMemcpyDeviceToDevice
    ));

Results and Performance
With these fixes, the developer successfully got the image-to-3D pipeline working, including preview rendering (without normals) and final GLB export. On a test image with 21,204 tokens, the process took approximately 280 seconds from start to preview generation. The run used 1024 resolution with all samplers set to 20 steps.
📖 Read the full source: r/LocalLLaMA