4 vLLM Failures on aarch64 Blackwell (GB10) with CUDA 13.0

Setup and environment

The setup: GB10 hardware on aarch64 (sbsa-linux), Python 3.12, CUDA 13.0, and vLLM v0.7.1. All four issues came up in a test environment that is reset daily, so every install starts from a clean system. They are specific to aarch64 with CUDA 13.0.

Failure mode 1: cu121 wheel doesn't exist for aarch64

Installing torch with --index-url .../cu121 fails with: ERROR: Could not find a version that satisfies the requirement torch (from versions: none). The cu121 index has no aarch64 binaries. The correct index for Blackwell aarch64 is cu130:

sudo pip3 install --pre torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/nightly/cu130 \
--break-system-packages
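If you script this step, a small helper can derive the index URL from the CUDA version instead of hardcoding it. This is a sketch under one assumption: that the nightly index keeps PyTorch's cuMAJORMINOR naming (12.1 → cu121, 13.0 → cu130). torch_nightly_index is a hypothetical name, and it only builds the URL; it cannot tell you whether that index actually carries aarch64 wheels, which is exactly what went wrong with cu121.

```shell
# Hypothetical helper: map a CUDA version ("13.0") to PyTorch's nightly index URL.
# Assumes the cuMAJORMINOR naming convention (12.1 -> cu121, 13.0 -> cu130).
torch_nightly_index() {
  local cuda_version=$1 major rest minor
  case $cuda_version in
    *.*) ;;
    *) echo "expected MAJOR.MINOR, got '$cuda_version'" >&2; return 1 ;;
  esac
  major=${cuda_version%%.*}   # "13.0.1" -> "13"
  rest=${cuda_version#*.}     # "13.0.1" -> "0.1"
  minor=${rest%%.*}           # "0.1"    -> "0"
  printf 'https://download.pytorch.org/whl/nightly/cu%s%s\n' "$major" "$minor"
}
```

Usage: pass its output to pip, e.g. --index-url "$(torch_nightly_index 13.0)" in place of the literal URL.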

Failure mode 2: ncclWaitSignal undefined symbol

After installing the cu130 torch, import torch fails with: ImportError: libtorch_cuda.so: undefined symbol: ncclWaitSignal. The apt-installed NCCL doesn't export this symbol, but the pip-installed nvidia-nccl-cu13 does. The dynamic linker doesn't pick up the pip copy on its own, so libtorch_cuda.so ends up bound to an NCCL that lacks ncclWaitSignal.

Fix: force the pip copy via LD_PRELOAD before every Python call:

export LD_PRELOAD=/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2
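Hardcoding the dist-packages path breaks silently when the Python version or install prefix changes. A minimal sketch that checks the pip copy exists before preloading it; preload_pip_nccl is a hypothetical name, and its default site-packages path is simply the one from this setup, not something the function discovers.

```shell
# Hypothetical helper: preload pip's libnccl.so.2 (from nvidia-nccl-cu13).
# The default site-packages path is an assumption taken from this setup.
preload_pip_nccl() {
  local site=${1:-/usr/local/lib/python3.12/dist-packages}
  local lib="$site/nvidia/nccl/lib/libnccl.so.2"
  if [ ! -e "$lib" ]; then
    echo "not found: $lib (is nvidia-nccl-cu13 installed?)" >&2
    return 1
  fi
  # glibc accepts ':' (or spaces) between LD_PRELOAD entries; keep existing ones.
  export LD_PRELOAD="$lib${LD_PRELOAD:+:$LD_PRELOAD}"
}
```

Run it in the same shell before starting Python. To confirm this is the copy that exports the missing symbol, nm -D --defined-only /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2 | grep ncclWaitSignal should print a match, while the same check against the apt-installed libnccl.so.2 should come back empty.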

Failure mode 3: numa.h not found during vLLM CPU extension build

The build stops with: fatal error: numa.h: No such file or directory. vLLM's CPU extension needs the NUMA headers from libnuma-dev, which the freshly reset system didn't have:

sudo apt-get install -y libnuma-dev
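Since the machine resets daily, it can pay to check for the header before kicking off a long build instead of finding out midway. A sketch; check_numa_header is a hypothetical name, and it assumes the header lives in /usr/include, which is where the Debian/Ubuntu libnuma-dev package installs numa.h.

```shell
# Hypothetical pre-flight check: fail fast if numa.h is missing.
# Default include dir is an assumption (Debian/Ubuntu libnuma-dev location).
check_numa_header() {
  local include_dir=${1:-/usr/include}
  if [ -f "$include_dir/numa.h" ]; then
    return 0
  fi
  echo "numa.h missing under $include_dir; run: sudo apt-get install -y libnuma-dev" >&2
  return 1
}
```

Call check_numa_header || exit 1 at the top of a build script so the failure shows up in seconds rather than partway through compilation.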

Failure mode 4: ABI mismatch — MessageLogger undefined symbol

After completing the full build, launching vLLM fails with: ImportError: vllm/_C.abi3.so: undefined symbol: _ZN3c1013MessageLoggerC1EPKciib.

Diagnosis with nm (U marks a symbol the binary imports, T one the library defines) shows:

What the vLLM binary expects (old signature): U _ZN3c1013MessageLoggerC1EPKciib ← (const char*, int, int, bool)
What the cu130 torch library actually provides (new signature): T _ZN3c1013MessageLoggerC1ENS_14SourceLocationEib ← (SourceLocation, int, bool)
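The same comparison works for any symbol, not just MessageLogger. A sketch: missing_symbols is a hypothetical helper that takes saved nm -D output for the extension and for a torch library, and lists symbols the extension imports (U) that the library doesn't define. It relies on nm's usual layout, where defined symbols have three columns (address, type, name) and undefined ones two. Comparing against libc10.so is an assumption: that is where c10:: symbols such as MessageLogger normally live.

```shell
# Hypothetical helper: symbols imported by an extension but not defined by a library.
# $1: nm -D output of the extension, $2: nm -D output of the library,
# $3: optional extended regex to filter symbol names.
missing_symbols() {
  awk -v pat="${3:-.}" '
    # First file operand is the library: remember every defined symbol.
    FILENAME == ARGV[1] { if (NF == 3) have[$3] = 1; next }
    # Extension: undefined symbols appear as "U name" with no address column.
    NF == 2 && $1 == "U" && $2 ~ pat && !($2 in have) { print $2 }
  ' "$2" "$1" | sort -u
}
```

Usage from the vLLM checkout: nm -D vllm/_C.abi3.so > ext.txt, nm -D /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so > c10.txt, then missing_symbols ext.txt c10.txt MessageLogger. Piping a mangled name through c++filt shows the readable form; the old one demangles to c10::MessageLogger::MessageLogger(char const*, int, int, bool).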

Root cause: pip's build isolation. On pip install -e ., pip creates an isolated build environment and installs a separate, older torch to satisfy the version constraints in pyproject.toml. vLLM compiles against that older torch's headers, but at runtime it loads the newer cu130 torch, hence the signature mismatch.
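To see which torch an isolated build would pull in, you can read the [build-system] requires list in vLLM's pyproject.toml. This is a rough awk sketch, not a TOML parser: build_torch_pin is a hypothetical name, and it assumes requirements are double-quoted strings and that the [build-system] header sits alone on its line.

```shell
# Hypothetical helper: print torch requirement(s) from [build-system] in pyproject.toml.
# Rough parsing: only double-quoted strings, no TOML escapes or single quotes.
build_torch_pin() {
  awk '
    /^\[/ { in_bs = ($0 ~ /^\[build-system\][ \t]*$/) }   # track current TOML table
    in_bs {
      line = $0
      while (match(line, /"[^"]*"/)) {                   # each quoted string on the line
        req = substr(line, RSTART + 1, RLENGTH - 2)
        # keep "torch..." but not "torchvision" and friends
        if (req == "torch" || req ~ /^torch[^A-Za-z0-9_.-]/) print req
        line = substr(line, RSTART + RLENGTH)
      }
    }
  ' "$1"
}
```

Run build_torch_pin pyproject.toml from the vLLM checkout and compare it with python3 -c 'import torch; print(torch.__version__)' (with LD_PRELOAD set as in failure mode 2). If the two differ, an isolated build compiles against different headers than the ones loaded at runtime.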

Fix: build with --no-build-isolation so vLLM compiles against the installed cu130 torch, and inject the environment explicitly into pip's subprocesses:

sudo -E env \
LD_PRELOAD="/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2" \
LD_LIBRARY_PATH="/usr/local/lib/python3.12/dist-packages/torch/lib:..." \
MAX_JOBS=8 \
pip3 install -e . --no-deps --no-build-isolation --break-system-packages

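Important detail: sudo -E alone doesn't work, because LD_PRELOAD never reaches pip's subprocess chain (sudo's default policy strips LD_* variables, even with -E). Running sudo -E env VAR=value pip3 sets the variables after sudo, in pip's own environment, so every subprocess pip spawns inherits them.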
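If you run several commands this way (build, verification, server launch), a wrapper keeps the injected variables in one place. A sketch: run_build_env, SUDO, NCCL_LIB and TORCH_LIB are hypothetical names, the defaults are the paths from this setup, and instead of guessing the remaining LD_LIBRARY_PATH entries it appends whatever LD_LIBRARY_PATH already holds.

```shell
# Hypothetical wrapper for the "sudo -E env VAR=value cmd" pattern.
# Defaults are assumptions from this setup; override via the environment.
NCCL_LIB=${NCCL_LIB-/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2}
TORCH_LIB=${TORCH_LIB-/usr/local/lib/python3.12/dist-packages/torch/lib}
SUDO=${SUDO-sudo -E}

run_build_env() {
  # env runs *after* sudo, so sudo's filtering can't drop these variables,
  # and every process the command spawns (pip, cmake, compilers) inherits them.
  # $SUDO is deliberately unquoted so "sudo -E" splits into two words.
  $SUDO env \
    LD_PRELOAD="$NCCL_LIB" \
    LD_LIBRARY_PATH="$TORCH_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
    MAX_JOBS="${MAX_JOBS:-8}" \
    "$@"
}
```

Usage from the vLLM checkout: run_build_env pip3 install -e . --no-deps --no-build-isolation --break-system-packages. Setting SUDO="" runs the same command without sudo, which is handy for checking what the subprocesses actually see.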

Verify the ABI seal after installation, from the root of the vLLM checkout (the editable install builds _C.abi3.so into vllm/):

nm -D vllm/_C.abi3.so | grep MessageLogger
# Must contain "SourceLocation" — if it still says "EPKciib", reinstall
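For a scripted install, the grep can become a pass/fail gate. A sketch; check_messagelogger_abi is a hypothetical name, and it reads nm -D output on stdin so it doesn't care where the .so lives. It rejects the old signature first, so a binary that somehow references both still fails.

```shell
# Hypothetical gate: succeed only if MessageLogger uses the new SourceLocation ABI.
# Input: `nm -D` output on stdin.
check_messagelogger_abi() {
  local syms
  syms=$(grep 'MessageLogger' || true)
  case $syms in
    *EPKciib*)
      echo "old MessageLogger ABI (const char*, int, int, bool): rebuild with --no-build-isolation" >&2
      return 1 ;;
    *SourceLocation*)
      echo "ok: MessageLogger uses the SourceLocation signature"
      return 0 ;;
    *)
      echo "no MessageLogger symbol in input" >&2
      return 2 ;;
  esac
}
```

Usage: nm -D vllm/_C.abi3.so | check_messagelogger_abi, which exits non-zero unless the new signature is referenced.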

Additional note for multi-agent systems

If you use vLLM as the backend for a multi-agent system, pass --served-model-name your-model-name. Without it, vLLM serves the model under the full path it was loaded from, and agents that request it by name get a 404.
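To confirm the name your agents will use is actually registered, you can query the OpenAI-compatible /v1/models endpoint, which lists served models as data[].id. A sketch: model_is_served is a hypothetical helper that assumes python3 is available for JSON parsing.

```shell
# Hypothetical check: is $1 among the model ids in a /v1/models response (stdin)?
model_is_served() {
  python3 -c '
import json, sys
ids = [m["id"] for m in json.load(sys.stdin).get("data", [])]
if sys.argv[1] not in ids:
    print("served names:", ", ".join(ids), file=sys.stderr)
    sys.exit(1)
' "$1"
}
```

Usage, assuming vLLM's default port 8000: start the server with vllm serve /path/to/model --served-model-name your-model-name, then run curl -s http://localhost:8000/v1/models | model_is_served your-model-name. On failure it prints the names that are actually served, which makes the full-path problem obvious.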

The full v2 protocol, including an automation script and a systemd service, is available at github.com/trgysvc/AutonomousNativeForge → docs/BLACKWELL_SETUP_V2.md. The repo hosts ANF, a 4-agent autonomous coding pipeline that runs on top of this setup, but the setup docs stand alone if you only need the Blackwell/vLLM fixes.

📖 Read the full source: r/LocalLLaMA