STRIX HALO CLUSTER
A complete hardware and software guide for daisy-chaining AMD Ryzen AI Max+ 395 (Strix Halo) systems over RDMA/RoCE v2 to pool LPDDR5X memory for local LLM inference.
HARDWARE
Each node is built around the AMD Ryzen AI Max+ 395, a Strix Halo APU with 16 Zen 5 cores and 40 RDNA 3.5 compute units sharing a unified LPDDR5X memory pool over a 256-bit bus. The cluster interconnects via RDMA over Converged Ethernet (RoCE v2).
Component Requirements
| Component | Specification | Notes |
|---|---|---|
| APU | AMD Ryzen AI Max+ 395 | 16C/32T Zen 5, 40 CU RDNA 3.5 |
| Memory | LPDDR5X-8000 | Up to 128 GB per node, 256-bit bus |
| NIC | Mellanox ConnectX-5 / Intel E810 | 25/100 GbE, RoCE v2 support |
| Adapter | OCuLink PCIe 4.0 x4 | External GPU/NIC connectivity; about 63 Gb/s usable, below 100 GbE line rate |
| Storage | NVMe M.2 Gen 5 | OS + model weights |
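The figures in the table imply very different bandwidths for local memory and for the interconnect. The following is a minimal arithmetic sketch; the function names `mem_bandwidth_gbs` and `pcie4_gbits` are hypothetical helpers, and the results are theoretical peaks rather than measured numbers.

```shell
# Peak theoretical memory bandwidth in GB/s:
# transfers per second (MT/s) x bus width in bytes, divided by 1000.
mem_bandwidth_gbs() {
  echo $(( $1 * $2 / 8 / 1000 ))
}

# Raw PCIe 4.0 rate in Gb/s: 16 GT/s per lane with 128b/130b encoding.
pcie4_gbits() {
  echo $(( $1 * 16 * 128 / 130 ))
}

echo "LPDDR5X-8000, 256-bit: $(mem_bandwidth_gbs 8000 256) GB/s"
echo "PCIe 4.0 x4: $(pcie4_gbits 4) Gb/s"
```

A 100 GbE link carries 12.5 GB/s at best, roughly a twentieth of the 256 GB/s local memory peak, so remote memory is best treated as a slower tier.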
Topology
Nodes connect in a daisy-chain topology. Each node's NIC bridges to the next via Direct Attach Copper (DAC) or fiber. The chain terminates at the first and last nodes; no central switch is required for a 2-4 node cluster, though a leaf-spine topology is recommended beyond 4 nodes.
[Node 1] --- NIC --- DAC --- NIC --- [Node 2] --- NIC --- DAC --- [Node N]
│                                           │
└──────── OCuLink PCIe 4.0 x4 ──────────────┘
# For clusters >4 nodes, use a dedicated RoCE switch:
[Node 1] ─── NIC ───┐
[Node 2] ─── NIC ───┤
[Node N] ─── NIC ───┘─── RoCE Switch ─── Uplink
Memory Topology
Each Strix Halo node exposes 96 GB of its local LPDDR5X pool to remote peers via RDMA. With 8 nodes, the cluster presents 8 × 96 GB = 768 GB of pooled memory visible to all participating processes. Coherence is handled at the application level: the distributed inference framework, not the hardware, manages page placement and migration.
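The pool size scales linearly with node count, which makes a quick capacity check easy to sketch. The helpers below (`pooled_gb`, `fits`) are hypothetical names; the estimate counts model weights only and ignores KV cache and activations, so it is a lower bound on what a model needs.

```shell
# Pooled memory in GB: node count x GB exposed per node (96 by default).
pooled_gb() {
  echo $(( $1 * ${2:-96} ))
}

# fits <nodes> <params_in_billions> <bytes_per_param>
# Compares weight size (params x bytes per parameter) against the pool.
fits() {
  pool=$(pooled_gb "$1")
  need=$(( $2 * $3 ))
  if [ "$need" -le "$pool" ]; then
    echo "fits: ${need} GB of ${pool} GB"
  else
    echo "too large: ${need} GB > ${pool} GB"
  fi
}

pooled_gb 8      # 8 nodes at 96 GB each
fits 4 70 2      # 70B parameters in bfloat16 on 4 nodes
fits 1 70 2      # the same model on a single node
```

The three calls print `768`, `fits: 140 GB of 384 GB`, and `too large: 140 GB > 96 GB`.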
SOFTWARE STACK
The software stack spans four layers: the host OS, the ROCm compute platform, the RDMA/runtime layer, and the inference framework.
OS & Drivers
Ubuntu 24.04 LTS is the reference platform. The amdgpu kernel module and ROCm DKMS packages handle GPU compute; the MLNX_OFED or ice-rdma driver stack provides the RDMA transport.
# Ubuntu 24.04 LTS - install ROCm 6.3
wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/noble/amdgpu-install_6.3.60300-1_all.deb
sudo dpkg -i amdgpu-install_6.3.60300-1_all.deb
sudo amdgpu-install --usecase=rocm
# Verify ROCm installation
sudo rocminfo | grep -i "gfx"
/opt/rocm/bin/rocm-smi
Variable Graphics Memory
Strix Halo allows the system BIOS to carve a variable portion of LPDDR5X as dedicated GPU memory. Set this to the maximum available value (typically 96 GB) to expose the full pool to ROCm.
# BIOS settings (varies by motherboard vendor)
# 1. Enter BIOS setup (DEL/F2 at boot)
# 2. Advanced > AMD CBS > NBIO Common Options
# 3. GFX Configuration > iGPU Configuration > Enabled
# 4. UMA Frame Buffer Size > 96G (or maximum)
# 5. BDAT > ACPI GFX Table > Enabled
# 6. Save & Exit
# Verify available GPU memory
sudo rocm-smi --showmeminfo vram
RDMA / RoCE v2 Stack
RoCE v2 encapsulates RDMA traffic in UDP/IP packets, giving routable, kernel-bypass memory access between nodes. The encapsulation itself does not prevent packet loss, so configure the NIC with explicit congestion notification (ECN) and priority flow control (PFC) to keep the fabric lossless.
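ECN and the traffic class used for PFC both travel in the IP ToS byte: the upper six bits are the DSCP value and the lower two are the ECN field. As a small sketch (the helper names are hypothetical), the ToS value 106 that this guide sets later decomposes as follows.

```shell
# Split an IP ToS byte into DSCP (upper 6 bits) and ECN (lower 2 bits).
tos_to_dscp() { echo $(( $1 >> 2 )); }
tos_to_ecn()  { echo $(( $1 & 3 )); }

# Rebuild a ToS byte from a DSCP value and an ECN value.
dscp_ecn_to_tos() { echo $(( ($1 << 2) | $2 )); }

tos=106
echo "ToS $tos -> DSCP $(tos_to_dscp "$tos"), ECN $(tos_to_ecn "$tos")"
```

This prints `ToS 106 -> DSCP 26, ECN 2`. ECN value 2 is ECT(0), meaning the sender is ECN-capable. Assuming the common default mapping where priority is DSCP divided by 8, DSCP 26 lands on priority 3, which is the priority PFC is enabled for in the `mlnx_qos` command in this guide.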
# Install Mellanox OFED (ConnectX-5)
wget https://content.mellanox.com/ofed/MLNX_OFED-24.10/MLNX_OFED_LINUX-24.10-ubuntu24.04-x86_64.tgz
tar xzf MLNX_OFED_LINUX-24.10-*.tgz
cd MLNX_OFED_LINUX-24.10-*/
sudo ./mlnxofedinstall --force
# Or for Intel E810 (ice driver)
sudo apt install ice-rdma
sudo modprobe irdma
# Verify RDMA devices
ibstat
ibv_devinfo
RDMA / RoCE v2
RoCE v2 configuration requires multipath routing between nodes and lossless Ethernet fabric setup. This section covers both single-NIC and multi-NIC topologies.
Network Configuration
Each node must have static IPs for its RDMA interfaces and a routing table that maps remote memory regions to the correct NIC port.
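For a daisy chain, one way to keep static addressing predictable is to give every point-to-point link its own subnet. The sketch below assumes a hypothetical scheme, not one prescribed by this guide: link k joins node k and node k+1 and uses 10.0.k.0/24, with `.1` on the lower-numbered node and `.2` on the higher one. The function name `chain_addresses` is likewise illustrative.

```shell
# chain_addresses <node> <total_nodes>
# Prints one "link<k> <address>" line per RDMA port the node needs.
chain_addresses() {
  node=$1
  total=$2
  # Link toward the previous node (absent on the first node).
  if [ "$node" -gt 1 ]; then
    echo "link$(( node - 1 )) 10.0.$(( node - 1 )).2/24"
  fi
  # Link toward the next node (absent on the last node).
  if [ "$node" -lt "$total" ]; then
    echo "link${node} 10.0.${node}.1/24"
  fi
}

for n in 1 2 3 4; do
  echo "node $n:"
  chain_addresses "$n" 4
done
```

End nodes get one address and interior nodes get two, which matches the number of NIC ports each position in the chain uses.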
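# /etc/netplan/01-rdma.yaml
# Both stanzas match on driver only; add a macaddress or name key under
# match to bind each stanza to a single port.
network:
  version: 2
  ethernets:
    rdma0:
      match:
        driver: mlx5_core
      mtu: 9000
      addresses:
        - 10.0.1.1/24
    rdma1:
      match:
        driver: mlx5_core
      mtu: 9000
      addresses:
        - 10.0.2.1/24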
# Apply netplan
sudo netplan apply
# Configure RoCE on each port (Mellanox)
sudo mlxreglim -d /dev/mst/mt4125_pciconf0 --reg_name PFC_CONTROL --set "local_port=1,action=1,tc=3"
sudo mlnx_qos -i ens1f0 --pfc 0,0,0,1,0,0,0,0
sudo mlnx_qos -i ens1f0 --trust dscp
# Enable ECN (the redirect must run as root, so pipe through sudo tee)
echo 1 | sudo tee /sys/class/net/ens1f0/ecn/roce_enable
echo 2 | sudo tee /sys/class/net/ens1f0/ecn/roce_tos
Verify RDMA Connectivity
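# Test the RDMA link with rping (ibping relies on InfiniBand LIDs and does not work over RoCE)
# Server (node 1):
rping -s -a 10.0.1.1 -v
# Client (node 2):
rping -c -a 10.0.1.1 -v -C 5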
# Bandwidth test (server on node 1, client on node 2)
# Server:
ib_write_bw -d mlx5_0 -p 18515 --report_gbits
# Client:
ib_write_bw -d mlx5_0 -p 18515 10.0.1.1 --report_gbits
# Expected output (100 GbE link; a NIC behind OCuLink PCIe 4.0 x4 is capped near 63 Gb/s):
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# #bytes #iterations BW peak[Gb/sec]
# 65536 5000 92.34
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
GID Index Configuration
vLLM and Ray require the correct GID index for RoCE v2. Set the active GID to the IPv4 entry.
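Picking the index by eye is error-prone across many nodes, so the selection can be scripted. The sketch below parses a `show_gids`-style listing and returns the index of the first RoCE v2 entry that carries an IPv4 address. The sample text is illustrative, the function name `pick_rocev2_gid` is hypothetical, and exporting `NCCL_IB_GID_INDEX` assumes the collective library in use (NCCL or RCCL) reads that variable.

```shell
# Print the INDEX column (field 3) of the first row that has both
# a "v2" field and a dotted-quad IPv4 field.
pick_rocev2_gid() {
  awk '
    {
      has_v2 = 0; has_ipv4 = 0
      for (i = 4; i <= NF; i++) {
        if ($i == "v2") has_v2 = 1
        if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) has_ipv4 = 1
      }
      if (has_v2 && has_ipv4) { print $3; exit }
    }'
}

# Illustrative listing; on a live node pipe `show_gids mlx5_0` instead.
sample='DEV PORT INDEX GID IPv4
mlx5_0 1 1 fe80::1 link
mlx5_0 1 3 10.0.1.1 v2'

idx=$(printf '%s\n' "$sample" | pick_rocev2_gid)
echo "RoCE v2 IPv4 GID index: $idx"
export NCCL_IB_GID_INDEX="$idx"   # assumption: honored by the collective library
```

With the sample listing this selects index 3. Scanning fields instead of fixed columns keeps the parser tolerant of listings that include extra columns.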
# List GID entries
show_gids mlx5_0
# Expected:
# DEV PORT INDEX GID IPv4
# mlx5_0 1 1 fe80::... link
# mlx5_0 1 3 10.0.1.1 v2
# Set the default RoCE ToS for RDMA-CM connections on this device (106 = DSCP 26, ECN 2)
sudo cma_roce_tos -d mlx5_0 -t 106
# Persist across reboots via modprobe options
echo 'options mlx5_core roce_tos=106' | sudo tee /etc/modprobe.d/mlx5-roce.conf
DISTRIBUTED INFERENCE
There are two primary paths for distributed inference across the cluster: vLLM with Ray for high-throughput serving, and llama.cpp with MPI or RPC for research and local usage.
vLLM with Ray
Ray provides the distributed scheduler. vLLM uses tensor parallelism across nodes, sharding each transformer layer across the pooled memory of the cluster.
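Because tensor parallelism splits every layer across ranks, two quick checks are useful before launching: how much weight data each node holds, and whether the attention heads divide evenly by the parallel size. The sketch below uses hypothetical helper names, and the head count of 64 for a 70B Llama model is an assumption to verify against the model's config.

```shell
# shard_gb <params_in_billions> <bytes_per_param> <tp_size>
# Weight GB held by each rank, rounded up.
shard_gb() {
  total=$(( $1 * $2 ))
  echo $(( (total + $3 - 1) / $3 ))
}

# heads_divisible <num_attention_heads> <tp_size>
# Succeeds when the heads split evenly across ranks.
heads_divisible() {
  [ $(( $1 % $2 )) -eq 0 ]
}

tp=4
echo "per-node weights: $(shard_gb 70 2 "$tp") GB"
if heads_divisible 64 "$tp"; then
  echo "64 heads split evenly across $tp ranks"
fi
```

For a 70B model in bfloat16 on 4 nodes this reports 35 GB of weights per node, leaving the rest of each node's pool for KV cache and activations.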
# Install vLLM with ROCm support
pip install "vllm[rocm]" "ray[default]"
# Start Ray head node (node 1)
ray start --head --port=6379 --num-gpus=1
# Expected: Ray runtime started, dashboard at http://10.0.1.1:8265
# Join worker nodes to the cluster
ray start --address=10.0.1.1:6379 --num-gpus=1
# Launch vLLM with tensor parallelism across 4 nodes
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--distributed-executor-backend ray \
--max-model-len 32768 \
--gpu-memory-utilization 0.95 \
--dtype bfloat16
# Verify all GPUs are visible by launching in eager mode (--enforce-eager skips graph capture for a faster start)
vllm serve NousResearch/Hermes-3-Llama-3.1-70B \
--tensor-parallel-size 4 \
--distributed-executor-backend ray \
--enforce-eager
Monitor Ray Cluster
# Check cluster status
ray status
# Expected:
# ====== Autoscaler status ======
# Node status
# ---------------------------------------------------
# Healthy:
# 1 node_10.0.1.1
# 1 node_10.0.2.1
# 1 node_10.0.3.1
# 1 node_10.0.4.1
# Resources:
# Total: 4.0 GPU, 256.0 CPU, 4.0 object_store_memory
# View dashboard (browser)
# http://10.0.1.1:8265
llama.cpp with MPI
llama.cpp with OpenMPI distributes model layers across nodes using the ROCm backend. The MPI implementation uses RDMA for fast collective communication.
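The `mpirun` invocation in this section reads a hostfile with one line per node. A minimal sketch for generating it from a list of node addresses follows; `make_hostfile` is a hypothetical helper, and one slot per node assumes a single GPU rank on each machine.

```shell
# make_hostfile <ip> [<ip> ...]
# Emits one OpenMPI hostfile line per node, one slot each.
make_hostfile() {
  for ip in "$@"; do
    printf '%s slots=1\n' "$ip"
  done
}

make_hostfile 10.0.1.1 10.0.2.1 10.0.3.1 10.0.4.1
```

Redirect the output to `hostfile` and keep the number of lines equal to the `-np` value passed to `mpirun`.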
# Build llama.cpp with ROCm and MPI support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
-DGGML_HIP=ON \
-DGGML_HIP_PLATFORM=amd \
-DGGML_MPI=ON \
-DGPU_TARGETS=gfx1151
cmake --build build -j$(nproc)
# Verify MPI support
./build/bin/llama-cli --help | grep mpi
# Expected: --mpi-run (enable MPI distribution)
# Run across 4 nodes
# On the host that holds the model:
mpirun -hostfile hostfile -np 4 \
./build/bin/llama-cli \
--mpi-run \
--model /models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--temp 0.7 \
--ctx-size 32768 \
--n-gpu-layers 999
# hostfile (one entry per node):
# 10.0.1.1 slots=1
# 10.0.2.1 slots=1
# 10.0.3.1 slots=1
# 10.0.4.1 slots=1
llama.cpp with RPC
The RPC backend is an alternative to MPI, using a server-client architecture where each node exposes its compute via TCP.
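The client takes the worker endpoints as a single comma-separated `host:port` list, which is easy to mistype by hand. The sketch below builds that list from node addresses and adds an optional reachability probe; both function names are hypothetical, and `port_open` assumes `bash` and the coreutils `timeout` command are available.

```shell
# rpc_endpoints <port> <ip> [<ip> ...]
# Joins the addresses into a comma-separated host:port list.
rpc_endpoints() {
  port=$1
  shift
  out=""
  for ip in "$@"; do
    out="${out:+$out,}${ip}:${port}"
  done
  echo "$out"
}

# port_open <host> <port>
# Succeeds if a TCP connection opens within 2 seconds (uses bash's /dev/tcp).
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

rpc_endpoints 50052 10.0.1.1 10.0.2.1 10.0.3.1 10.0.4.1
```

The final call prints `10.0.1.1:50052,10.0.2.1:50052,10.0.3.1:50052,10.0.4.1:50052`, which can be passed straight to `--rpc`. Running `port_open` against each worker first catches servers that are not yet listening.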
# Build llama.cpp with RPC support
cmake -B build \
-DGGML_HIP=ON \
-DGGML_HIP_PLATFORM=amd \
-DGGML_RPC=ON \
-DGPU_TARGETS=gfx1151
cmake --build build -j$(nproc)
# Start an RPC server on each worker node (the binary built by GGML_RPC is rpc-server; it takes no --n-gpu-layers option)
./build/bin/rpc-server \
--host 0.0.0.0 \
--port 50052
# Launch the main process pointing to all RPC servers
./build/bin/llama-cli \
--model /models/llama-3.3-70b.Q4_K_M.gguf \
--rpc "10.0.1.1:50052,10.0.2.1:50052,10.0.3.1:50052,10.0.4.1:50052" \
--temp 0.7 \
--ctx-size 32768 \
--n-gpu-layers 999
RESOURCES
Official documentation and source repositories for the core components of the Strix Halo Cluster stack.