TECHNICAL GUIDE v1.0

STRIX HALO CLUSTER

A complete hardware and software guide for daisy-chaining AMD Ryzen AI Max+ 395 (Strix Halo) systems over RDMA/RoCE v2 to pool LPDDR5X memory for local LLM inference.

TARGET USE: Local LLM inference with pooled unified memory
NODES: 2-8 daisy-chained Strix Halo systems
MEMORY POOL: Up to 1 TB of LPDDR5X across the cluster (768 GB GPU-addressable at 96 GB per node)

01

HARDWARE

Each node is built around the AMD Ryzen AI Max+ 395, a Strix Halo APU with 16 Zen 5 cores and 40 RDNA 3.5 compute units sharing a unified LPDDR5X memory pool over a 256-bit bus. The cluster interconnects via RDMA over Converged Ethernet (RoCE v2).


Component Requirements

Component   Specification                      Notes
---------   -------------                      -----
APU         AMD Ryzen AI Max+ 395              16C/32T Zen 5, 40 CU RDNA 3.5
Memory      LPDDR5X-8000                       Up to 128 GB per node, 256-bit bus
NIC         Mellanox ConnectX-5 / Intel E810   25/100 GbE, RoCE v2 support
Adapter     OCuLink PCIe 4.0 x4                External GPU/NIC connectivity
Storage     NVMe M.2 Gen 5                     OS + model weights
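
The NIC and Adapter entries interact: a NIC attached through OCuLink PCIe 4.0 x4 cannot move data faster than the PCIe link itself. The following is a rough sketch of that ceiling, assuming 16 GT/s per lane and 128b/130b line encoding for PCIe 4.0 and ignoring packet overhead. The function name pcie_gbps is illustrative.

```shell
# Raw PCIe link bandwidth in Gb/s, before protocol overhead.
# Usage: pcie_gbps <GT/s per lane> <lanes>
# Assumes 128b/130b line encoding (PCIe 3.0 and later).
pcie_gbps() {
    LC_ALL=C awk -v gts="$1" -v lanes="$2" \
        'BEGIN { printf "%.1f\n", gts * lanes * 128 / 130 }'
}

pcie_gbps 16 4    # PCIe 4.0 x4 (OCuLink): prints 63.0
```

At roughly 63 Gb/s, the x4 link rather than a 100 GbE port sets the upper bound on per-node RDMA throughput; a 25 GbE or 50 GbE port fits comfortably within it.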

Topology

Nodes connect in a daisy-chain topology: each node's NIC links directly to its neighbor over Direct Attach Copper (DAC) or fiber, so interior nodes need two ports (one per neighbor) while the first and last nodes need only one. No central switch is required for a 2-4 node cluster; beyond 4 nodes, a switched topology is recommended.

[Node 1] --- NIC --- DAC --- NIC --- [Node 2] --- NIC --- DAC --- [Node N]
    │                                               │
    └──────── OCuLink PCIe 4.0 x4 ──────────────┘

# For clusters >4 nodes, use a dedicated RoCE switch:
[Node 1] ─── NIC ───┐
[Node 2] ─── NIC ───┤
[Node N] ─── NIC ───┘─── RoCE Switch ─── Uplink
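
A switchless chain is a series of point-to-point links, so one possible way to address it is a separate /24 per link. The sketch below prints such a plan. The 10.0.<link>.x scheme and the function name chain_plan are assumptions for illustration only, not addresses the rest of this guide depends on.

```shell
# Print a point-to-point addressing plan for an N-node daisy chain.
# Link k joins node k and node k+1 on subnet 10.0.k.0/24.
chain_plan() {
    nodes=$1
    k=1
    while [ "$k" -lt "$nodes" ]; do
        printf 'link %d: node %d = 10.0.%d.1/24, node %d = 10.0.%d.2/24\n' \
            "$k" "$k" "$k" "$((k + 1))" "$k"
        k=$((k + 1))
    done
}

chain_plan 4    # a 4-node chain has 3 links
```

An N-node chain has N-1 links. Traffic between non-adjacent nodes has to be forwarded by the nodes in between, which is one reason a switch becomes attractive as the chain grows.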

Memory Topology

Each Strix Halo node exposes 96 GB of its local LPDDR5X pool to remote peers via RDMA. With 8 nodes, the cluster offers 768 GB (8 x 96 GB) of aggregate memory to participating processes. This is not a single hardware-coherent address space: memory is coherent at the application level only, and the distributed inference framework manages data placement and movement between nodes.
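
The pool size is plain arithmetic over the node count and the per-node share. A minimal sketch, with illustrative function names, that also checks whether a model's weights fit:

```shell
# Aggregate GPU-addressable memory in GB: nodes x per-node share.
# Usage: pool_gb <nodes> <per_node_gb>
pool_gb() {
    echo $(( $1 * $2 ))
}

# Succeeds when <model_gb> of weights fit in a pool of <nodes> x <per_node_gb>.
# Usage: fits_in_pool <model_gb> <nodes> <per_node_gb>
fits_in_pool() {
    [ "$1" -le "$(pool_gb "$2" "$3")" ]
}

pool_gb 8 96    # prints 768
fits_in_pool 140 2 96 && echo "140 GB of weights fit on 2 nodes"
```

The 140 GB figure corresponds to a 70B-parameter model at 2 bytes per parameter. The check covers weights only; KV cache and activations need additional headroom.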


02

SOFTWARE STACK

The software stack spans four layers: the host OS, the ROCm compute platform, the RDMA/runtime layer, and the inference framework.


OS & Drivers

Ubuntu 24.04 LTS is the reference platform. The amdgpu kernel module and ROCm DKMS packages handle GPU compute; the MLNX_OFED stack (Mellanox ConnectX) or the irdma driver (Intel E810) provides the RDMA transport.

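# Ubuntu 24.04 LTS - install ROCm 6.3
# (confirm in AMD's compatibility matrix that the chosen release supports Strix Halo)
wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/noble/amdgpu-install_6.3.60300-1_all.deb
sudo apt install ./amdgpu-install_6.3.60300-1_all.deb
sudo amdgpu-install --usecase=rocm

# Give the current user GPU access without sudo (takes effect after logging in again)
sudo usermod -aG render,video "$USER"

# Verify ROCm installation
rocminfo | grep -i "gfx"
/opt/rocm/bin/rocm-smi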

Variable Graphics Memory

Strix Halo allows the system BIOS to carve a variable portion of LPDDR5X as dedicated GPU memory. Set this to the maximum available value (typically 96 GB) to expose the full pool to ROCm.

# BIOS settings (varies by motherboard vendor)
# 1. Enter BIOS setup (DEL/F2 at boot)
# 2. Advanced > AMD CBS > NBIO Common Options
# 3. GFX Configuration > iGPU Configuration > Enabled
# 4. UMA Frame Buffer Size > 96G (or maximum)
# 5. BDAT > ACPI GFX Table > Enabled
# 6. Save & Exit

# Verify available GPU memory
sudo rocm-smi --showmeminfo vram
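
To confirm that the BIOS setting took effect, the reported byte count can be converted to GiB. This is a small sketch that assumes rocm-smi prints a line of the form "GPU[0] : VRAM Total Memory (B): <bytes>"; the exact layout may differ between ROCm releases, and the function name vram_gib is illustrative.

```shell
# Convert the "VRAM Total Memory (B)" line(s) read from stdin to GiB.
# Assumes the byte count is the last ": "-separated field on the line.
vram_gib() {
    LC_ALL=C awk -F': ' '/VRAM Total Memory/ { printf "%.1f\n", $NF / 1073741824 }'
}

# Usage on a node with ROCm installed:
#   rocm-smi --showmeminfo vram | vram_gib
```

A node configured for the 96 GB carve-out should report a value close to 96.0.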

RDMA / RoCE v2 Stack

RoCE v2 encapsulates RDMA traffic in UDP/IP packets (UDP destination port 4791), which makes it routable while keeping kernel-bypass memory access between nodes. RoCE performs best on a lossless fabric, so configure the NIC with explicit congestion notification (ECN) and priority flow control (PFC) for reliable transport.

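# Install Mellanox OFED (ConnectX-5)
# Adjust the version in the URL to the current release listed on NVIDIA's download page
wget https://content.mellanox.com/ofed/MLNX_OFED-24.10/MLNX_OFED_LINUX-24.10-ubuntu24.04-x86_64.tgz
tar xzf MLNX_OFED_LINUX-24.10-*.tgz
cd MLNX_OFED_LINUX-24.10-*/
sudo ./mlnxofedinstall --force

# Or for Intel E810: the irdma kernel driver ships with the Ubuntu kernel;
# rdma-core and the utility packages provide the userspace side
sudo apt install rdma-core ibverbs-utils infiniband-diags
sudo modprobe irdma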

# Verify RDMA devices
ibstat
ibv_devinfo

03

RDMA / RoCE v2

RoCE v2 configuration requires IP connectivity between nodes and a lossless Ethernet fabric; nodes with more than one RDMA port also need a route per port. This section covers both single-NIC and multi-NIC topologies.


Network Configuration

Each node must have static IPs for its RDMA interfaces and a routing table that maps each peer's subnet to the correct NIC port.

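# /etc/netplan/01-rdma.yaml
# Ports are keyed by interface name because matching two ports on the driver alone is ambiguous.
# ens1f1 is an assumed name for the second port; check `ip link` and give every node its own addresses.
network:
  version: 2
  ethernets:
    ens1f0:
      mtu: 9000
      addresses:
        - 10.0.1.1/24
    ens1f1:
      mtu: 9000
      addresses:
        - 10.0.2.1/24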

# Apply netplan
sudo netplan apply
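
An MTU of 9000 only helps if every hop passes jumbo frames unfragmented. One way to check is a ping with the don't-fragment bit set and the largest payload that fits. The sketch below computes that payload; the peer address 10.0.1.2 and the function names are hypothetical.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Print the ping command to run; -M do sets the don't-fragment bit (Linux iputils).
# Usage: jumbo_ping_cmd <mtu> <peer_ip>
jumbo_ping_cmd() {
    printf 'ping -M do -c 3 -s %d %s\n' "$(icmp_payload "$1")" "$2"
}

jumbo_ping_cmd 9000 10.0.1.2    # prints: ping -M do -c 3 -s 8972 10.0.1.2
```

If the printed command reports that the message is too long, some interface on the path is still at a smaller MTU.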

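# Configure RoCE on each port (Mellanox): enable PFC on priority 3 and trust DSCP markings
sudo mlnx_qos -i ens1f0 --pfc 0,0,0,1,0,0,0,0
sudo mlnx_qos -i ens1f0 --trust dscp

# Enable ECN on priority 3 for the notification point (np) and reaction point (rp).
# tee is used because a plain shell redirect would not run with root privileges.
# These sysfs paths are the ones provided by MLNX_OFED and may differ with the inbox driver.
echo 1 | sudo tee /sys/class/net/ens1f0/ecn/roce_np/enable/3
echo 1 | sudo tee /sys/class/net/ens1f0/ecn/roce_rp/enable/3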
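
DSCP and ECN markings share the IP ToS byte: DSCP occupies the upper six bits and ECN the lower two. The ToS value 106 that this guide later passes to cma_roce_tos is therefore DSCP 26 with ECN bits 10, that is, ECT(0). A short sketch of the arithmetic, with illustrative function names:

```shell
# Compose an IP ToS byte from a 6-bit DSCP and the 2-bit ECN field.
# Usage: tos_from <dscp> <ecn>
tos_from() {
    echo $(( ($1 << 2) | $2 ))
}

# Split a ToS byte back into "DSCP ECN".
tos_split() {
    echo "$(( $1 >> 2 )) $(( $1 & 3 ))"
}

tos_from 26 2    # prints 106
tos_split 106    # prints 26 2
```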

Verify RDMA Connectivity

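# Test RDMA connectivity with rping (ibping relies on InfiniBand addressing and does not work over RoCE)
rping -s -a 10.0.1.1 -v -C 5    # server, on node 1
rping -c -a 10.0.1.1 -v -C 5    # client, on a peer node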

# Bandwidth test (server on node 1, client on node 2)
# Server:
ib_write_bw -d mlx5_0 -p 18515 --report_gbits
# Client:
ib_write_bw -d mlx5_0 -p 18515 10.0.1.1 --report_gbits

# Example output for a 100 GbE NIC in a full-width slot (behind OCuLink PCIe 4.0 x4 the ceiling is about 63 Gb/s):
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# #bytes #iterations BW peak[Gb/sec]
# 65536  5000         92.34
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
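
For scripted checks, the peak figure can be pulled out of the report and compared with a floor. This is a minimal sketch that assumes the peak bandwidth is the third column of each data row, as in the sample above. The function names and the threshold are illustrative.

```shell
# Extract the peak bandwidth (third column) from ib_write_bw output on stdin.
# Data rows are the lines whose first field is the numeric message size.
bw_peak() {
    awk '$1 ~ /^[0-9]+$/ { print $3 }'
}

# Succeeds when the measured figure meets a minimum, both in Gb/s.
# Usage: bw_ok <measured> <minimum>
bw_ok() {
    LC_ALL=C awk -v bw="$1" -v min="$2" 'BEGIN { exit !(bw + 0 >= min + 0) }'
}

# Usage on the client node:
#   peak=$(ib_write_bw -d mlx5_0 -p 18515 10.0.1.1 --report_gbits | bw_peak)
#   bw_ok "$peak" 20 || echo "link slower than expected: $peak Gb/s"
```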

GID Index Configuration

The collective communication library used by vLLM and Ray (RCCL, which follows NCCL conventions) needs the correct GID index for RoCE v2. Use the index of the IPv4 RoCE v2 entry.

# List GID entries
show_gids mlx5_0

# Expected (abridged):
# DEV     PORT   INDEX   GID          IPv4       VER
# mlx5_0  1      1       fe80::...               v2
# mlx5_0  1      3       ::ffff:...   10.0.1.1   v2

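# Set the default RoCE ToS for RDMA-CM connections (106 = DSCP 26 with ECN-capable bits)
sudo cma_roce_tos -d mlx5_0 -t 106

# Select the IPv4 RoCE v2 GID index from the listing for RCCL/NCCL
export NCCL_IB_GID_INDEX=3

# cma_roce_tos does not persist across reboots: rerun it at boot, and export the
# variable in the environment that launches Ray and vLLM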
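
Rather than reading the index off the listing by hand, it can be extracted from the show_gids output. The sketch below assumes the index is the third column and that the wanted row contains both the IPv4 address and the literal v2, which holds for the sample listing above. The function name is illustrative.

```shell
# Print the GID index of the RoCE v2 entry bound to a given IPv4 address.
# Reads show_gids output on stdin.
# Usage: show_gids mlx5_0 | roce_v2_gid_index 10.0.1.1
roce_v2_gid_index() {
    awk -v ip="$1" '{
        has_ip = 0; has_v2 = 0
        for (i = 1; i <= NF; i++) {
            if ($i == ip)   has_ip = 1
            if ($i == "v2") has_v2 = 1
        }
        if (has_ip && has_v2) { print $3; exit }
    }'
}
```

The function prints nothing when no matching row exists, which usually means the interface has no IPv4 address yet or the port is down.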

04

DISTRIBUTED INFERENCE

There are two primary paths for distributed inference across the cluster: vLLM with Ray for high-throughput serving, and llama.cpp with MPI or RPC for research and local use.


vLLM with Ray

Ray provides the distributed scheduler. vLLM uses tensor parallelism across nodes, sharding each transformer layer across the pooled memory of the cluster.

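# Install vLLM with ROCm support and Ray.
# The PyPI vllm wheel targets CUDA and has no "rocm" extra: build vLLM from source
# following its ROCm installation guide, or run AMD's rocm/vllm container image.
pip install "ray[default]"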

# Start Ray head node (node 1)
ray start --head --port=6379 --num-gpus=1
# Expected: Ray runtime started, dashboard at http://10.0.1.1:8265

# Join worker nodes to the cluster
ray start --address=10.0.1.1:6379 --num-gpus=1

# Launch vLLM with tensor parallelism across 4 nodes
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --distributed-executor-backend ray \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --dtype bfloat16
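
Whether this configuration fits can be estimated before launching. The sketch below computes the approximate weight shard per node under tensor parallelism and the memory vLLM may claim at the chosen utilization. The function names are illustrative, and the estimate ignores activations and runtime overhead.

```shell
# Approximate weight footprint per node in GB under tensor parallelism:
# parameters (billions) x bytes per parameter / tensor-parallel size.
# Usage: shard_gb <params_billion> <bytes_per_param> <tp_size>
shard_gb() {
    LC_ALL=C awk -v p="$1" -v b="$2" -v tp="$3" 'BEGIN { printf "%.1f\n", p * b / tp }'
}

# Memory vLLM may claim per node in GB: VRAM x --gpu-memory-utilization.
# Usage: budget_gb <vram_gb> <utilization>
budget_gb() {
    LC_ALL=C awk -v vram="$1" -v util="$2" 'BEGIN { printf "%.1f\n", vram * util }'
}

shard_gb 70 2 4      # 70B parameters, bfloat16, TP=4: prints 35.0
budget_gb 96 0.95    # 96 GB at 0.95 utilization: prints 91.2
```

With about 35 GB of weights per node against a 91.2 GB budget, roughly 56 GB per node remains for KV cache at the requested context length.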

# Smoke-test the same 4-node layout in eager mode (skips graph capture for a faster start)
vllm serve NousResearch/Hermes-3-Llama-3.1-70B \
  --tensor-parallel-size 4 \
  --distributed-executor-backend ray \
  --enforce-eager
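
Cross-node tensor parallelism only benefits from the RoCE fabric if the communication library is steered onto it. The sketch below prints the NCCL-style environment variables commonly used for that, which RCCL also reads. Whether a given RCCL build uses the RDMA transport on this hardware is an assumption to verify, and the interface, device and index values are the examples used elsewhere in this guide.

```shell
# Print environment settings that steer RCCL/NCCL traffic onto the RoCE NIC.
# Usage: rccl_env <netdev> <rdma_device> <gid_index>
rccl_env() {
    printf 'export NCCL_SOCKET_IFNAME=%s\n' "$1"    # interface for bootstrap sockets
    printf 'export GLOO_SOCKET_IFNAME=%s\n' "$1"    # same interface for Gloo collectives
    printf 'export NCCL_IB_HCA=%s\n' "$2"           # RDMA device to use
    printf 'export NCCL_IB_GID_INDEX=%s\n' "$3"     # RoCE v2 GID index
}

rccl_env ens1f0 mlx5_0 3
```

Running eval "$(rccl_env ens1f0 mlx5_0 3)" on every node before ray start makes the settings visible to the Ray workers that vLLM spawns.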

Monitor Ray Cluster

# Check cluster status
ray status

# Expected:
# ====== Autoscaler status ======
# Node status
# ---------------------------------------------------
# Healthy:
#  1 node_10.0.1.1
#  1 node_10.0.2.1
#  1 node_10.0.3.1
#  1 node_10.0.4.1
# Resources:
#  Total: 4.0 GPU, 128.0 CPU, 4.0 object_store_memory

# View dashboard (browser)
# http://10.0.1.1:8265

llama.cpp with MPI

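llama.cpp with OpenMPI distributes model layers across nodes using the ROCm backend, with MPI handling communication between nodes (over RDMA when the MPI library is built with an RDMA-capable transport). MPI support exists only in some llama.cpp versions, so confirm that your checkout still provides it; otherwise use the RPC backend described in the next section.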

# Build llama.cpp with ROCm and MPI support
# (gfx1151 is the GPU target for Strix Halo)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_PLATFORM=amd \
  -DGGML_MPI=ON \
  -DGPU_TARGETS=gfx1151
cmake --build build -j$(nproc)

# Verify that the build picked up the MPI option
grep -i mpi build/CMakeCache.txt

# Run across 4 nodes
# The binary and the model file must exist at the same paths on every node:
mpirun -hostfile hostfile -np 4 \
  ./build/bin/llama-cli \
  --model /models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --temp 0.7 \
  --ctx-size 32768 \
  --n-gpu-layers 999

# hostfile (one entry per node):
# 10.0.1.1 slots=1
# 10.0.2.1 slots=1
# 10.0.3.1 slots=1
# 10.0.4.1 slots=1
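
The hostfile can be generated from the node list instead of being written by hand. A minimal sketch with an illustrative function name:

```shell
# Emit an Open MPI hostfile with one slot per node address given as an argument.
make_hostfile() {
    for ip in "$@"; do
        printf '%s slots=1\n' "$ip"
    done
}

make_hostfile 10.0.1.1 10.0.2.1 10.0.3.1 10.0.4.1
```

Redirect the output to a file named hostfile in the directory where mpirun is started.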

llama.cpp with RPC

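The RPC backend is an alternative to MPI, using a client-server architecture in which each worker node exposes its GPU backend over TCP. It runs over the same Ethernet links as the RDMA traffic but does not itself use RDMA.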

# Build llama.cpp with RPC support (gfx1151 is the GPU target for Strix Halo)
cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_PLATFORM=amd \
  -DGGML_RPC=ON \
  -DGPU_TARGETS=gfx1151
cmake --build build -j$(nproc)

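# Start an RPC server on each worker node
# (the port is unauthenticated; bind to the cluster-facing address on untrusted networks)
./build/bin/rpc-server \
  --host 0.0.0.0 \
  --port 50052

# Launch the main process pointing to all RPC servers;
# --n-gpu-layers belongs here, since the client decides how many layers to offload
./build/bin/llama-cli \
  --model /models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --rpc "10.0.1.1:50052,10.0.2.1:50052,10.0.3.1:50052,10.0.4.1:50052" \
  --n-gpu-layers 999 \
  --temp 0.7 \
  --ctx-size 32768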
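
For larger clusters, the endpoint list passed to --rpc can be assembled from the node addresses. A short sketch with an illustrative function name:

```shell
# Join node addresses into the comma-separated host:port list that --rpc expects.
# Usage: rpc_endpoints <port> <ip>...
rpc_endpoints() {
    port=$1
    shift
    list=""
    for ip in "$@"; do
        list="${list:+$list,}$ip:$port"
    done
    printf '%s\n' "$list"
}

rpc_endpoints 50052 10.0.1.1 10.0.2.1 10.0.3.1 10.0.4.1
```

The output can be substituted directly, for example --rpc "$(rpc_endpoints 50052 10.0.1.1 10.0.2.1)".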

05

RESOURCES

Official documentation and source repositories for the core components of the Strix Halo Cluster stack.