Hacker News June 28, 2026 12:46 AM

AMD Strix Halo RDMA Cluster Setup Guide

This guide details how to configure a two-node AMD Strix Halo cluster linked via Intel E810 (RoCE v2) for distributed vLLM inference using Tensor Parallelism.

Key Note: The refresh_toolbox.sh script detects your InfiniBand/RDMA devices and automatically configures the container to expose them.

To fully utilize the Strix Halo cluster, it is helpful to understand the technologies involved:

Perform these steps on the Host OS (Fedora 43) of both nodes.

Note: These specific kernel versions were verified to work. Fedora 43 is recommended.

Install the core RDMA userspace tools. You do not need proprietary Intel drivers; the in-kernel drivers (ice for Ethernet, irdma for RDMA) are sufficient.
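A minimal sketch of the install-and-check step. The Fedora package names in the comment (rdma-core, libibverbs-utils, perftest) are an assumption, since the guide's own package list is not reproduced here, and missing_tools is a hypothetical helper that only reports which commands are absent from PATH.

```shell
# Assumed install command (run once per node; package names are an assumption):
#   sudo dnf install -y rdma-core libibverbs-utils perftest

# missing_tools: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools ibv_devices ibv_devinfo ib_send_lat` prints nothing when all three tools are installed.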

Use ethtool to check the current firmware version of your Intel E810 card.

Recommended Firmware: Ensure your firmware is at least as new as the version shown below (Firmware 4.91...). If your firmware is older, please update it using the Intel® Ethernet NVM Update Tool for E810 Series.
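The firmware check can be scripted roughly as follows. This is a sketch, assuming that `ethtool -i` prints a `firmware-version:` line whose first field is the NVM version; fw_version and fw_at_least are hypothetical helper names, and the 4.91 minimum is only the visible prefix of the version quoted above.

```shell
# fw_version: read `ethtool -i <iface>` output on stdin and print the first
# field of the firmware-version line (for example "4.91").
fw_version() {
    awk -F': *' '$1 == "firmware-version" { split($2, f, " "); print f[1]; exit }'
}

# fw_at_least: succeed if version $1 is >= minimum version $2.
# Relies on `sort -V` (version sort) being available.
fw_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

On a node this could be used as `fw_at_least "$(ethtool -i enp194s0np0 | fw_version)" 4.91 || echo "firmware update recommended"`, with the interface name taken from the guide's example.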

This guide assumes a subnet of 192.168.100.0/30.

Identify your interface: Run ip link to find your 100GbE card (e.g., enp194s0np0).

Verify Routing: Ensure the route exists on both:
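A sketch of the addressing and route check, under stated assumptions: node 1 takes 192.168.100.1 and node 2 takes 192.168.100.2 (the only two usable hosts in a /30), the NetworkManager connection is named after the interface, and has_route is a hypothetical helper.

```shell
# Assumed NetworkManager commands for node 1 (use 192.168.100.2/30 on node 2):
#   sudo nmcli connection modify enp194s0np0 ipv4.method manual ipv4.addresses 192.168.100.1/30
#   sudo nmcli connection up enp194s0np0

# has_route: read `ip route` output on stdin; succeed if subnet $1 is routed
# via device $2.
has_route() {
    awk -v net="$1" -v dev="$2" '
        $1 == net {
            for (i = 2; i < NF; i++)
                if ($i == "dev" && $(i + 1) == dev) found = 1
        }
        END { exit (found ? 0 : 1) }'
}
```

Running `ip route | has_route 192.168.100.0/30 enp194s0np0 && echo "route OK"` on each node confirms the route is bound to the 100GbE interface.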

1. BIOS Settings: Set the iGPU Memory Allocation to the minimum possible (512MB). We will use the GTT (Graphics Translation Table) to dynamically allocate system memory as "Unified Memory" for the GPU.

2. Kernel Parameters: Update GRUB to enable unified memory, optimize RDMA performance, and fix PCI resource allocation.

Edit /etc/default/grub and append to GRUB_CMDLINE_LINUX:
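The guide's exact parameter list is not reproduced here. As an illustration of the unified-memory part only, the sketch below assumes it is controlled through the amdgpu.gttsize (MiB) and ttm.pages_limit (4 KiB pages) kernel parameters and shows the unit arithmetic. gtt_params is a hypothetical helper, and the 124 GiB figure in the usage note is an example value, not a recommendation from the guide.

```shell
# gtt_params: print amdgpu/ttm kernel parameters for a GTT size given in GiB.
#   amdgpu.gttsize  is expressed in MiB
#   ttm.pages_limit is expressed in 4 KiB pages
gtt_params() {
    gib="$1"
    mib=$((gib * 1024))
    pages=$((gib * 1024 * 1024 / 4))   # GiB -> KiB, then 4 KiB per page
    printf 'amdgpu.gttsize=%s ttm.pages_limit=%s\n' "$mib" "$pages"
}

# After editing /etc/default/grub on Fedora, regenerate the config and reboot:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```

`gtt_params 124` prints `amdgpu.gttsize=126976 ttm.pages_limit=32505856`; both values describe the same 124 GiB in different units.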

Applications like Ray and NCCL use random high ports. It is easiest to trust the internal RDMA interface completely.
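One way to trust the interface, assuming the host uses firewalld (the Fedora default). trust_interface is a hypothetical helper that only prints the commands so they can be reviewed before being applied.

```shell
# trust_interface: print the firewalld commands that move interface $1 into
# the "trusted" zone and reload the rules. Nothing is executed here.
trust_interface() {
    printf 'firewall-cmd --permanent --zone=trusted --add-interface=%s\n' "$1"
    printf 'firewall-cmd --reload\n'
}
```

To apply the result, pipe it to a root shell, for example `trust_interface enp194s0np0 | sudo sh`, on both nodes.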

The cluster management and verification scripts rely on SSH to execute commands on remote nodes. You must configure passwordless SSH between both nodes (root or sudo-enabled user).
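A sketch of the key setup and a non-interactive check. The user name and peer address in the comments are assumptions, can_ssh_noninteractive is a hypothetical helper, and SSH_BIN is a hypothetical override used only to make the helper testable.

```shell
# One-time key setup (run on each node; user and peer address are assumptions):
#   ssh-keygen -t ed25519
#   ssh-copy-id user@192.168.100.2

# can_ssh_noninteractive: succeed only if host $1 accepts a login without a
# password prompt. BatchMode=yes makes ssh fail instead of asking.
can_ssh_noninteractive() {
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

For example, `can_ssh_noninteractive 192.168.100.2 || echo "passwordless SSH is not set up"` can be run from the head node before starting the cluster scripts.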

The toolbox container provided in this repo includes a critical patch: a custom-built librccl.so that enables gfx1151 (Strix Halo) support for RDMA (https://github.com/kyuz0/rocm-systems/tree/gfx1151-rccl), which is currently missing in upstream ROCm packages. This library is automatically compiled using the build-rccl GitHub Action in this repository, which generates the artifact that is then bundled into the Docker container.

Before proceeding to run the cluster, verify that RDMA is active and providing low latency (~5µs, versus ~70µs for TCP/IP over the same Ethernet link).

Run the provided verification script from the Head Node:

Note the large latency drop for RDMA: from tens of microseconds down to single-digit microseconds.
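Independently of the provided script, a quick sanity check is to confirm that the kernel has registered an RDMA device at all. The sketch below assumes devices appear under /sys/class/infiniband, which is where the kernel RDMA subsystem exposes them; list_rdma_devices is a hypothetical helper.

```shell
# list_rdma_devices: print RDMA device names found under a sysfs root
# (default /sys/class/infiniband). Returns non-zero if none are present.
list_rdma_devices() {
    root="${1:-/sys/class/infiniband}"
    found=1
    for d in "$root"/*; do
        [ -e "$d" ] || continue   # glob did not match anything
        basename "$d"
        found=0
    done
    return "$found"
}
```

An empty result on either node suggests the RDMA driver is not loaded, in which case latency benchmarks would only measure the TCP path.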

A TUI utility, start-vllm-cluster, is provided to manage the Ray cluster and vLLM.

Once the cluster is active (checked via Option 3):

If you see link issues, ensure your Intel E810 firmware is up to date using the Intel standard tools.

If you do not have dedicated 100GbE RDMA network cards, you can directly connect the two nodes using a high-quality Thunderbolt 4 / USB4 cable. This will create a thunderbolt0 network interface.

While it lacks the ultra-low, microsecond-level latency of RDMA, it provides significantly more bandwidth than standard 1GbE/5GbE Ethernet and is easier to configure.

Note: thunderbolt-net relies on the kernel's standard TCP/IP stack, so traffic is processed by the CPU instead of bypassing it as RDMA does.

1. Establish Connection: Connect the nodes directly using a certified Thunderbolt 4 or USB4 cable. Verify the link is active:

2. Network Configuration (Head - Node 1): Configure a persistent connection using nmcli with a static IP and Jumbo Frames (reduces CPU overhead). Note: Jumbo Frames may be unsupported on some Thunderbolt host controllers.

4. Firewall Rules: To ensure Ray and NCCL can communicate freely over this link:
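Steps 1 and 2 above can be sketched as follows. All concrete values are assumptions for illustration: the connection name tb-cluster, the 10.0.0.1/30 address, and an MTU of 9000. link_is_up is a hypothetical helper that reads the interface's operstate from sysfs.

```shell
# link_is_up: succeed if interface $1 reports operstate "up".
# $2 optionally overrides the sysfs root (default /sys/class/net).
link_is_up() {
    state_file="${2:-/sys/class/net}/$1/operstate"
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "up" ]
}

# Assumed NetworkManager commands for the head node (values are hypothetical):
#   sudo nmcli connection add type ethernet ifname thunderbolt0 con-name tb-cluster \
#        ipv4.method manual ipv4.addresses 10.0.0.1/30 802-3-ethernet.mtu 9000
#   sudo nmcli connection up tb-cluster
```

Running `link_is_up thunderbolt0 || echo "Thunderbolt link is down"` after plugging in the cable is a quick first check. If the controller rejects the larger MTU, drop the `802-3-ethernet.mtu` setting and keep the default.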

Our cluster scripts dynamically detect the network interface based on the provided IPs. There is no need to manually export environment variables!

I have added Thunderbolt support to the compare_eth_vs_rdma.sh script. Run it from inside the toolbox to see the latency and bandwidth of your Thunderbolt link compared to your other network interfaces.

You can use the -t flag to ONLY benchmark the Thunderbolt connection (or -e, -r, -i for the others):
