February 4, 2026

First Light on the GX10 Cluster

Getting a 139B model running across two desktop AI boxes.

The GX10 units arrived last week. Two compact boxes, 128GB unified memory each, sitting on my desk ready to run a 139 billion parameter model. I figured I’d have it running in an afternoon.

It took three days.

Day one was just getting each device set up individually—Ubuntu installed, drivers configured, Docker running. I tested each unit with FLUX 2 to make sure the hardware actually worked. About 2-3 minutes per image at full quality—I let it run overnight and woke up to 300 images. Good sign the hardware works, but image generation isn’t why I bought these.

Days two and three were Claude Code grinding through configurations, failed downloads, CUDA version mismatches, and container debugging. The better part of a weekend gone.

The Plan

In my last post I outlined the Moltbot hardware stack: two ASUS Ascent GX10 units daisy-chained for 256GB of unified memory, running MiniMax M2.1 as the reasoning brain. The theory was simple—vLLM supports distributed inference, Ray handles the cluster coordination, the model fits in memory. Should just work.

The GX10 is interesting hardware. It’s basically NVIDIA’s answer to “what if we put a Blackwell chip in a box you can put on a shelf?” Same GB10 silicon as the DGX Spark, 128GB unified memory where the GPU shares the system RAM. No separate VRAM to worry about. Four ConnectX-7 ports for 200G networking. It’s datacenter hardware shrunk down to something that fits next to your router.

What Actually Happened

I started with NVIDIA’s official vLLM container. Seemed like the obvious choice—it’s their hardware, their container, should be optimized. Pulled the image, launched it, pointed it at MiniMax M2.1.

KeyError: 'MiniMaxM2ForCausalLM'

The container ships with vLLM 0.11.0. MiniMax support landed in 0.12. The official NVIDIA container was three versions behind.

No problem, I’ll just upgrade vLLM inside the container. Claude Code started working through the options.

Tried the nightly wheel—failed because the wheel was built for CUDA 12 and the container runs CUDA 13. Tried the CUDA 13 wheel—failed because PyTorch in the container is version 2.10.0a0 and the wheel expected a different ABI. Tried building from source—failed because the container doesn’t include the CUDA headers needed for compilation.

Three different approaches, three different failures. Each one took an hour or more to fully play out—download the wheel, watch it install, hit the error, debug, try the next thing. Hours of watching terminals scroll.
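If you're retracing these steps, it's worth dumping the container's actual CUDA, PyTorch, and vLLM versions before attempting any upgrade. A minimal check, with <image> standing in for whatever tag you pulled:

docker run --rm --gpus all <image> \
  python3 -c "import torch, vllm; print(torch.__version__, torch.version.cuda, vllm.__version__)"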

The Fix That Worked

Community Docker images. Someone named gogamza built an image specifically for GB10 hardware with a recent vLLM version. Pulled it, launched it, MiniMax loaded immediately.

INFO: Loading model 'MiniMaxM2ForCausalLM'

That’s the thing about cutting-edge hardware. The official containers lag behind. Community builds often have better support because someone actually needed to solve the problem.
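It's also worth a sanity check before committing to a multi-hour weight download: ask vLLM inside the image whether it even recognizes the architecture. A rough sketch (the registry API has moved around between vLLM versions, and the PYTHONPATH quirk is explained further down):

docker run --rm -e PYTHONPATH=/workspace/vllm gogamza/unsloth-vllm-gb10:latest \
  python3 -c "from vllm import ModelRegistry; print('MiniMaxM2ForCausalLM' in ModelRegistry.get_supported_archs())"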

The Model Download Saga

With vLLM sorted, I needed the actual model weights. MiniMax M2.1 is 131GB in BF16 format. I started the download.

Six hours later, I realized I’d made a mistake.

I’d been confused about quantization formats. Ollama uses Q5 GGUF files. vLLM uses safetensors with FP16, FP8, or INT4 quantization. Different ecosystems, different file formats. I’d started downloading a version that wasn’t going to work with my setup.
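A cheap way to avoid repeating that mistake is to list what a repo actually ships before starting the download: safetensors shards are what vLLM wants, while .gguf files belong to the llama.cpp/Ollama ecosystem. A quick check with the Hugging Face hub client, with <repo_id> as a placeholder:

python3 -c "from huggingface_hub import list_repo_files; print(list_repo_files('<repo_id>'))"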

Cancelled the download. Started over with the REAP variant from Cerebras.

REAP—Router-weighted Expert Activation Pruning—is a clever optimization. MiniMax M2.1 is a mixture-of-experts model, meaning only a fraction of the parameters activate for any given token. REAP prunes 40% of the expert weights that rarely get used, shrinking the model while keeping nearly identical performance. HumanEval drops from 94.5 to 93.9. Barely measurable, but 40% smaller.

The REAP model runs at full BF16 precision. No quantization compromises. I’m curious whether FP8 quantization would boost throughput—benchmarks suggest 30-45% improvement—but without the 200G cable installed, network bandwidth is the bottleneck anyway. No point optimizing compute when I’m waiting on packets. That test comes later.

Another 6-8 hours of downloading. Both nodes needed their own copy of the weights.
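For reference, pre-fetching on each node can be as simple as pulling the weights into the same Hugging Face cache the containers mount. A sketch, run on both nodes:

pip install -U "huggingface_hub[cli]"
huggingface-cli download cerebras/MiniMax-M2.1-REAP-139B-A10B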

The Next Three Problems

Getting the model to load was step one. Getting it to actually run across both nodes took three more fixes.

Problem: PYTHONPATH

The community image had vLLM installed in a non-standard location. Python found it as a namespace package instead of the actual module.

ImportError: cannot import name 'SamplingParams' from 'vllm'

Fix: set PYTHONPATH=/workspace/vllm as an environment variable. Took an hour to figure out.
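The symptom is easy to check for: a namespace package has no __file__, while a real install points at a module path. If the first line below prints None, the second is the fix:

python3 -c "import vllm; print(getattr(vllm, '__file__', None))"
export PYTHONPATH=/workspace/vllm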

Problem: Ray OOM During Warmup

vLLM uses torch.compile to optimize the model. On unified memory systems, this warmup phase temporarily spikes memory usage above what you’d expect from the model size alone. Ray has an aggressive out-of-memory killer that triggers at 95% memory usage.

ray.exceptions.OutOfMemoryError: Memory on node was 113.77GB / 119.64GB

The model was fine. The temporary compilation overhead wasn’t. Fix: disable Ray’s memory monitor with RAY_memory_monitor_refresh_ms=0 and drop GPU memory utilization from 90% to 85%.
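Because there's no separate VRAM, the spike shows up as plain system RAM pressure, so the simplest way to see it coming is to watch memory while the server warms up:

watch -n 2 free -h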

Problem: Network Interface

Ray and NCCL need to know which network interface to use for inter-node communication. They default to localhost, which doesn’t help when you’re trying to connect two machines.

Fix: explicitly set GLOO_SOCKET_IFNAME=enP7s7 and NCCL_SOCKET_IFNAME=enP7s7 to point at the actual ethernet interface.
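Finding the right interface name is a one-liner, and both variables then go into the container environment (shown in the full config below):

ip -br addr show
export GLOO_SOCKET_IFNAME=enP7s7
export NCCL_SOCKET_IFNAME=enP7s7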

The Working Configuration

For anyone else trying this, here’s what actually works:

Docker image: gogamza/unsloth-vllm-gb10:latest

Ray head node:

docker run -d --name ray-head \
  --gpus all --net=host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_HOST_IP=<node1_ip> \
  -e GLOO_SOCKET_IFNAME=<your_interface> \
  -e NCCL_SOCKET_IFNAME=<your_interface> \
  -e PYTHONPATH=/workspace/vllm \
  -e RAY_memory_monitor_refresh_ms=0 \
  gogamza/unsloth-vllm-gb10:latest \
  ray start --head --port=6379 --num-gpus=1 --block

Ray worker node:

docker run -d --name ray-worker \
  --gpus all --net=host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_HOST_IP=<node2_ip> \
  -e GLOO_SOCKET_IFNAME=<your_interface> \
  -e NCCL_SOCKET_IFNAME=<your_interface> \
  -e PYTHONPATH=/workspace/vllm \
  -e RAY_memory_monitor_refresh_ms=0 \
  gogamza/unsloth-vllm-gb10:latest \
  ray start --address=<node1_ip>:6379 --num-gpus=1 --block

vLLM server:

vllm serve cerebras/MiniMax-M2.1-REAP-139B-A10B \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85

Total startup time once everything is configured: about 12 minutes. Roughly six minutes to load model weights, two minutes for torch.compile, two minutes for CUDA graph capture, and the rest in startup overhead.
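Once it's up, a quick smoke test against the OpenAI-compatible endpoint (vLLM listens on port 8000 by default; same placeholder IP as above):

curl http://<node1_ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "cerebras/MiniMax-M2.1-REAP-139B-A10B", "messages": [{"role": "user", "content": "Say hello from the cluster."}]}'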

Initial Benchmarks

With the cluster running, I threw 17 concurrent requests at it—the maximum that fits in the KV cache at 32k context length.

Metric                    Value
Concurrent requests       17
Aggregate throughput      42 tokens/sec
Per-request throughput    2.5 tokens/sec
Model parameters          139B total, 10B active
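Nothing exotic drove the test, just 17 requests in flight at once. A rough shell stand-in for that kind of load (a sketch, not a benchmark harness):

seq 17 | xargs -P 17 -I{} curl -s http://<node1_ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "cerebras/MiniMax-M2.1-REAP-139B-A10B", "prompt": "Describe the GX10 cluster.", "max_tokens": 512}' \
  -o /dev/null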

For comparison, OpenRouter serves MiniMax M2.1 at 30-40 tokens/sec for a single request. I’m getting 42 tokens/sec aggregate across 17 parallel requests. The per-request speed is slow, but the parallel throughput is real.

The bottleneck is obvious: I’m running tensor parallelism over 10G Ethernet. Every token generation requires synchronization across both nodes—124 times per token, once per transformer layer. At 10G speeds (1.25 GB/s), that synchronization dominates everything else.

Why This Matters

The point of this cluster isn’t raw speed. It’s concurrency without token costs.

I want to run agent swarms—multiple AI agents working in parallel on different parts of a problem. Research agents, coding agents, review agents, all coordinating through a local orchestrator. At API prices, that gets expensive fast. Run 17 agents for an hour generating 10k tokens each and you’re looking at real money.

With local inference, the tokens are free. The $6k hardware cost is a one-time payment. After that, I can let agents think as long as they need to, explore dead ends, iterate on solutions—no meter running. The economics of local compute enable workflows that would be wasteful at API prices.

MiniMax M2.1 is good enough for most agent tasks. If something better comes along that’s optimized for high parallelism, I can swap it in. The hardware is the platform. The models are interchangeable.

The Cable

The GX10 has four ConnectX-7 ports that support 200G ethernet. I ordered a 200G QSFP56 cable to connect them directly. The cable is arriving tomorrow.

The math suggests 200G should give me 3-4x the current throughput. Instead of 42 tokens/sec aggregate, something like 120-170 tokens/sec. That would put per-request speed at 7-10 tokens/sec even with 17 concurrent users—actually usable for interactive work.

There’s something absurd about $6,000 of hardware sitting on my desk, bottlenecked by a $200 cable that’s still in transit. But that’s the reality of building infrastructure at the edge. The big pieces are easy to buy. The small pieces that connect them take longer to ship.

What I Learned

Community images beat official containers for new hardware. The NVIDIA container was three vLLM versions behind. The community build had exactly what I needed.

Unified memory changes the math. The GX10’s architecture means model weights and KV cache share the same memory pool. Compilation overhead that would be invisible on a system with separate VRAM becomes a problem when everything competes for the same RAM.

Network configuration isn’t optional. Distributed inference requires explicit network interface configuration. The defaults assume localhost.

139B parameters fit on two $3k boxes. The REAP-compressed MiniMax M2.1 runs comfortably with room for 17 concurrent 32k-token contexts. A year ago this would have required renting datacenter GPUs.

Local inference changes the economics. When tokens are free, you can let agents think longer, explore more options, run more iterations. Workflows that would be wasteful at API prices become practical.

The cluster is running. The model is serving requests. Seventeen agents can work in parallel on my desk. Now I’m just waiting on a cable to see how fast they can actually go.
