January 28, 2026

Moltbot Gets a Brain

256GB of local compute for my AI assistant.

Last weekend Apple Mac Minis sold out everywhere. People are hyped about running local AI assistants. I got hyped too—and went further. Instead of a Mac Mini, I built a home research lab.

What I’m Building

Moltbot is an AI assistant that runs 24/7. It connects through WhatsApp and Slack, remembers context across weeks, and actually does things instead of just answering questions.

I’m setting it up to help run my business:

  • Process emails, categorize them, draft replies
  • Pull daily briefings—calendar, tasks, news
  • Debug code: send it an error, get the fix applied
  • Manage Git workflows from my phone—commits, PRs, reviews
  • Run automated monitoring via cron—server status, API health, notifications (sketched below)
  • Work on problems while I sleep, or work alongside me when I’m awake
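
Here's roughly what one of those cron-driven checks could look like. This is a minimal sketch, not Moltbot's real config: the endpoints and the Slack webhook URL are placeholders.

```python
#!/usr/bin/env python3
"""Health-check sketch for a cron-driven monitor.

Run from cron, e.g. every 15 minutes:
    */15 * * * * /usr/bin/python3 /opt/moltbot/health_check.py
"""
import urllib.error
import urllib.request

# Hypothetical services to watch; swap in the real hosts.
CHECKS = {
    "cluster API": "http://gx10-cluster.local:8000/health",
    "laptop LLM": "http://localhost:8080/health",
}
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def notify(message: str) -> None:
    """Post a failure notice to a Slack webhook (or any webhook)."""
    payload = ('{"text": "%s"}' % message).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(req, timeout=10)
    except urllib.error.URLError:
        pass  # don't let the notifier itself crash the check


def main() -> None:
    for name, url in CHECKS.items():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status != 200:
                    notify(f"{name} returned HTTP {resp.status}")
        except urllib.error.URLError as exc:
            notify(f"{name} is unreachable: {exc.reason}")


if __name__ == "__main__":
    main()
```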

The Hardware

Two ASUS Ascent GX10 units—the same chip as the NVIDIA DGX Spark, 128GB of unified memory each. Daisy-chained, they act as a single 256GB pool.

An RTX 5090 laptop running Ubuntu 24.04 is where Moltbot actually lives. That’s my interface—I talk to it through the mic, it responds through speakers. The laptop handles the fast layer while the GX10 cluster handles the heavy thinking.

Around $11.5k total.

What It Runs

On the GX10 cluster (256GB):

  • MiniMax M2.1 (150GB) — The brain. Handles reasoning, planning, and code.
  • Qwen2.5-VL-72B (47GB) — Vision. Can see screenshots and read documents.
  • FLUX 2 Dev (24GB) — Image generation.
  • Working memory (27GB) — 204k tokens of active context.

Everything loaded at once, no swapping.
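The numbers add up: 150 + 47 + 24 + 27 = 248GB, leaving about 8GB of headroom in the 256GB pool.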

On the 5090 laptop (24GB):

  • Qwen3-30B-A3B (18GB) — Fast responses and simple tasks.
  • Whisper (3GB) — Speech-to-text so I can talk to it.
  • TTS (3GB) — Voice output so it can talk back.

The laptop is the frontend. The cluster is the backend. I speak, the laptop transcribes, routes to the cluster for thinking, gets the response, and speaks it back.
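
Sketched in code, that loop looks something like this. It's a minimal sketch, not the real stack: the cluster URL and model name assume an OpenAI-compatible inference server, Whisper runs via the openai-whisper package, and record()/speak() are stand-ins for the actual audio I/O and TTS engine.

```python
"""Voice loop sketch: laptop as frontend, GX10 cluster as backend."""
import requests
import whisper

CLUSTER_URL = "http://gx10-cluster.local:8000/v1/chat/completions"  # placeholder
stt = whisper.load_model("base")  # small model keeps transcription fast


def record() -> str:
    """Stand-in: capture mic audio and return the path to a WAV file."""
    raise NotImplementedError


def speak(text: str) -> None:
    """Stand-in: hand the reply to whichever TTS engine is loaded."""
    raise NotImplementedError


def ask_cluster(prompt: str) -> str:
    """Send the transcribed request to the big model on the cluster."""
    resp = requests.post(
        CLUSTER_URL,
        json={
            "model": "minimax-m2.1",  # whatever name the server registers
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    while True:
        wav_path = record()                      # 1. I speak
        text = stt.transcribe(wav_path)["text"]  # 2. laptop transcribes
        reply = ask_cluster(text)                # 3. cluster does the thinking
        speak(reply)                             # 4. laptop talks back
```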

The Tiered Brain

Not every task needs the same model. The system routes work based on complexity:

Qwen3-30B on the laptop handles quick responses, routing decisions, simple logic. It’s fast and keeps the interaction feeling snappy.

MiniMax M2.1 on the cluster handles the real thinking—planning, coding, reasoning through problems. It can run 20+ tasks in parallel. This is the default brain for most work.

Opus 4.5 via Claude Code handles the hard stuff—architecture decisions, final code review, debugging when MiniMax gets stuck, research deep dives. It’s the specialist you call in when the problem actually needs it.
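
A rough sketch of that routing logic. The keyword heuristic here is mine, standing in for the real classifier—in practice the fast laptop model makes the routing call itself.

```python
"""Tiered-brain routing sketch with an illustrative keyword heuristic."""

LOCAL_FAST = "qwen3-30b-a3b"    # laptop: quick replies, routing, simple logic
LOCAL_HEAVY = "minimax-m2.1"    # cluster: planning, coding, the default brain
ESCALATION = "claude-opus-4.5"  # Claude Code: architecture, reviews, hard bugs


def pick_tier(task: str) -> str:
    """Crude keyword stand-in for the routing decision."""
    text = task.lower()
    if any(m in text for m in ("architecture", "final review", "still stuck", "deep dive")):
        return ESCALATION       # rare, targeted calls only
    if any(m in text for m in ("plan", "implement", "debug", "refactor", "research")):
        return LOCAL_HEAVY      # local, fine to run around the clock
    return LOCAL_FAST           # keep the interaction snappy


if __name__ == "__main__":
    for task in (
        "What's on my calendar today?",
        "Implement the retry logic and debug the failing test",
        "Do a final review of the new service architecture",
    ):
        print(f"{pick_tier(task):>18}  <-  {task}")
```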

This stays within ToS because Opus isn’t the 24/7 brain—MiniMax is. A lot of people got banned trying to run Claude Code as their always-on assistant. That’s not how I’m using it. MiniMax handles the constant work locally. Opus only gets called for specific tasks that actually need it, with pauses between calls, during normal hours. The local stack does the grind. Opus does precision strikes.

The Curve

Open source caught up this year. Llama 3.3 70B runs on a MacBook Pro. DeepSeek-R1 matches GPT-5 on coding benchmarks. Qwen3-Coder became the most downloaded AI system in January.

This is exponential growth. Each improvement builds on the last—community contributors exploring different architectures, new techniques getting implemented within weeks of discovery. What needed a datacenter 18 months ago runs on hardware you can buy.

The infrastructure I’m building now will run better models every six months. I’m not buying for today’s capabilities. I’m buying for the trajectory.

Memory prices are up 40% year over year. The window to build this affordably is closing. But the tech is ready—the models are capable, the agentic systems are mature enough. Good time for early adopters. I’m building now.
