1.6 Billion Tokens a Week on 120 Watts
Nine agents ran around the clock this week. Most of them produced garbage on Monday. By Friday, after four debug cycles, the architecture was unrecognizable.
Getting an agent to complete a single task takes an afternoon. Getting nine of them to run reliably for 16 hours straight takes weeks of iteration. We’re in those weeks right now.
The Debug Cycle
We run agents for 8 to 16 hours, then review everything with Opus 4.6. Find the failure patterns. Fix prompt structures, context management, tool use. Relaunch. We ran four or five of these cycles this week, each one producing real architectural improvements.
Most of the failures that matter don’t show up in short runs. They surface at hour eight, hour twelve, once context windows fill and the work gets complex enough to stress what you built. Short demos hide these problems. Long runs expose them. You need both the length and the volume. Length to find failures, volume to confirm your fixes actually hold.
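A minimal sketch of what that review pass can look like, assuming each run's transcripts land on disk as JSONL. Everything here is illustrative: the paths are hypothetical and `review_transcript` is a stub for whatever reviewer model you actually call.

```python
import json
from pathlib import Path

def load_transcripts(run_dir: str):
    """Yield (agent_name, events) for each JSONL transcript in the run directory."""
    for path in sorted(Path(run_dir).glob("*.jsonl")):
        events = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
        yield path.stem, events

def review_transcript(agent_name: str, events: list) -> dict:
    """Stub for the reviewer call: hand the full transcript to a stronger model
    and ask for failure patterns (looping, dropped context, bad tool calls)."""
    return {"agent": agent_name, "failures": [], "notes": "stub"}

def run_review(run_dir: str) -> list:
    """One review report per agent for a single long run."""
    return [review_transcript(name, events) for name, events in load_transcripts(run_dir)]

if __name__ == "__main__":
    for report in run_review("runs/latest"):  # hypothetical directory layout
        print(report["agent"], report["notes"])
```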
Agents are non-deterministic. Give one the same input twice and you get different execution paths, different output quality. You can’t write a test suite and call it done. You have to watch them work, over hours, across different inputs, and build an understanding of where they’re solid and where they fall apart. That understanding only comes from runtime.
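You can see the non-determinism directly by replaying the same task a few times and counting distinct execution paths. This is a sketch, not our harness; `run_agent` is a hypothetical entry point that returns the tool-call sequence and the final output.

```python
from collections import Counter

def run_agent(task: str) -> tuple[tuple[str, ...], str]:
    """Hypothetical entry point: returns (tool_call_sequence, final_output)."""
    raise NotImplementedError

def sample_paths(task: str, n: int = 5):
    """Run the same task n times and tally distinct execution paths.
    With a non-deterministic agent you usually get more than one."""
    paths, outputs = Counter(), []
    for _ in range(n):
        tool_calls, output = run_agent(task)
        paths[tool_calls] += 1
        outputs.append(output)
    return paths, outputs

# paths, outputs = sample_paths("summarize yesterday's scrape")
# print(len(paths), "distinct paths across", sum(paths.values()), "runs")
```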
The Stack
We wrote our own agent runtime, replacing a 52MB npm framework with Pinecone: 800 lines of Python that does exactly what we need. We were using maybe 5% of that framework and spending more time debugging its abstractions than our own agents. Simpler stack, cleaner debugging, no attack surface from public skill registries nobody's auditing.
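For a sense of why 800 lines is enough, here's a minimal sketch of the core agent loop. None of this is the actual Pinecone code; `call_model` and the tool table are stand-ins for whatever client and tools you wire in.

```python
import json

def call_model(messages: list[dict]) -> dict:
    """Stand-in for the model client (e.g. a local OpenAI-compatible endpoint).
    Returns either {'content': str} or {'tool': name, 'args': {...}}."""
    raise NotImplementedError

TOOLS = {
    "read_file": lambda args: open(args["path"]).read(),
    "write_file": lambda args: open(args["path"], "w").write(args["text"]),
}

def agent_loop(task: str, max_steps: int = 50) -> str:
    """Model decides; runtime executes tools and feeds results back until done."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        if "tool" in reply:
            result = TOOLS[reply["tool"]](reply["args"])
            messages.append({"role": "tool", "content": json.dumps({"result": str(result)})})
        else:
            return reply["content"]
    return "step budget exhausted"
```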
Market Radar runs on it, scraping 22 subreddits, 24 YouTube channels, and 16 RSS feeds around the clock. Nine agents across two product tracks, all on local hardware. Two units drawing 120 watts, running MiniMax M2.5, an open model that beats Claude Sonnet on coding benchmarks and handles tool calling better than Opus. A year ago, open models weren’t close. Now they’re ahead on the capabilities that matter most for agents.
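The scraping side is mostly plumbing: a config of sources, a poll loop, and a hand-off into the agents. A sketch under obvious assumptions; the source names and fetchers below are placeholders, not our actual list.

```python
import time

# Placeholder sources; the real config covers 22 subreddits, 24 channels, 16 feeds.
SOURCES = {
    "reddit":  ["r/example_one", "r/example_two"],
    "youtube": ["channel_a", "channel_b"],
    "rss":     ["https://example.com/feed.xml"],
}

def fetch(kind: str, source: str) -> list[dict]:
    """Hypothetical fetcher: return new items for one source since the last poll."""
    raise NotImplementedError

def dispatch_to_agent(kind: str, item: dict) -> None:
    """Hypothetical hand-off into the agent queue."""
    print(kind, item.get("title", ""))

def poll_forever(interval_s: int = 900):
    """Round-robin every source, hand anything new to the agent that owns it."""
    while True:
        for kind, sources in SOURCES.items():
            for source in sources:
                for item in fetch(kind, source):
                    dispatch_to_agent(kind, item)
        time.sleep(interval_s)
```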
1.6 billion tokens this week. About $400 at API rates. We paid for electricity.
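Backing those numbers out, using nothing beyond the figures above:

```python
tokens = 1.6e9               # tokens processed this week
api_cost_usd = 400           # rough weekly cost at API rates
watts, hours = 120, 24 * 7   # two units, around the clock

print(f"implied blended API rate: ${api_cost_usd / (tokens / 1e6):.2f} per million tokens")
print(f"electricity used: {watts * hours / 1000:.1f} kWh for the week")
```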
Why Local Changes How You Build
When debug cycles cost electricity instead of API calls, you build differently. You stop rationing runs. You stop asking whether an experiment is worth the tokens. You just run it.
We run speculative experiments overnight. We try multi-agent workflows that burn hundreds of thousands of tokens before we know if they’re going anywhere. We let agents attempt things we’re not even sure are possible, because the hardware is sitting there drawing watts whether it’s working or idle.
Over weeks of this, you develop a feel for what agents can handle that you can't get any other way. You learn which workflows genuinely automate and which ones you assumed would but don't. Some tasks that agents struggle with in two-hour tests turn out fine over eight hours, once they've built up enough context. Others that look easy in demos fall apart completely at scale. All of this comes from runtime.
None of our agents are productive yet. But the improvement curve is steep. They come out of each debug cycle measurably better, and we're running four or five of those cycles a week on hardware that makes them free. We're sitting at about 60% capacity on this cluster. When the agents are ready, there's room for 50.
Buy hardware. Start building.