January 3, 2026

How We Cut LLM Costs 75% With a 2-Tier Architecture

Cheap models filter. Expensive models generate.

We were burning through Claude API credits too fast.

Our social media tool analyzes tweets and generates replies. Every tweet went through Sonnet with extended thinking—deep reasoning, tool use, the works. Great quality. Expensive.

The math didn’t work. 20 tweets per run, ~$0.20 each run, multiple runs per day. Most tweets got rejected anyway—off-topic, low-quality, not worth engaging. We were paying for Sonnet to think deeply about tweets it would ultimately skip.

The Fix: Two Tiers

Simple idea: use a cheap model to filter, expensive model only for winners.

Before:
  20 tweets → Sonnet ($$) → 3 replies

After:
  20 tweets → Haiku ($) → 5 candidates → Sonnet ($$) → 3 replies

Tier 1: Haiku 4.5 - Fast, cheap ($0.80/M tokens). One job: REPLY or SKIP. No extended thinking, no tools, just a quick judgment call.

Tier 2: Sonnet 4.5 - The full pipeline. Extended thinking, web search tools, careful reasoning. Only runs on tweets that passed the filter.
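
In code, the two tiers are just two differently configured API calls. Here's a simplified sketch using the Anthropic Python SDK; the model aliases, thinking budget, and web-search tool spec follow the public API and are illustrative, not our exact production config:

import anthropic

client = anthropic.Anthropic()

filter_prompt = "..."  # the REPLY/SKIP prompt shown later in this post
reply_prompt = "..."   # the full reply-generation prompt

# Tier 1: one cheap, fast judgment call. No thinking, no tools.
filter_msg = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=100,
    messages=[{"role": "user", "content": filter_prompt}],
)

# Tier 2: the full pipeline, only for tweets that survived the filter.
reply_msg = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4000},
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{"role": "user", "content": reply_prompt}],
)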

The Numbers

Haiku filtering 20 tweets: ~$0.01

Sonnet processing 5 filtered tweets: ~$0.04

Total: ~$0.05 per run vs ~$0.20 before. 75% reduction.

And the quality didn’t drop. Haiku is good enough to spot obvious skips—off-topic content, low-quality RTs, private conversations. It catches ~75% of the junk before Sonnet ever sees it.
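
If you want to sanity-check the arithmetic or estimate your own break-even, the whole cost model fits in a few lines. The per-item costs below are just our run figures divided out; swap in your own:

def two_tier_cost(n_items, pass_rate, filter_cost_per_item, generate_cost_per_item):
    """Per-run cost: filter everything cheaply, generate only for what passes."""
    return n_items * filter_cost_per_item + n_items * pass_rate * generate_cost_per_item

old_cost = 20 * 0.010                              # ~$0.20: Sonnet on all 20 tweets
new_cost = two_tier_cost(20, 0.25, 0.0005, 0.008)  # ~$0.05: Haiku on 20, Sonnet on 5
print(f"savings: {1 - new_cost / old_cost:.0%}")   # ~75%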

Implementation

The filter is dead simple:

class TweetFilter:
    def __init__(self, haiku_client):
        # Any thin wrapper around the Haiku API works here: no tools, no thinking.
        self.haiku = haiku_client

    def filter_tweet(self, tweet) -> dict:
        prompt = f"""Decide if this tweet is worth replying to.

Tweet by @{tweet['author']}: {tweet['content']}

Reply when you can add genuine value. Skip when off-topic or nothing to add.
Respond with ONLY: REPLY: [reason] or SKIP: [reason]"""

        # One cheap call, one structured decision (parse_decision is sketched below).
        response = self.haiku.invoke(prompt)
        return parse_decision(response)

No fancy logic. Just ask the cheap model to make a quick call. The prompt includes our guidelines—what topics we care about, what makes a reply valuable.
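
The parsing side is just as plain. A minimal parse_decision could look like this; the skip-by-default handling of unparseable output is one possible choice, not a requirement:

def parse_decision(response: str) -> dict:
    """Turn 'REPLY: [reason]' / 'SKIP: [reason]' into a structured decision."""
    text = response.strip()
    # Anything Haiku didn't format as REPLY falls through to skip; flip the
    # default if you'd rather err toward replying on unparseable output.
    action = "reply" if text.upper().startswith("REPLY") else "skip"
    reason = text.split(":", 1)[1].strip() if ":" in text else ""
    return {"action": action, "reason": reason}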

The expensive generator only sees tweets that passed:

# Tier 1: Fast filter
filtered = tweet_filter.batch_filter(tweets, max_replies=10)

# Tier 2: Deep generation (only on filtered tweets)
results = reply_generator.batch_generate(filtered, max_replies=5)
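
And batch_filter is nothing more than a loop over filter_tweet with a cap. Roughly, as a method on the same TweetFilter class (the cap and the extra filter_reason field are illustrative):

    def batch_filter(self, tweets: list[dict], max_replies: int = 10) -> list[dict]:
        """Run the cheap filter over every tweet; keep at most max_replies candidates."""
        keepers = []
        for tweet in tweets:
            decision = self.filter_tweet(tweet)
            if decision["action"] == "reply":
                keepers.append({**tweet, "filter_reason": decision["reason"]})
            if len(keepers) >= max_replies:
                break
        return keepers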

Why This Works

Filtering is easier than generating. Deciding “is this worth engaging?” is simpler than “what should we say?” Haiku handles the easy job, Sonnet handles the hard one.

Most content isn’t worth processing. In any feed, maybe 20% of posts warrant a response. Paying premium prices to analyze the other 80% is waste.

The tiers can have different capabilities. Our Sonnet tier has web search, extended thinking, access to our blog content. Haiku just needs the basic criteria. Match the model to the task.

The Pattern Generalizes

This isn’t specific to social media. Any pipeline where you’re processing items and most get filtered:

  • Document processing: Haiku classifies, Sonnet extracts
  • Support tickets: Haiku routes, Sonnet responds
  • Content moderation: Haiku flags, Sonnet reviews edge cases
  • Lead scoring: Haiku qualifies, Sonnet personalizes outreach

The principle: don’t pay for deep reasoning on items that don’t need it.
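
The shape of the code is the same every time: a cheap classify step in front of an expensive act step. In generic terms (all names here are placeholders):

def two_tier_pipeline(items, cheap_filter, expensive_process, max_keep=None):
    """Cheap model decides what deserves attention; expensive model does the work."""
    candidates = [item for item in items if cheap_filter(item)]
    if max_keep is not None:
        candidates = candidates[:max_keep]
    return [expensive_process(item) for item in candidates]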

Tradeoffs

Latency increases slightly. A tweet that earns a reply now goes through two sequential API calls instead of one. For us that's not a problem; this runs async on a schedule.

Filter quality matters. If Haiku incorrectly skips good tweets, Sonnet never sees them. We tuned the filter prompt to err toward REPLY when uncertain.

More code to maintain. Two models, two prompts, two sets of logic. Worth it for the cost savings, but it’s not free complexity.

Try It

If you’re running any LLM pipeline that processes batches and filters most of them:

  1. Measure what percentage you're filtering out (tiny sketch after this list)
  2. If it’s >50%, a cheap pre-filter probably pays for itself
  3. Start simple—just a KEEP/SKIP decision
  4. Tune the filter prompt based on what slips through

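Step 1 barely needs code at all. In this sketch, load_recent_decisions is a hypothetical placeholder for however you log your pipeline's outcomes:

decisions = load_recent_decisions()                 # hypothetical helper: your run logs
skip_rate = sum(d == "skip" for d in decisions) / len(decisions)
print(f"{skip_rate:.0%} of items never needed the expensive model")
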
The 10x cost difference between model tiers is real. Use it.
