Layer 4: Model Routing

How to Optimize Model Routing

Most teams pick one model and use it for everything. Opus for classification. Opus for generating test data. Opus for voting on options. Not every task needs your most expensive model. Here's a repeatable process to route the right model to the right task.

Step 1: Define your success signal

Model selection optimization has two success signals to measure:

Two layers of success
1. Quality parity: does a cheaper model produce the same result?
   Run the same task on multiple models. Compare outputs.
   If a $0.25/MTok model matches a $15/MTok model on this task,
   you are overpaying by 60x.

2. Cost per quality point: what are you paying for each unit of quality?
   Some tasks need the expensive model. The question is which ones.
   Map every task type to the cheapest model that meets the quality bar.

The goal is not to use the cheapest model everywhere. It is to stop using the most expensive model on tasks that do not need it. Most workflows spend 80% of tokens on tasks that run just as well on a smaller model.
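As a sketch, the two signals reduce to a pair of small checks. The function names, the 0.3-point tolerance, and the example numbers are illustrative assumptions, not fixed rules:

```python
# Sketch of the two success signals from Step 1. The tolerance and
# example scores are assumptions for illustration.

def quality_parity(cheap_score: float, expensive_score: float,
                   tolerance: float = 0.3) -> bool:
    """True if the cheaper model lands within `tolerance` quality points."""
    return expensive_score - cheap_score <= tolerance

def cost_per_quality_point(cost_per_call: float, quality: float) -> float:
    """Dollars paid per unit of quality on this task."""
    return cost_per_call / quality

# Example: a classification task scored on two tiers.
print(quality_parity(cheap_score=4.3, expensive_score=4.6))  # parity holds
print(cost_per_quality_point(0.012, 4.6))   # expensive tier, $/quality point
print(cost_per_quality_point(0.0002, 4.3))  # cheap tier, $/quality point
```

If parity holds, the cost-per-quality-point comparison tells you exactly how much the expensive tier is overcharging you for that task.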

Step 2: Generate test cases

Catalog every distinct task type in your AI workflow. Group them by complexity. Pull representative examples of each.

Example: AI agent with mixed task types
Low complexity (likely Haiku-capable):
  Classification: "Is this a billing question or a feature request?"
  Tagging: "Extract the product name and issue type from this ticket"
  Routing: "Which team should handle this?"
  Validation: "Does this JSON match the expected schema?"

Medium complexity (likely Sonnet-capable):
  Summarization: "Summarize this 3-page contract"
  Research: "Find the relevant docs for this API error"
  Sub-agent votes: "Which of these 3 options is best?"
  Data extraction: "Pull all dates and dollar amounts from this email"

High complexity (may need Opus):
  Multi-step reasoning: "Debug this failing pipeline"
  Code generation: "Write a migration script for this schema change"
  Strategy: "Propose an architecture for this new feature"
  Ambiguous judgment: "Should we approve this edge-case refund?"

Aim for 10 to 15 representative examples per complexity tier. The medium tier is the most important to test because that is where the biggest savings live: tasks that feel like they need Opus but actually run fine on Sonnet.
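A minimal sketch of the catalog step, grouping test cases by tier so each tier can be benchmarked separately. The task names and inputs mirror the examples above; the dictionary shape is an assumption, not a required schema:

```python
# Sketch: catalog task types by complexity tier. In practice you would
# pull real production examples; these mirror the list above.
from collections import defaultdict

test_cases = [
    {"tier": "low",    "task": "classification",  "input": "Is this a billing question or a feature request?"},
    {"tier": "low",    "task": "tagging",         "input": "Extract the product name and issue type from this ticket"},
    {"tier": "medium", "task": "summarization",   "input": "Summarize this 3-page contract"},
    {"tier": "high",   "task": "code_generation", "input": "Write a migration script for this schema change"},
]

by_tier = defaultdict(list)
for case in test_cases:
    by_tier[case["tier"]].append(case)

for tier, cases in by_tier.items():
    print(tier, len(cases))  # how many examples each tier has so far
```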

Step 3: Benchmark the baseline

Run every test case on all three model tiers. Score the output quality for each. Record the cost per call. This gives you a quality-to-cost ratio for every task type on every model.

Baseline: all tasks on Opus ($15/MTok)
Test cases:      45 (15 per complexity tier)

Quality scores by tier (rubric: correctness, completeness, format):
  Low complexity:     Opus 4.6/5  |  Sonnet 4.5/5  |  Haiku 4.3/5
  Medium complexity:  Opus 4.4/5  |  Sonnet 4.2/5  |  Haiku 3.1/5
  High complexity:    Opus 4.5/5  |  Sonnet 3.8/5  |  Haiku 2.4/5

Cost per call (avg):
  Low complexity:     Opus $0.012  |  Sonnet $0.003  |  Haiku $0.0002
  Medium complexity:  Opus $0.038  |  Sonnet $0.008  |  Haiku $0.0006
  High complexity:    Opus $0.065  |  Sonnet $0.014  |  Haiku $0.0011

Current monthly cost (all Opus): $2,840
Task distribution: 45% low, 35% medium, 20% high

This is your floor. Notice that Haiku scores 4.3/5 on low-complexity tasks where Opus scores 4.6/5. That 0.3-point difference costs 60x more per call. And Sonnet matches Opus within 0.2 points on medium tasks at roughly one-fifth the cost.
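The benchmark tables above can be queried directly to find the cheapest model that clears the quality bar for each tier. This sketch copies the numbers from Step 3; the 4.0/5 threshold and lowercase model keys are assumptions:

```python
# Sketch: quality and cost tables from Step 3, plus a lookup that picks
# the cheapest model above a quality threshold. Numbers copied from the
# benchmark above; the threshold is a policy choice.

quality = {  # tier -> model -> rubric score out of 5
    "low":    {"opus": 4.6, "sonnet": 4.5, "haiku": 4.3},
    "medium": {"opus": 4.4, "sonnet": 4.2, "haiku": 3.1},
    "high":   {"opus": 4.5, "sonnet": 3.8, "haiku": 2.4},
}
cost = {  # tier -> model -> avg $ per call
    "low":    {"opus": 0.012, "sonnet": 0.003, "haiku": 0.0002},
    "medium": {"opus": 0.038, "sonnet": 0.008, "haiku": 0.0006},
    "high":   {"opus": 0.065, "sonnet": 0.014, "haiku": 0.0011},
}

def cheapest_above(tier: str, threshold: float = 4.0) -> str:
    """Cheapest model that clears the quality bar for this tier."""
    candidates = [m for m, q in quality[tier].items() if q >= threshold]
    return min(candidates, key=lambda m: cost[tier][m])

print(cheapest_above("low"))     # haiku
print(cheapest_above("medium"))  # sonnet -- haiku's 3.1 misses the bar
print(cheapest_above("high"))    # opus -- only model above 4.0
```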

Step 4: Generate optimization candidates

Model selection has three optimization patterns. Use agents to propose a routing strategy:

Pattern 1: Task-level routing
Map every task type to the cheapest model that meets
your quality threshold.

Task                    Model      Quality   Cost
Classification          Haiku      4.3/5     $0.0002
Tagging                 Haiku      4.4/5     $0.0003
Summarization           Sonnet     4.2/5     $0.008
Sub-agent votes         Sonnet     4.1/5     $0.003
Code generation         Opus       4.5/5     $0.065
Complex reasoning       Opus       4.5/5     $0.065

Quality threshold: 4.0/5 minimum. Anything above that,
use the cheapest model that clears the bar.
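Pattern 1 reduces to a static lookup table. A hedged sketch, with task keys mirroring the table above; routing unknown task types to the strongest model is a design choice here, not a rule:

```python
# Sketch: Pattern 1 as a static routing table built from the benchmark.
# Task keys mirror the table above; the fallback is an assumption.

ROUTES = {
    "classification":    "haiku",
    "tagging":           "haiku",
    "summarization":     "sonnet",
    "sub_agent_vote":    "sonnet",
    "code_generation":   "opus",
    "complex_reasoning": "opus",
}

def route(task_type: str) -> str:
    # Unbenchmarked task types go to the most capable model until tested.
    return ROUTES.get(task_type, "opus")

print(route("classification"))  # haiku
print(route("new_task"))        # opus (safe default until benchmarked)
```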

Pattern 2: Cascade (try cheap first, escalate if needed)
For tasks where quality varies by input complexity:

1. Try Haiku first
2. Check confidence or output quality
3. If below threshold, retry on Sonnet
4. If still below threshold, escalate to Opus

~75% resolve at Haiku tier
~20% need Sonnet
~5% escalate to Opus

Best for: classification, routing, validation.
The majority of inputs are straightforward.
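The cascade can be sketched as a loop over tiers. `call_model` and `score` here are hypothetical stand-ins for your actual inference call and quality check:

```python
# Sketch: Pattern 2 as a cascade over model tiers. `call_model` and
# `score` are stand-ins -- wire in your real inference and quality check.

TIERS = ["haiku", "sonnet", "opus"]

def cascade(task, call_model, score, threshold=4.0):
    """Try cheap first; escalate only when the output misses the bar."""
    for model in TIERS:
        output = call_model(model, task)
        if score(output) >= threshold or model == TIERS[-1]:
            # Good enough, or nothing stronger left to escalate to.
            return model, output

# Usage with stubbed functions: the score clears the bar, so the task
# resolves at the Haiku tier with no escalation.
model, _ = cascade("route this ticket",
                   call_model=lambda m, t: f"{m}: {t}",
                   score=lambda out: 4.3)
print(model)  # haiku
```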

Pattern 3: Consensus with cheap models
For decisions and evaluations, 10 cheap agents beat 1 expensive one.

10x Sonnet agents with different framings:
  Cost: ~$0.40 total
  Result: consensus filters hallucinations, surfaces edge cases

1x Opus agent:
  Cost: ~$0.50
  Result: single perspective, no error correction

Cheaper AND more reliable. Stochastic variation across
many cheap models is a better strategy than one expensive run.
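The consensus pattern is a majority vote over independent runs. A sketch with a stubbed `ask` function standing in for real model calls under different framings:

```python
# Sketch: Pattern 3 as a majority vote. `ask` stands in for a real model
# call; here it is stubbed to show only the vote mechanics.
from collections import Counter

def consensus(question, framings, ask):
    """Ask the same question under different framings; take the mode."""
    votes = [ask(question, framing) for framing in framings]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / len(votes)

# Stub: 8 of 10 framings pick option B, 2 disagree.
answers = iter(["B"] * 8 + ["A"] * 2)
winner, agreement = consensus("Which option is best?",
                              framings=range(10),
                              ask=lambda q, f: next(answers))
print(winner, agreement)  # B 0.8
```

The agreement ratio doubles as an escalation signal: a low-consensus decision is a candidate for a single expensive run.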

Use the same agent approaches (consensus, debate, or single model) to propose which tasks to downgrade, where to add cascades, and which decisions should use multi-agent consensus instead of a single expensive call.

Step 5: Test candidates against the same baseline

Run the exact same 45 test cases through the proposed routing strategy. Compare quality scores and cost to the all-Opus baseline.

After optimization: same 45 test cases
Test cases:      45

Routing strategy:
  Low complexity:     Haiku (15 tasks)
  Medium complexity:  Sonnet (12 tasks) + Cascade (3 tasks)
  High complexity:    Opus (10 tasks) + Consensus (5 tasks)

Quality scores (post-routing):
  Low complexity:     4.3/5 (was 4.6 on Opus. Delta: -0.3, acceptable.)
  Medium complexity:  4.2/5 (was 4.4 on Opus. Delta: -0.2, acceptable.)
  High complexity:    4.5/5 (unchanged. Opus still handles these.)

Cost per call (avg, post-routing):
  Low complexity:     $0.0002 (was $0.012)
  Medium complexity:  $0.009  (was $0.038)
  High complexity:    $0.052  (was $0.065)

Monthly cost (routed): $486
Previous monthly cost: $2,840

Cost:     -83%  ($2,840 → $486 per month)
Quality:  -0.2 pts avg  (4.5 → 4.3, within threshold)
Latency:  2.8x faster  (smaller models respond faster)

Step 6: Map to business outcomes

Model selection optimization delivers the largest percentage cost reduction of any surface because the price difference between tiers is 5 to 60x. Even a small shift in task routing changes the economics dramatically.

Cost-to-outcome by model tier
Model tier        Tasks     Before       After        Savings/mo
────────────────────────────────────────────────────────────────────
Low (→ Haiku)     45%       $1,278       $18          $1,260
Medium (→ Sonnet) 35%       $994         $284         $710
High (→ Opus)     20%       $568         $184         $384

Total monthly savings: $2,354
Annual savings:        $28,248

Quality impact: -0.2 points average.
No task dropped below the 4.0/5 minimum threshold.
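The savings table follows from simple arithmetic on the monthly cost shares. A sketch reproducing the numbers above from the 45/35/20 split of the $2,840 all-Opus bill and the post-routing costs:

```python
# Sketch: rebuild the savings table from the cost shares above.
# Before-costs are the 45/35/20 split of the $2,840 all-Opus bill;
# after-costs come from the post-routing run.

total_before = 2840
share = {"low": 0.45, "medium": 0.35, "high": 0.20}
after = {"low": 18, "medium": 284, "high": 184}

savings = {tier: round(total_before * share[tier]) - after[tier]
           for tier in share}
print(savings)                     # per-tier savings per month
print(sum(savings.values()))       # monthly total: 2354
print(sum(savings.values()) * 12)  # annual total: 28248
```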

The insight is not that cheaper models are good enough. It is that expensive models are wasted on simple tasks. The quality difference between Opus and Haiku on a classification task is negligible. The cost difference is 60x.

Then do it again

Model capabilities change with every release. A task that needed Opus six months ago might run fine on Sonnet today. New models get added, pricing changes, quality thresholds shift. The loop runs continuously:

1. Define success signal (quality parity + cost per quality point)
2. Catalog task types and pull representative examples
3. Benchmark every task on every model tier
4. Generate routing strategy (task-level, cascade, or consensus)
5. Test candidates: keep if quality stays above threshold at lower cost
6. Map to business outcomes: prioritize by task volume x cost delta
7. Re-benchmark after every major model release

The same discipline applies to any model provider. Anthropic, OpenAI, Google, open-source. Every provider has a tier structure. The question is always the same: which tasks are you overpaying for?

Cost:          -83%  (right model for the right task)
Quality:       maintained  (above threshold on every task)
Tasks routed:  80%  (shifted to cheaper models without quality loss)

Benchmark an AI workflow.

We'll benchmark your AI workflow and show where the biggest gains are in cost, quality, speed, and performance.

Fixed-scope benchmark. You keep everything.