Layer 1: Prompts
How to Optimize Prompts
Most teams use AI to generate an initial system prompt without a process for optimizing it. These are the static instructions your AI runs on every call. Not user messages, but the rules that shape every response. Here's a repeatable process to continuously improve them.
Step 1: Define your success signal
Before touching the prompt, decide what you're measuring. Most prompts have two layers to test:
1. Trigger accuracy: did the right workflow fire? Pass the input to the model with the system prompt and ask: "Would you have handled this? Yes or no." Binary. Cheap. High volume.
2. Output quality: given it triggered, is the result good? You don't need to run the full workflow. Ask: "Given this input and system prompt, how would you classify this?" Compare the answer to what's expected.
Test triggers first. No point evaluating output quality on a workflow that shouldn't have fired. This applies differently depending on what you're testing: routing prompts are almost entirely trigger accuracy, scoped skills care more about output quality, system prompts need both plus constraint adherence.
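The binary trigger check above is cheap to script. A minimal sketch in Python, where `ask_model` is a hypothetical stand-in (a keyword stub so the sketch runs offline) for a real model call that answers "yes" or "no":

```python
# Minimal trigger-accuracy harness. `ask_model` is a placeholder:
# swap in a real model call that answers "yes" or "no".
def ask_model(system_prompt: str, user_input: str) -> str:
    # Keyword stub so the sketch runs without an API key.
    billing_terms = ("charged", "refund", "cancel", "downgrade", "billing")
    return "yes" if any(t in user_input.lower() for t in billing_terms) else "no"

def trigger_accuracy(system_prompt: str, cases) -> float:
    """cases: list of (input_text, should_trigger) pairs."""
    correct = sum(
        (ask_model(system_prompt, text) == "yes") == should_trigger
        for text, should_trigger in cases
    )
    return correct / len(cases)

cases = [
    ("I was charged twice for my subscription", True),
    ("How do I cancel?", True),
    ("What's your pricing?", False),
    ("I'd like to schedule a demo", False),
]
print(trigger_accuracy("You handle billing support.", cases))  # → 1.0
```

Because each check is a single yes/no call, you can run hundreds of them for the cost of one full workflow execution.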
Step 2: Generate test cases
Pull real inputs from production where possible. Include cases that should trigger the workflow, cases that shouldn't, and edge cases that could go either way.
Should trigger:
- "I was charged twice for my subscription"
- "Can you help me downgrade my plan?"
- "My API key stopped working after the update"
- "I need a refund for last month"
- "How do I cancel?"

Should NOT trigger:
- "What's your pricing?"
- "Do you have a Go SDK?"
- "Can I talk to someone about a partnership?"
- "How does your product compare to [competitor]?"
- "I'd like to schedule a demo"

Edge cases:
- "I'm having trouble with billing AND want to see new features"
- "Cancel my subscription and recommend an alternative"
- "Is there a discount if I stay?"
Aim for 30 to 50 test cases. Weight by real-world frequency if you have usage data. A trigger that fires 500 times a day matters more than one that fires twice a month.
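Frequency weighting can be sketched directly. A minimal version, assuming you attach a `daily_frequency` field (a name chosen here for illustration) to each case from usage data:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    text: str
    should_trigger: bool
    daily_frequency: int  # from usage data; use 1 if unknown

def weighted_score(results, cases):
    """results: pass/fail booleans, aligned with cases.
    Each case counts in proportion to how often it occurs in production."""
    total = sum(c.daily_frequency for c in cases)
    passed = sum(c.daily_frequency for c, ok in zip(cases, results) if ok)
    return passed / total

cases = [
    TestCase("I was charged twice", True, daily_frequency=500),
    TestCase("Is there a discount if I stay?", True, daily_frequency=2),
]
# Failing the twice-a-month case barely moves the score;
# failing the 500-a-day case would tank it.
print(round(weighted_score([True, False], cases), 3))  # → 0.996
```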
Step 3: Benchmark the baseline
Run every test case against the current prompt. For trigger accuracy, pass each input and ask the model what it would have done. For output quality, compare the classification against the expected result. Record token count per call.
Test cases: 48

Trigger accuracy:
- Correct fires: 31/36 (86.1%)
- Correct rejects: 7/12 (58.3%) → 5 false positives (fired on sales/partner queries)

Output quality (on correct triggers):
- Correct classification: 27/31 (87.1%)
- Wrong category: 4/31

Combined quality score: 56.3% (27/48 fully correct end-to-end)
Avg tokens/call: 2,840 (prompt) + 380 (completion)
Cost per call: $0.0091
This is your floor. Every optimization is measured against these numbers. No baseline means no proof that changes helped.
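Recording the floor can be a few lines. A sketch of a benchmark harness, where `evaluate` is a placeholder for your real model call, stubbed here so the example runs:

```python
def benchmark(cases, evaluate):
    """Run every case against the current prompt and record the floor.
    `evaluate` is a placeholder for the real model call; it returns
    (passed, tokens_used) for one case."""
    results = [evaluate(case) for case in cases]
    passed = sum(ok for ok, _ in results)
    return {
        "quality": passed / len(cases),
        "avg_tokens": sum(t for _, t in results) / len(cases),
    }

# Stub evaluator so the sketch runs: one failure, flat token cost.
baseline = benchmark(
    ["case-1", "case-2", "case-3", "case-4"],
    lambda case: (case != "case-3", 3220),
)
print(baseline)  # → {'quality': 0.75, 'avg_tokens': 3220.0}
```

Persist this dict alongside the prompt version; it's the number every candidate in Step 5 has to beat.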
Step 4: Generate optimization candidates
This is where most teams guess. Instead, use agents to propose changes systematically. There are three approaches; use whichever fits:
Consensus: spawn 10 agents with different analytical framings (risk-averse, contrarian, first-principles, etc.). Each independently proposes optimizations. Take the changes most agents agree on; those are safe bets. Flag the splits for human judgment.

Debate: spawn 3 agents into a shared conversation:
- Architect: thinks in systems
- Pragmatist: optimizes for shipping
- Critic: finds edge cases

Three rounds of debate. They argue, concede, converge. Synthesize the result.

Single model: pass the prompt + baseline results to one model. "Here's my system prompt. Here's where it's failing: 5 false positives on sales queries, 4 misclassifications on billing complaints. Propose specific changes to fix these failures without breaking what's already working."
Each approach generates candidate rewrites. The next step decides which ones ship.
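The majority-vote step of the consensus approach can be sketched. The 6-of-10 quorum below is an illustrative threshold, not a rule from the text:

```python
from collections import Counter

def consensus(proposals, quorum):
    """proposals: one set of proposed changes per agent.
    Changes that clear the quorum ship; split votes go to a human."""
    votes = Counter(change for p in proposals for change in p)
    ship = {c for c, n in votes.items() if n >= quorum}
    flag = {c for c, n in votes.items() if 0 < n < quorum}
    return ship, flag

# 7 agents propose both changes, 3 propose only one.
proposals = [{"tighten trigger", "drop examples"}] * 7 + [{"drop examples"}] * 3
ship, flag = consensus(proposals, quorum=6)
print(sorted(ship))  # → ['drop examples', 'tighten trigger']
print(sorted(flag))  # → []
```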
Step 5: Test candidates against the same baseline
Run the exact same test cases against each candidate. Compare to the baseline. Keep the change if the quality score goes up. Revert if it drops. One change at a time so you know what worked.
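The keep-or-revert decision is mechanical enough to script. A sketch, where `run_suite` stands in for re-running the full test suite against one candidate prompt:

```python
def pick_winner(baseline_score, candidates, run_suite):
    """Run the SAME test suite against each candidate prompt.
    Ship the best candidate only if it beats the baseline; else revert."""
    best_prompt, best_score = None, baseline_score
    for prompt in candidates:
        score = run_suite(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score  # None means keep the baseline

# Stub suite: pre-scored candidates, baseline at 0.563.
scores = {"candidate-A": 0.540, "candidate-B": 0.938, "candidate-C": 0.710}
winner, score = pick_winner(0.563, list(scores), scores.get)
print(winner, score)  # → candidate-B 0.938
```

When testing several changes at once, run each candidate as a separate prompt variant rather than stacking edits, so a regression is traceable to a single change.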
Test cases: 48

Trigger accuracy:
- Correct fires: 35/36 (97.2%)
- Correct rejects: 11/12 (91.7%) → 1 false positive (down from 5)

Output quality (on correct triggers):
- Correct classification: 34/35 (97.1%)
- Wrong category: 1/35

Combined quality score: 93.8% (45/48 fully correct end-to-end)
Avg tokens/call: 1,020 (prompt) + 290 (completion)
Cost per call: $0.0031
Net result vs. baseline:
- Quality: +37pts (56.3% → 93.8% end-to-end)
- Tokens: -64% (3,220 → 1,310 per call)
- Cost: -66% ($0.0091 → $0.0031 per call)
Step 6: Map to business outcomes
Optimizing prompts in isolation is half the picture. Map each workflow to the revenue it supports to know where optimization matters most.
| Workflow             | Cost/mo | Revenue     | Cost:Revenue | Priority |
|----------------------|---------|-------------|--------------|----------|
| Ticket classifier    | $382    | $18,000     | 2.1%         | HIGH     |
| Lead scoring         | $254    | $18,000     | 1.4%         | HIGH     |
| Contract summarizer  | $89     | $6,000      | 1.5%         | MEDIUM   |
| Onboarding assistant | $136    | cost center |              | REVIEW   |
| Code review agent    | $244    | cost center |              | REVIEW   |
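This mapping is simple to automate. A sketch; the HIGH/MEDIUM split on a $10k/mo revenue threshold is an assumption chosen to reproduce the table above, not a stated rule:

```python
def prioritize(workflows, high_revenue=10_000):
    """workflows: name -> (monthly_cost, monthly_revenue or None).
    Assumed rule: cost centers get REVIEW; revenue-backed workflows are
    HIGH above an illustrative revenue threshold, else MEDIUM."""
    table = {}
    for name, (cost, revenue) in workflows.items():
        if revenue is None:
            table[name] = ("-", "REVIEW")
        else:
            ratio = f"{cost / revenue:.1%}"
            table[name] = (ratio, "HIGH" if revenue >= high_revenue else "MEDIUM")
    return table

table = prioritize({
    "Ticket classifier": (382, 18_000),
    "Contract summarizer": (89, 6_000),
    "Code review agent": (244, None),
})
print(table["Ticket classifier"])  # → ('2.1%', 'HIGH')
```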
Most teams optimize the prompt that annoys them most. This mapping tells you which prompt matters most: the one connected to the highest-value outcome, running at the highest volume, with the most room to improve.
Then do it again
This isn't a one-time cleanup. Prompts drift. New rules get added, models get updated, edge cases pile up. The loop runs continuously:
1. Define success signal (trigger accuracy + output quality)
2. Generate test cases from production inputs
3. Benchmark current quality score and token cost
4. Generate optimization candidates (consensus, debate, or single model)
5. Test candidates → keep if quality improves
6. Map to business outcomes → prioritize by value
7. Re-benchmark quarterly, or after any model/prompt change
The same discipline applies to every layer of your instruction stack. Binary routing checks for orchestrators, rubric-based scoring for task prompts, constraint adherence testing for system prompts. The method adapts. The loop doesn't change.
- Quality: +37pts. Measured, not guessed.
- Cost: -66%. Token savings at higher quality.
- ROI: mapped. Every workflow tied to an outcome.