Layer 3: MCP Tools
How to Optimize MCP Tools
Every tool description is injected on every turn, whether the model needs it or not. Vague descriptions cause wrong tool selection. Too many tools create decision paralysis. Here's a repeatable process to trim, rewrite, and restructure your tool layer.
Step 1: Define your success signal
Tool layer optimization has two signals to measure:

1. Tool selection accuracy: did the model pick the right tool? For each query, check if the model called the correct tool on the first attempt. Wrong tool selection means wasted calls, retries, and broken workflows.
2. Call efficiency: how many tool calls did it take to complete the task? A well-structured tool layer lets the model resolve a query in 1-3 calls. A bloated one forces 5-10 calls as the model guesses, retries, and chains unnecessary lookups.
Tool selection accuracy is the primary signal. If the model picks the wrong tool, call efficiency collapses downstream. Fix selection first, then optimize for fewer calls.
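Both signals fall out of a single scoring pass over logged test runs. A minimal sketch, assuming each run records the expected tool, the first tool the model actually called, and the total call count (the field names here are hypothetical, not part of any SDK):

```python
def score_runs(runs):
    """Return (first-pick accuracy, average calls per task)."""
    correct = sum(1 for r in runs if r["first_tool"] == r["expected_tool"])
    avg_calls = sum(r["num_calls"] for r in runs) / len(runs)
    return correct / len(runs), avg_calls

# Two toy runs: one correct first pick in 2 calls, one wrong pick that took 6.
runs = [
    {"expected_tool": "search_orders", "first_tool": "search_orders", "num_calls": 2},
    {"expected_tool": "get_user", "first_tool": "search_orders", "num_calls": 6},
]
accuracy, avg_calls = score_runs(runs)  # 0.5, 4.0
```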
Step 2: Generate test cases
Pull real queries from production that trigger tool calls. Include queries that should map to a single tool, queries that need multiple tools in sequence, and queries that should not trigger any tool at all.
Single tool (clear mapping):
- "What's the status of order #4821?" → search_orders
- "Update the customer's email address" → update_user
- "Create a ticket for this billing issue" → create_ticket

Multi-tool (sequence required):
- "Refund order #4821 and notify the customer" → get_order → refund → send_notification
- "Look up this user's last 3 orders" → get_user → search_orders

No tool needed:
- "What's your return policy?" → answer from prompt, no tool call
- "Thanks, that's all I needed" → close conversation
- "Can you explain what that error means?" → explain, don't look up

Ambiguous (model has to decide):
- "Search for information about this customer" → get_user? search_orders? both?
- "Send them something about the update" → send_notification? which template?
Aim for 30 to 50 test queries. The ambiguous cases are the most important. Those are where vague tool descriptions cause the model to guess wrong.
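A flat list tagged by category is one convenient way to store the suite, so ambiguous cases can be graded separately from the rest (this schema is an assumption, not part of MCP):

```python
TEST_CASES = [
    {"query": "What's the status of order #4821?",
     "expected_tools": ["search_orders"], "category": "single"},
    {"query": "Refund order #4821 and notify the customer",
     "expected_tools": ["get_order", "refund", "send_notification"],
     "category": "multi"},
    {"query": "What's your return policy?",
     "expected_tools": [], "category": "none"},
    # Ambiguous cases have no single right answer; grade these by hand.
    {"query": "Search for information about this customer",
     "expected_tools": None, "category": "ambiguous"},
]

ambiguous = [c for c in TEST_CASES if c["category"] == "ambiguous"]
```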
Step 3: Benchmark the baseline
Run every test query against the current tool configuration. Record which tool the model selected, whether it was correct, how many total calls it made, and the token overhead from tool descriptions.
Test queries: 48

Tool selection accuracy:
- Correct first pick: 31/48 (64.6%)
- Wrong tool, retried: 11/48 (22.9%)
- Wrong tool, failed: 6/48 (12.5%)

Call efficiency:
- Avg calls per task: 6.2 (target: 2-3)
- Unnecessary calls: 3.1 per task (lookups, retries, wrong tools)

Tool utilization:
- Tools called at least once: 17/28 (60.7%)
- Tools never called: 11/28 (39.3%)

Token overhead:
- Tool descriptions: 6,120 tokens per turn
- Avg total per call: 6,120 (tools) + 1,800 (prompt) + 640 (completion) = 8,560 tokens
- Cost per call: $0.0257
This is your floor. Notice that 11 tools were never called. That is 2,400 tokens of dead weight on every single turn. And the 64.6% first-pick accuracy means the model is guessing wrong on a third of queries.
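The dead-weight figure is easy to recompute from your own logs: sum the description tokens of every tool that never fired. A sketch, assuming a tool list with name/description fields; `count_tokens` is a stand-in for whatever tokenizer you trust:

```python
def dead_weight_tokens(tools, called_names, count_tokens):
    """Tokens spent per turn on descriptions of tools that never fired."""
    unused = [t for t in tools if t["name"] not in called_names]
    return sum(count_tokens(t["description"]) for t in unused)

tools = [
    {"name": "search_orders", "description": "Query orders by user_id or order_id."},
    {"name": "legacy_export", "description": "Export data to the old reporting system."},
]
called = {"search_orders"}
count_tokens = lambda text: len(text.split())  # crude stand-in tokenizer
wasted = dead_weight_tokens(tools, called, count_tokens)  # legacy_export only
```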
Step 4: Generate optimization candidates
Tool layer optimization has three levers. Use agents to propose changes across all three:
Lever 1: Rewrite vague descriptions. The #1 cause of wrong-tool selection is a vague description.

Before:
"Search the database for information. You can search for users, orders, products, or anything else."

After:
"Query orders by user_id or order_id. Returns: order_status, total, created_at. Use ONLY for order lookup. Not users (use get_user) or products (use search_catalog)."

Scoped descriptions with negative constraints tell the model exactly when to use a tool and when NOT to.
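In MCP terms, the rewritten description lives on the tool definition itself. A sketch of the "after" version as an MCP-style tool definition (the `inputSchema` properties shown are illustrative):

```python
SEARCH_ORDERS = {
    "name": "search_orders",
    "description": (
        "Query orders by user_id or order_id. "
        "Returns: order_status, total, created_at. "
        "Use ONLY for order lookup. Not users (use get_user) "
        "or products (use search_catalog)."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "user_id": {"type": "string"},
            "order_id": {"type": "string"},
        },
    },
}
```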
Lever 2: Prune dead tools and consolidate sequences into skills. Remove the 11 tools that were never called. For complex multi-tool sequences, create a skill that orchestrates the calls instead of letting the model figure out the sequence on its own. For example, instead of the model chaining

get_order → check_refund_eligibility → process_refund → send_notification

create a process_refund skill that handles the whole sequence. One tool call instead of four, and less room for error.
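A sketch of what the consolidated skill might look like: the four hypothetical calls from the example wrapped behind one entry point, with the eligibility check enforced in code rather than left to the model. The dependencies are injected as parameters purely to keep the sketch self-contained and testable:

```python
def process_refund_skill(order_id, *, get_order, check_refund_eligibility,
                         process_refund, send_notification):
    """One tool call replacing a four-step chain the model had to plan itself."""
    order = get_order(order_id)
    if not check_refund_eligibility(order):
        # The model never sees a half-finished refund: the guard lives here.
        return {"status": "ineligible", "order_id": order_id}
    refund = process_refund(order)
    send_notification(order["customer_id"], template="refund_confirmed")
    return {"status": "refunded", "refund_id": refund["id"]}
```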
Lever 3: Add routing hints. Add a tool routing section to the system prompt so the model knows which tool to reach for first:

Tool routing
- Order questions → search_orders
- User account issues → get_user, update_user
- Billing disputes → get_invoice, create_ticket
- Refunds → process_refund (skill, handles full sequence)
- Everything else → ask for clarification first
Use the same agent approaches (consensus, debate, or single model) to propose which descriptions to rewrite, which tools to prune, and where to add skills or routing hints.
Step 5: Test candidates against the same baseline
Run the exact same 48 test queries against the optimized tool configuration. Compare tool selection accuracy, call efficiency, and token cost to the baseline.
Test queries: 48

Tool selection accuracy:
- Correct first pick: 45/48 (93.8%)
- Wrong tool, retried: 2/48 (4.2%)
- Wrong tool, failed: 1/48 (2.1%)

Call efficiency:
- Avg calls per task: 2.4 (down from 6.2)
- Unnecessary calls: 0.3 per task (down from 3.1)

Tool configuration:
- Tools exposed: 17 (down from 28, 11 pruned)
- Skills added: 3 (refund, onboarding, escalation)
- Routing hints: 8

Token overhead:
- Tool descriptions: 1,840 tokens per turn (down from 6,120)
- Avg total per call: 1,840 (tools) + 1,800 (prompt) + 380 (completion) = 4,020 tokens
- Cost per call: $0.0121
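The keep/reject decision can be mechanical: accept a candidate tool configuration only if first-pick accuracy goes up and average calls per task go down. A sketch using the numbers above:

```python
def accept(baseline, candidate):
    """Keep a candidate only if selection improves at a lower call count."""
    return (candidate["accuracy"] > baseline["accuracy"]
            and candidate["avg_calls"] < baseline["avg_calls"])

baseline = {"accuracy": 0.646, "avg_calls": 6.2}
candidate = {"accuracy": 0.938, "avg_calls": 2.4}
accept(baseline, candidate)  # True
```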
Results against the baseline:
- Selection: +29pts (64.6% → 93.8% first-pick accuracy)
- Calls: 2.6x fewer (6.2 → 2.4 avg calls per task)
- Cost: -53% ($0.0257 → $0.0121 per call)
Step 6: Map to business outcomes
Tool layer waste compounds differently than prompt or context waste. Every wrong tool call is a wasted API call, added latency, and potential for cascading errors. The cost is not just tokens. It is reliability.
Workflow             Calls/mo    Before    After   Savings/mo
────────────────────────────────────────────────────────────
Support agent          42,000    $1,079     $508         $571
Order management       28,000      $719     $339         $380
Billing automation     15,000      $386     $182         $204
Internal tools          8,500      $218     $103         $115
────────────────────────────────────────────────────────────
Total monthly savings: $1,270
Annual savings: $15,240

Additional: 3.8 fewer tool calls per task on average (6.2 → 2.4) = faster responses, fewer errors, fewer escalations to human agents.
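Each row of the table reduces to one formula: monthly calls × (before − after) per-call cost. Recomputing it from the per-call costs in Step 5 (workflow volumes are the figures above):

```python
COST_BEFORE, COST_AFTER = 0.0257, 0.0121  # $ per call, from Step 5

WORKFLOWS = {
    "Support agent": 42_000,
    "Order management": 28_000,
    "Billing automation": 15_000,
    "Internal tools": 8_500,
}

def monthly_savings(calls_per_month):
    return calls_per_month * (COST_BEFORE - COST_AFTER)

total = sum(monthly_savings(c) for c in WORKFLOWS.values())  # ≈ $1,272/mo
```

The exact total lands a couple of dollars above the table's $1,270 because the table rounds each row to whole dollars.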
Tool layer optimization often has the highest ROI because the savings are multiplicative. Fewer calls per task means less cost, less latency, and fewer failure points. A task that took 6 calls now takes 2.
Then do it again
Tool configurations drift just like prompts. New MCP servers get connected, new tools get exposed, descriptions get copy-pasted from docs without editing. The loop runs continuously:
1. Define success signal (selection accuracy + call efficiency)
2. Generate test queries from production tool call logs
3. Benchmark selection accuracy, call count, and token overhead
4. Generate optimization candidates (rewrite, prune, add skills/routing)
5. Test candidates: keep if selection improves at lower call count
6. Map to business outcomes: prioritize by call volume × savings
7. Re-audit quarterly, or when new tools/servers are added
The same discipline applies to any tool integration. MCP servers, function calling, API tools, custom skills. If the model has to choose between tools, the descriptions and routing determine whether it chooses right.