Save 98% on AI Agent Tokens With This One Trick
Prompt Engineering
5,811 views • 3 days ago
Video Summary
This video explores ten techniques to drastically reduce token usage in AI agents, leading to more cost-effective and efficient operations. Strategies range from simple configuration adjustments to advanced code execution and programmatic tool calling. A standout method, code execution, can achieve up to a 98% reduction in token usage by treating the MCP server as a file system, allowing agents to explore and load only necessary tool definitions. This approach not only minimizes token count but also enhances security by keeping sensitive intermediate data out of the model's context window.
A key takeaway is that by intelligently managing tool definitions and data flow, organizations can significantly lower operational costs and improve AI performance. For instance, the Bright Data MCP server, which is open-source and available on GitHub, offers a free tier of 5,000 requests per month, making it accessible for prototyping and development.
Short Highlights
- Code execution can reduce token usage by up to 98% by treating MCP servers as file systems, loading only necessary tool definitions.
- Tool search, using regex or BM25, helps agents dynamically discover and load thousands of tools on demand, reducing initial token definitions by over 85%.
- Scope loading and specific tool selection allow for dynamic loading of tool groups or individual tools, optimizing token usage by only paying for what's needed.
- Dynamic context loading progressively discloses information to the agent in three levels as it figures out what it needs.
- Programmatic tool calling allows Claude to write Python code to call tools, with intermediate results not entering the model's context window.
- Output optimization techniques include stripping formatting from data and using Token Oriented Object Notation (TOON) for flat, uniform data, potentially saving 30-60% over JSON.
- Combining multiple optimization techniques is recommended for maximum token window efficiency.
- The Bright Data MCP server is open-source, available on GitHub under an MIT license, and offers a free tier of 5,000 requests per month.
Key Details
Code Execution [00:29]
- This method treats the MCP server as a file system, allowing agents to explore and load only the specific tool definitions needed for a task.
- Introduced by Anthropic and independently developed by Cloudflare as "code mode," this pattern significantly reduces token count.
- An example shows moving a Google Drive document to Salesforce, reducing token usage from 150,000 to around 2,000 tokens (a 98% reduction).
- Secondary benefits include filtering large datasets in code before they reach the model and keeping sensitive data out of the context window.
This is the most powerful approach in the video, but it is also the highest-complexity one, because you need a real sandbox with proper isolation and resource limits.
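The pattern above can be sketched in a few lines. In this toy version, the full tool catalog lives outside the model's context (standing in for the MCP "file system"), and the agent's generated code explores it and loads only the definitions the task needs. Server names, tool names, and schemas here are all illustrative, not the actual Bright Data or Anthropic APIs.

```python
# Toy "MCP server as a file system": the catalog stays outside the context
# window; the agent explores it and loads definitions one at a time.
TOOL_CATALOG = {
    "google_drive": {
        "get_document": {"description": "Fetch a document by ID",
                         "input_schema": {"document_id": "string"}},
        "list_files": {"description": "List files in a folder",
                       "input_schema": {"folder_id": "string"}},
    },
    "salesforce": {
        "update_record": {"description": "Update a CRM record",
                          "input_schema": {"record_id": "string",
                                           "fields": "object"}},
    },
}

def list_servers():
    """Exploration step: only server names enter the context."""
    return sorted(TOOL_CATALOG)

def list_tools(server):
    """Exploration step: tool names for one server, nothing more."""
    return sorted(TOOL_CATALOG[server])

def load_tool(server, tool):
    """Load a single full tool definition on demand."""
    return TOOL_CATALOG[server][tool]

# The agent loads just the two tools this task needs; every other
# definition never consumes a single context token.
context = [load_tool("google_drive", "get_document"),
           load_tool("salesforce", "update_record")]
```

The 98% saving in the video comes from exactly this asymmetry: the context holds two definitions instead of the entire catalog.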
Tool Search Tools [02:27]
- Agents are given basic tools and a search tool to dynamically discover and load additional tools from a catalog on demand.
- Variants use regex or BM25 for natural language ranking, enabling agents to work with thousands of tools.
- Marking tool definitions with a deferred-loading flag (e.g., `defer_loading: true` in Anthropic's tool search feature) lets them be loaded dynamically when needed.
- This approach reduces tool definition tokens by over 85% compared to loading all definitions upfront (e.g., 55,000 tokens reduced).
The numbers Anthropic published here were pretty significant. A typical multi-server setup adds up to around 55,000 tokens of tool definitions before any work starts.
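A minimal version of the regex variant can be sketched as follows: only a search tool ships upfront, and the agent queries a catalog of deferred tools by name or description. The catalog entries are made up for illustration; a production version would rank with BM25 over real tool definitions.

```python
import re

# Minimal regex-based tool search over a catalog of deferred tools.
# Only this search function is loaded upfront; the thousands of real
# definitions stay out of the context until a match is requested.
CATALOG = {
    "scrape_page": "Scrape a web page and return its text content",
    "search_engine": "Run a web search and return organic results",
    "create_invoice": "Create an invoice in the billing system",
    "send_email": "Send an email through the configured provider",
}

def tool_search(pattern):
    """Return names of tools whose name or description matches the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name, desc in CATALOG.items()
            if rx.search(name) or rx.search(desc)]
```

With this in place, the agent pays for one small search-tool definition up front instead of ~55,000 tokens of full definitions, then loads matches on demand.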
Scope Loading [03:44]
- Tools are grouped by similarity (e.g., e-commerce, finance, social media), and only the specific group needed for a session is dynamically loaded.
- Introduced by Bright Data, this method allows users to specify groups via URL parameters or environment variables.
- This provides efficient token utilization by loading only necessary tools, saving costs.
- Further optimization involves defining specific tools for a session, loading only those exact tool names.
You only pay for the tools you actually need and the implementation is open source so you can use the same pattern in your own MCP servers as well.
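The scope-loading idea can be sketched with an environment variable selecting which groups to expose. The variable name, group names, and tool names below are illustrative; Bright Data's server also accepts the equivalent via URL parameters.

```python
import os

# Scope loading sketch: tools are grouped by domain, and only the groups
# named in an env var (comma-separated) are loaded for the session.
TOOL_GROUPS = {
    "ecommerce": ["amazon_product", "walmart_product"],
    "finance": ["stock_quote", "crypto_price"],
    "social": ["instagram_profile", "x_posts"],
}

def load_scoped_tools(env_var="MCP_TOOL_GROUPS"):
    """Return only the tools belonging to the requested groups."""
    requested = os.environ.get(env_var, "").split(",")
    return [tool for group in requested
            for tool in TOOL_GROUPS.get(group.strip(), [])]

os.environ["MCP_TOOL_GROUPS"] = "ecommerce"
print(load_scoped_tools())  # only the e-commerce group is loaded
```

The further optimization the video mentions, naming individual tools instead of groups, is the same lookup one level deeper: the variable would hold exact tool names rather than group names.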
Dynamic Context Loading [04:44]
- Inspired by Claude skills, this method progressively discloses information to the agent in three levels as it determines its needs.
- Level 1: Agent is told which MCP servers are available.
- Level 2: Agent receives a list of tools within a relevant server with one-line summaries.
- Level 3: Agent pulls the name, description, and input schema for a specific tool.
- This saves tokens by only including relevant information and can be combined with group or custom tool approaches.
The idea is that you have three levels of disclosure and the agent walks down them as it figures out what it needs.
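The three levels can be sketched as three functions the agent calls in sequence, each adding detail only for the branch it has chosen. Server and tool names below are illustrative placeholders.

```python
# Three-level progressive disclosure, one function per level.
SERVERS = {
    "brightdata": {
        "scrape_page": {"summary": "Scrape a page as markdown or text",
                        "description": "Fetches a URL and returns its content.",
                        "input_schema": {"url": "string", "format": "string"}},
        "search_engine": {"summary": "Web search with ranked results",
                          "description": "Runs a query against a search engine.",
                          "input_schema": {"query": "string"}},
    },
}

def level_1():
    """Level 1: which MCP servers exist."""
    return list(SERVERS)

def level_2(server):
    """Level 2: tool names with one-line summaries."""
    return {name: t["summary"] for name, t in SERVERS[server].items()}

def level_3(server, tool):
    """Level 3: full definition -- name, description, input schema."""
    t = SERVERS[server][tool]
    return {"name": tool, "description": t["description"],
            "input_schema": t["input_schema"]}
```

Only when the agent commits to a specific tool does its full schema enter the context; everything else stays at summary granularity or below.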
Programmatic Tool Calling [07:10]
- A separate Anthropic API feature where Claude writes code to call tools as Python functions.
- Crucially, intermediate results from these tool calls do not enter the model's context window; only the final code output does.
- This shifts the model from reasoning over hundreds of kilobytes to reasoning over a handful of lines.
- It involves marking callable tools with `allowed_callers` set to the code execution tool and adding the code execution tool to the tool list.
Adding programmatic tool calling on top of a basic search was the key factor that fully unlocked agent performance.
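A request using this feature looks roughly like the sketch below. The exact type and version identifiers for the code execution tool are assumptions here; check the current Anthropic API documentation for the real strings before using them.

```python
# Hedged sketch of a Messages API request shape for programmatic tool
# calling. Version strings, model name, and the example tool are
# illustrative, not verified against the live API.
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 4096,
    "tools": [
        # The code execution tool itself must be in the tool list.
        {"type": "code_execution_20250825", "name": "code_execution"},
        {
            "name": "query_orders",
            "description": "Query orders from the store database",
            "input_schema": {"type": "object",
                             "properties": {"status": {"type": "string"}}},
            # Marking the tool as callable from generated code means its
            # intermediate results stay out of the model's context window.
            "allowed_callers": ["code_execution_20250825"],
        },
    ],
    "messages": [{"role": "user",
                  "content": "Summarize how many orders are still pending."}],
}
```

The key line is `allowed_callers`: Claude's Python code calls `query_orders` directly, and only the code's final output is returned into the conversation.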
Layered MCP Server Design [08:29]
- This architectural pattern involves three layers between the LLM and the underlying MCP implementation: discovery, planning, and execution.
- It functions like a sub-agent design, where most actions occur within the sub-agent, keeping the orchestrator's context pristine.
- This approach is most beneficial at scale, with numerous MCP servers or when different teams manage distinct tools.
- It provides a clean interface in front of underlying tools and servers.
This is more of an architectural pattern than a feature you turn on, and it really only makes sense at scale, when you have many underlying MCP servers or when different teams own different tools and you want a clean interface in front of them.
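As a rough sketch under those assumptions, the three layers can be modeled as a thin facade: discovery answers "what exists", planning picks tools, and execution runs them while keeping intermediate work inside the sub-agent layer. All class and tool names here are invented for illustration.

```python
# Toy three-layer facade: discovery, planning, execution. The orchestrator
# only ever sees the final execute() result, keeping its context pristine.
class LayeredMCP:
    def __init__(self, servers):
        self.servers = servers  # {server_name: {tool_name: callable}}

    def discover(self):
        """Discovery layer: what servers and tools exist."""
        return {s: list(tools) for s, tools in self.servers.items()}

    def plan(self, task_keywords):
        """Planning layer: pick tools whose name mentions a task keyword."""
        return [(s, t) for s, tools in self.servers.items()
                for t in tools if any(k in t for k in task_keywords)]

    def execute(self, plan, **kwargs):
        """Execution layer: run the plan; intermediate work stays here."""
        return [self.servers[s][t](**kwargs) for s, t in plan]

mcp = LayeredMCP({"crm": {"fetch_contact": lambda **kw: {"name": "Ada"}}})
plan = mcp.plan(["contact"])
print(mcp.execute(plan))
```

In a real deployment each layer would sit in front of actual MCP servers, but the division of responsibility is the same.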
Input and Output Token Optimization [09:22]
- Input optimization involves stripping formatting (like markdown) from data, such as web search results, to pass only plain text.
- Further input optimization includes light parsing of Google search results to extract only top organic results, excluding ads and related searches.
- Output optimization involves using Token Oriented Object Notation (TOON) for flat, uniform data.
- TOON declares field names once and streams values like CSV, reducing token usage by 30-60% compared to standard JSON, but it's less effective for deeply nested structures.
The principle holds: this is practical to consider if you're implementing an MCP server with many tools that return documents or web pages containing formatting.
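The TOON idea for flat, uniform rows can be sketched with a tiny encoder: declare the field names once in a header, then stream values CSV-style. This is a simplified sketch of the format, not a full TOON implementation.

```python
import json

# Minimal TOON-style encoder for flat, uniform rows: one header line
# declaring the fields, then one comma-separated value line per row.
def toon_encode(rows):
    fields = list(rows[0])
    header = f"rows[{len(rows)}]{{{','.join(fields)}}}:"
    lines = [",".join(str(r[f]) for f in fields) for r in rows]
    return "\n".join([header] + [" " + ln for ln in lines])

rows = [{"id": 1, "name": "alpha", "price": 9.5},
        {"id": 2, "name": "beta", "price": 12.0}]
toon = toon_encode(rows)
# Field names appear once instead of once per row, which is where the
# savings over JSON come from on flat tabular data.
assert len(toon) < len(json.dumps(rows))
```

The same comparison run on deeply nested structures would show much smaller gains, matching the caveat above.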
Stacking Techniques for Maximum Efficiency [11:24]
- The video emphasizes combining multiple approaches for optimal token window utilization.
- Recommended stacking includes: tool groups for scoping, tool search for un-grouped tools, programmatic tool calling for multi-step workflows, output stripping for formatted data, and TOON encoding for flat tabular responses.
- Code execution with MCP can replace direct tool calls entirely for maximum effect.
- The Bright Data MCP server is open-source and available on GitHub under an MIT license, with a generous free tier for prototyping.
The beauty is that all of this is open source; in particular, the Bright Data MCP server is on GitHub under an MIT license.