Bad MCP design makes your Agent burn 5x the tokens
2 points • by JohnnyZhang483 • 27 days ago
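I recently tested two MCP servers with identical functionality, and one of them turned out to be far less efficient. Here are the MCP design patterns that caused the gap.
It all started when I wrote an MCP server (MCP-A) for a to-do list app. Later, the app officially released its own MCP server (MCP-B). Both servers offer the same functionality and call the same backend API.
The experiment was set up as follows:
- Both MCP servers connect to the same to-do list account, which is reset after each test.
- 40 test prompts simulate typical use cases for these MCPs.
- Every test uses the same model, system prompt, and Agent framework.
Here are the results: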
| Metric              | MCP-A       | MCP-B       | Gap (B/A) |
| ------------------- | ----------- | ----------- | --------- |
| Tool desc length    | 11,464      | 3,682       | n/a       |
| Pass rate           | 36/40 (90%) | 36/40 (90%) | Same      |
| Total input tokens  | 637,244     | 3,174,329   | 4.98×     |
| Total output tokens | 17,301      | 23,238      | 1.34×     |
| Total Agent steps   | 122         | 157         | 1.29×     |
| Total time          | 597 s       | 676 s       | 1.13×     |
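MCP-B needed 35 more ReAct loops than MCP-A to complete the 40 test cases (157 vs. 122 steps, about 29% more), which also drove output tokens up by 34%. I examined the logs and found that the root cause is poor query tool design.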
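The gaps in the results table can be re-derived from the raw totals. A minimal Python sketch, using only the numbers reported in the table (the variable and function names are mine):

```python
# Totals copied from the results table, as (MCP-A, MCP-B) pairs.
totals = {
    "input_tokens": (637_244, 3_174_329),
    "output_tokens": (17_301, 23_238),
    "agent_steps": (122, 157),
    "time_seconds": (597, 676),
}


def gap(a: int, b: int) -> float:
    """Return how many times larger MCP-B's value is than MCP-A's."""
    return b / a


gaps = {name: round(gap(a, b), 2) for name, (a, b) in totals.items()}
extra_steps = totals["agent_steps"][1] - totals["agent_steps"][0]

for name, ratio in gaps.items():
    print(f"{name}: {ratio}x")
print(f"extra agent steps: {extra_steps}")
```

The input-token gap dwarfs the step gap, which suggests that the size of each tool result matters more than the number of extra calls.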
Take `search_tool` as an example. Its job is to find a to-do item in the to-do list. In MCP-B, the tool returns this:
```json
{
"id": "6a1916b48f08cb3a4c857ed0",
"title": "buy some groceries",
"url": "https://todo.example.com/tasks/6a1916b48f08cb3a4c857ed0"
}
```
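But other CRUD operations require `project_id`, and `search_tool` doesn't return it, so the Agent has to call another tool, `get_task_by_id`, before it can act. MCP-A's `query_tasks`, by contrast, returns everything needed for the next action in a single call:
Task 1:
ID: 6a19143e8f084a8c8101612f
Title: buy some groceries
Project ID: 6a1914378f084a8c810160a9
Start Date: 2025-07-19 10:00:00
Priority: Medium
Status: Active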
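A minimal sketch of how a query tool could render a result like the one above. The raw field names (`projectId`, `startDate`, `priority`, `status`) and the function `format_tasks` are assumptions for illustration, not MCP-A's actual code or the app's real API schema.

```python
from typing import Any


def format_tasks(raw_tasks: list[dict[str, Any]]) -> str:
    """Render tasks as plain text, including the IDs needed for follow-up calls."""
    blocks = []
    for index, task in enumerate(raw_tasks, start=1):
        lines = [
            f"Task {index}:",
            f"ID: {task['id']}",
            f"Title: {task['title']}",
            # The project ID is what lets the Agent update or delete the task
            # without a second lookup call.
            f"Project ID: {task['projectId']}",
            f"Start Date: {task.get('startDate', 'None')}",
            f"Priority: {task.get('priority', 'None')}",
            f"Status: {task.get('status', 'Unknown')}",
        ]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Hypothetical raw API record, shaped after the example output.
example = [{
    "id": "6a19143e8f084a8c8101612f",
    "title": "buy some groceries",
    "projectId": "6a1914378f084a8c810160a9",
    "startDate": "2025-07-19 10:00:00",
    "priority": "Medium",
    "status": "Active",
}]
print(format_tasks(example))
```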
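Unfiltered API data is dumped into the context window
If an MCP server passes raw API results into the Agent's context unprocessed, the context window fills up very quickly.
Take MCP-B's `create_task` tool as an example. Its job is to create a to-do item. This is what it returns: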
```json
{
"id": "6a180de78f086bdead0608be",
"projectId": "inbox125587327",
.....
"createdTime": "2026-05-28T09:41:59+0000",
"modifiedTime": "2026-05-28T09:41:59+0000",
"focusSummaries": null
}
```
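These 600+ characters are irrelevant to the Agent's task, yet they are still dumped into its context, and since the accumulated context is typically resent to the model on every later step, the cost compounds. MCP-A's `create_tasks`, by contrast, adds a filtering and formatting layer. This small change makes a huge difference in input token usage.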
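One way to picture such a filtering layer is an allowlist applied before the result reaches the model. The sketch below is only an assumption about what that could look like: `KEEP_FIELDS`, `summarize_created_task`, and the sample payload (including its `title` field) are hypothetical, and the real response contains more fields than shown here.

```python
import json
from typing import Any

# Hypothetical allowlist: only the fields the Agent needs for its next step.
KEEP_FIELDS = ("id", "projectId", "title")


def summarize_created_task(api_response: dict[str, Any]) -> str:
    """Return a short confirmation instead of the raw API payload."""
    kept = {k: api_response[k] for k in KEEP_FIELDS if k in api_response}
    details = ", ".join(f"{key}={value}" for key, value in kept.items())
    return f"Task created: {details}"


# Illustrative payload only; it is not the full response from the real API.
raw = {
    "id": "6a180de78f086bdead0608be",
    "projectId": "inbox125587327",
    "title": "buy some groceries",
    "createdTime": "2026-05-28T09:41:59+0000",
    "modifiedTime": "2026-05-28T09:41:59+0000",
    "focusSummaries": None,
}

summary = summarize_created_task(raw)
print(summary)
print(len(json.dumps(raw)), "chars raw vs", len(summary), "chars filtered")
```

An allowlist is used rather than a blocklist so that fields added to the API later stay out of the context unless someone opts them in.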
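Another issue is tool count. More tools mean a larger candidate set for the model to choose from, which directly increases decision difficulty. In MCP-A, 47 tools were compressed down to 14, covering the same functionality with fewer tools.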
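As a rough illustration of consolidation, several narrow lookup tools can often be folded into one tool with optional filters. Everything in this sketch is hypothetical: the `Task` type, the `query_tasks` signature, and the narrow tools named in the docstring are not taken from either server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    id: str
    title: str
    project_id: str
    status: str


def query_tasks(
    tasks: list[Task],
    text: Optional[str] = None,
    project_id: Optional[str] = None,
    status: Optional[str] = None,
) -> list[Task]:
    """One tool with optional filters, standing in for separate tools such as
    search_by_title, list_project_tasks, and list_completed_tasks."""
    results = tasks
    if text is not None:
        results = [t for t in results if text.lower() in t.title.lower()]
    if project_id is not None:
        results = [t for t in results if t.project_id == project_id]
    if status is not None:
        results = [t for t in results if t.status == status]
    return results


tasks = [
    Task("t1", "buy some groceries", "p1", "Active"),
    Task("t2", "file taxes", "p2", "Completed"),
    Task("t3", "buy stamps", "p1", "Completed"),
]
print([t.id for t in query_tasks(tasks, text="buy")])
print([t.id for t in query_tasks(tasks, project_id="p1", status="Completed")])
```

Because the filters combine, one tool definition covers queries that would otherwise each need their own tool and description.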
---
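So here are my takeaways on good MCP tool design:
- When designing a tool, think about what the Agent will need next, not just what it is asking for right now. Return enough context in the result that the Agent can take the next action without another round trip.
- Too many tools increase the model's decision burden, so keep the number of tools in an MCP server small and make sure their functionality doesn't overlap.
- When your MCP server returns data to the LLM, keep it LLM-friendly, meaning easy to read. Filter out unnecessary fields from the API response and format the data rather than passing raw JSON straight through.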
---
All the tests above were run with MCP-Eval, a benchmarking tool for MCP servers. If you want to check your own MCP server's performance, feel free to take a look:
https://github.com/Code-MonkeyZhang/mcp-eval