# MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

[Paper](https://arxiv.org/abs/2508.20453) · [Leaderboard](https://huggingface.co/spaces/mcpbench/mcp-bench) · [License](https://opensource.org/licenses/Apache-2.0) · [Python](https://www.python.org/downloads/) · [MCP](https://github.com/anthropics/mcp)

## Overview

MCP-Bench is a comprehensive evaluation framework designed to assess Large Language Models' (LLMs) capabilities in tool-use scenarios through the Model Context Protocol (MCP). The benchmark provides an end-to-end pipeline for evaluating how effectively different LLMs can discover, select, and utilize tools to solve real-world tasks.

## News

* [2025-09] MCP-Bench is accepted to the NeurIPS 2025 Workshop on Scaling Environments for Agents.

## Leaderboard

| Rank | Model | Overall Score |
|------|-------|---------------|
| 1 | gpt-5 | 0.749 |
| 2 | o3 | 0.715 |
| 3 | gpt-oss-120b | 0.692 |
| 4 | gemini-2.5-pro | 0.690 |
| 5 | claude-sonnet-4 | 0.681 |
| 6 | qwen3-235b-a22b-2507 | 0.678 |
| 7 | glm-4.5 | 0.668 |
| 8 | gpt-oss-20b | 0.654 |
| 9 | kimi-k2 | 0.629 |
| 10 | qwen3-30b-a3b-instruct-2507 | 0.627 |
| 11 | gemini-2.5-flash-lite | 0.598 |
| 12 | gpt-4o | 0.595 |
| 13 | gemma-3-27b-it | 0.582 |
| 14 | llama-3-3-70b-instruct | 0.558 |
| 15 | gpt-4o-mini | 0.557 |
| 16 | mistral-small-2503 | 0.530 |
| 17 | llama-3-1-70b-instruct | 0.510 |
| 18 | nova-micro-v1 | 0.508 |
| 19 | llama-3-2-90b-vision-instruct | 0.495 |
| 20 | llama-3-1-8b-instruct | 0.428 |

*The Overall Score is the average performance across all evaluation dimensions: rule-based schema understanding, plus LLM-judged (o4-mini as the judge model) task completion, tool usage, and planning effectiveness. Scores are averaged across the single-server and multi-server settings.*

## Quick Start

### Installation

1. **Clone the repository**

   ```bash
   git clone https://github.com/accenture/mcp-bench.git
   cd mcp-bench
   ```
2. **Install dependencies**

   ```bash
   conda create -n mcpbench python=3.10
   conda activate mcpbench
   cd mcp_servers
   # Install MCP server dependencies
   bash ./install.sh
   cd ..
   ```

3. **Set up environment variables**

   ```bash
   # Create a .env file with your API keys.
   # The default setup uses both OpenRouter and Azure OpenAI.
   # For Azure OpenAI, you also need to set your API version in benchmark_config.yaml (line 205).
   # For an OpenRouter-only setup, see the "Optional: Using only OpenRouter API" section below.
   cat > .env << EOF
   export OPENROUTER_API_KEY="your_openrouterkey_here"
   export AZURE_OPENAI_API_KEY="your_azureopenai_apikey_here"
   export AZURE_OPENAI_ENDPOINT="your_azureopenai_endpoint_here"
   EOF
   ```

4. **Configure MCP server API keys**

   Some MCP servers require external API keys to function properly. These keys are loaded automatically from `./mcp_servers/api_key`; you need to set them yourself in that file:

   ```bash
   # View configured API keys
   cat ./mcp_servers/api_key
   ```

   The required API keys are all free and easy to obtain (you can get all of them within about 10 minutes):

   - `NPS_API_KEY`: National Park Service API key (for the nationalparks server) - [Get API key](https://www.nps.gov/subjects/developer/get-started.htm)
   - `NASA_API_KEY`: NASA Open Data API key (for the nasa-mcp server) - [Get API key](https://api.nasa.gov/)
   - `HF_TOKEN`: Hugging Face token (for the huggingface-mcp-server) - [Get token](https://huggingface.co/docs/hub/security-tokens)
   - `GOOGLE_MAPS_API_KEY`: Google Maps API key (for the mcp-google-map server) - [Get API key](https://developers.google.com/maps)
   - `NCI_API_KEY`: National Cancer Institute API key (for the biomcp server) - [Get API key](https://clinicaltrialsapi.cancer.gov/signin). The registration site may require a US IP address to open; see Issue #10 if you have difficulties getting this key.
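Before running the benchmark, it can save a failed run to sanity-check that every required key above is actually set. The sketch below is a minimal, hypothetical checker, assuming `./mcp_servers/api_key` contains simple `KEY=value` lines (the function name and file format here are illustrative assumptions, not part of MCP-Bench itself):

```python
# Hypothetical pre-flight check for the MCP server key file.
# Assumes ./mcp_servers/api_key holds KEY=value lines (an assumption).

REQUIRED_KEYS = ["NPS_API_KEY", "NASA_API_KEY", "HF_TOKEN",
                 "GOOGLE_MAPS_API_KEY", "NCI_API_KEY"]

def missing_keys(text: str) -> list[str]:
    """Return the required keys that are absent or empty in the file text."""
    found: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        name, _, value = line.partition("=")
        found[name.strip()] = value.strip().strip('"')
    return [k for k in REQUIRED_KEYS if not found.get(k)]
```

Running it on the contents of `./mcp_servers/api_key` and asserting the result is empty makes missing keys fail fast, instead of surfacing later as individual server connection errors.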
### Basic Usage

```bash
# 1. Verify that all MCP servers can be connected.
# You should see "28/28 servers connected" and
# "All successfully connected servers returned tools!" after running this.
python ./utils/collect_mcp_info.py

# 2. List available models
source .env
python run_benchmark.py --list-models

# 3. Run the benchmark (gpt-oss-20b as an example).
# You must use o4-mini as the judge model (hard-coded in lines 429-436 of
# ./benchmark/runner.py) to reproduce the results.

## Run all tasks
source .env
python run_benchmark.py --models gpt-oss-20b

## Single-server tasks
source .env
python run_benchmark.py --models gpt-oss-20b \
  --tasks-file tasks/mcpbench_tasks_single_runner_format.json

## Two-server tasks
source .env
python run_benchmark.py --models gpt-oss-20b \
  --tasks-file tasks/mcpbench_tasks_multi_
```
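The leaderboard note above states that the Overall Score averages the per-dimension scores and then averages the single-server and multi-server settings. As a rough illustration of that aggregation (the dictionary keys and function name below are illustrative assumptions, not MCP-Bench's actual output schema):

```python
# Sketch of the two-level averaging described in the leaderboard note:
# mean over evaluation dimensions within each setting, then mean over
# the single-server and multi-server settings. Field names are assumed.

def overall_score(single: dict[str, float], multi: dict[str, float]) -> float:
    """Average dimension scores per setting, then average the two settings."""
    per_setting = [sum(s.values()) / len(s) for s in (single, multi)]
    return sum(per_setting) / len(per_setting)
```

For example, dimension means of 0.70 (single-server) and 0.60 (multi-server) yield an overall score of 0.65.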