Definition
A benchmark of tool-calling tasks built on top of the Model Context Protocol.
A benchmark suite evaluating LLM tool-use agents on MCP-served tasks including filesystem and Postgres workloads, used as out-of-distribution transfer evaluation for Firefly-trained models.