
How to Evaluate Graph Retrieval in MCP Agentic Systems

These days, it’s all about agents, which I’m all for, and they go beyond basic vector search by giving LLMs access to a wide range of tools:

  • Web search
  • Various API calls
  • Querying different databases

While there’s a surge in new MCP servers being developed, there’s surprisingly little evaluation happening. Sure, you can hook an LLM up to a variety of tools, but do you really know how it’s going to behave? That’s why I’m planning a series of blog posts focused on evaluating both off-the-shelf and custom graph MCP servers, especially those that retrieve information from Neo4j.

Model Context Protocol (MCP) is Anthropic’s open standard that functions like “a USB-C port for AI applications,” standardizing how AI systems connect to external data sources through lightweight servers that expose specific capabilities to clients. The key insight is reusability. Instead of custom integrations for every data source, developers build reusable MCP servers once and share them across multiple AI applications.

Image from: https://modelcontextprotocol.io/introduction. Licensed under MIT.

An MCP server implements the Model Context Protocol, exposing tools and data to an AI client via structured JSON-RPC calls. It handles requests from the client and executes them against local or remote APIs, returning results to enrich the AI’s context.
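To make the “structured JSON-RPC calls” concrete, here is roughly what a tool invocation looks like on the wire, written as a Python dict. The tool name and argument key follow mcp-neo4j-cypher’s conventions as I understand them and may differ across server versions, so treat this as an illustrative sketch rather than a spec.

```python
# Illustrative shape of an MCP tool call as a JSON-RPC 2.0 request.
# Tool name and argument key are assumptions based on mcp-neo4j-cypher;
# check the server's documentation for the exact names.
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_neo4j_cypher",  # a read-only Cypher tool exposed by the server
        "arguments": {"query": "MATCH (n) RETURN count(n) AS nodes"},
    },
}
```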

To evaluate MCP servers and their retrieval methods, the first step is to generate an evaluation dataset, something we’ll use an LLM to help with. In the second stage, we’ll take an off-the-shelf mcp-neo4j-cypher server and test it against the benchmark dataset we created.

Agenda of this blog post. Image by author.

The goal for now is to establish a solid dataset and framework so we can consistently compare different retrievers throughout the series.

Code is available on GitHub.

Evaluation dataset

Last year, Neo4j released the Text2Cypher (2024) Dataset, which was designed around a single-step approach to Cypher generation. In single-step Cypher generation, the system receives a natural language question and must produce one complete Cypher query that directly answers that question, essentially a one-shot translation from text to database query.
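As a concrete illustration of that single-step format (using the familiar movies-style schema, not the benchmark data itself), a text2cypher example pairs one question with exactly one query:

```python
# A single-step text2cypher pair: one question, one complete Cypher query.
# The schema and values here are illustrative, not taken from the dataset.
example = {
    "question": "How many movies did Tom Hanks act in?",
    "cypher": (
        "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) "
        "RETURN count(m) AS movie_count"
    ),
}
```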

However, this approach doesn’t reflect how agents actually work with graph databases in practice. Agents operate through multi-step reasoning: they can execute multiple tools iteratively, generate several Cypher statements in sequence, analyze intermediate results, and combine findings from different queries to build up to a final answer. This iterative, exploratory approach represents a fundamentally different paradigm from the prescribed single-step model.

Predefined text2cypher flow vs agentic approach, where multiple tools can be called. Image by author.

The current benchmark dataset fails to capture this difference of how MCP servers actually get used in agentic workflows. The benchmark needs updating to evaluate multi-step reasoning capabilities rather than just single-shot text2cypher translation. This would better reflect how agents navigate complex information retrieval tasks that require breaking down problems, exploring data relationships, and synthesizing results across multiple database interactions.

Evaluation metrics

The most important shift when moving from single-step text2cypher evaluation to an agentic approach lies in how we measure accuracy.

Difference between single-shot text2cypher and agentic evaluation. Image by author.

In traditional text2query tasks like text2cypher, evaluation typically involves comparing the database response directly to a predefined ground truth, often checking for exact matches or equivalence.

However, agentic approaches introduce a key change. The agent may perform multiple retrieval steps, choose different query paths, or even rephrase the original intent along the way. As a result, there may be no single correct query. Instead, we shift our focus to evaluating the final answer generated by the agent, regardless of the intermediate queries it used to arrive there.

To assess this, we use an LLM-as-a-judge setup, comparing the agent’s final answer against the expected answer. This lets us evaluate the semantic quality and usefulness of the output rather than the internal mechanics or specific query results.
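A minimal sketch of such a judge is shown below, here using LangChain with a hand-rolled prompt. The model choice, prompt wording, and 0–1 scoring scale are assumptions for illustration, not the exact setup used in the benchmark.

```python
from langchain_openai import ChatOpenAI

# Minimal LLM-as-a-judge sketch: grade the agent's final answer against the
# expected answer. Model and scoring scale are illustrative assumptions.
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

JUDGE_PROMPT = """You are grading a question-answering agent.
Question: {question}
Expected answer: {expected}
Agent answer: {answer}
Return only a score between 0 and 1, where 1 means the agent answer
is semantically equivalent to the expected answer."""

def judge_answer(question: str, expected: str, answer: str) -> float:
    # A production judge would parse the response more defensively;
    # this sketch assumes the model returns a bare number.
    response = judge.invoke(
        JUDGE_PROMPT.format(question=question, expected=expected, answer=answer)
    )
    return float(response.content.strip())
```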

Result Granularity and Agent Behavior

Another important consideration in agentic evaluation is the amount of data returned from the database. In traditional text2cypher tasks, it’s common to allow or even expect large query results, since the goal is to test whether the correct data is retrieved. However, this approach doesn’t translate well to evaluating agentic workflows.

In an agentic setting, we’re not just testing whether the agent can access the correct data, but whether it can generate a concise, accurate final answer. If the database returns too much information, the evaluation becomes entangled with other variables, such as the agent’s ability to summarize or navigate large outputs, rather than focusing on whether it understood the user’s intent and retrieved the correct information.

Introducing Real-World Noise

To further align the benchmark with real-world agentic usage, we also introduce controlled noise into the evaluation prompts.

Introducing real-world noise to evaluation. Image by author.

This includes elements such as:

  • Typographical errors in named entities (e.g., “Andrwe Carnegie” instead of “Andrew Carnegie”),
  • Colloquial phrasing or informal language (e.g., “show me what’s up with Tesla’s board” instead of “list members of Tesla’s board of directors”),
  • Overly broad or under-specified intents that require follow-up reasoning or clarification.

These variations reflect how users actually interact with agents in practice. In real deployments, agents must handle messy inputs, incomplete formulations, and conversational shorthand, which are conditions rarely captured by clean, canonical benchmarks.
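As a rough sketch of how typographical noise can be injected programmatically, the helper below swaps two adjacent characters in an entity name. This is an illustrative approach, not the exact procedure used to build the benchmark.

```python
import random

def add_typo(entity: str, seed: int | None = None) -> str:
    """Introduce a single adjacent-character swap into an entity name,
    e.g. turning 'Andrew Carnegie' into something like 'Andrwe Carnegie'.
    Illustrative only."""
    rng = random.Random(seed)
    chars = list(entity)
    # Pick a position where swapping with the next character changes the string
    # and doesn't touch the whitespace between words.
    candidates = [
        i for i in range(len(chars) - 1)
        if chars[i] != chars[i + 1] and " " not in (chars[i], chars[i + 1])
    ]
    if not candidates:
        return entity
    i = rng.choice(candidates)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(add_typo("Andrew Carnegie", seed=3))
```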

To better reflect these insights around evaluating agentic approaches, I’ve created a new benchmark using Claude 4.0. Unlike traditional benchmarks that focus on Cypher query correctness, this one is designed to assess the quality of the final answers produced by multi-step agents.

Databases

To ensure a variety of evaluations, we use a couple of different databases that are available on the Neo4j demo server. Examples include:

MCP-Neo4j-Cypher server

mcp-neo4j-cypher is a ready-to-use MCP tool interface that allows agents to interact with Neo4j through natural language. It supports three core functions: viewing the graph schema, running Cypher queries to read data, and executing write operations to update the database. Results are returned in a clean, structured format that agents can easily understand and use.

mcp-neo4j-cypher overview. Image by author.

It works out of the box with any framework that supports MCP servers, making it simple to plug into existing agent setups without extra integration work. Whether you’re building a chatbot, data assistant, or custom workflow, this tool lets your agent safely and intelligently work with graph data.

Benchmark

Finally, let’s run the benchmark evaluation.
We used LangChain to host the agent and connect it to the mcp-neo4j-cypher server, which is the only tool provided to the agent. This setup makes the evaluation simple and realistic: the agent must rely entirely on natural language interaction with the MCP interface to retrieve and manipulate graph data.
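A condensed sketch of that setup is shown below, assuming the langchain-mcp-adapters and langgraph packages. The connection details, credentials, and model identifier are placeholders, and the adapter API has shifted between versions, so check the current documentation before copying this verbatim.

```python
import asyncio
from langchain_anthropic import ChatAnthropic
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent

async def main() -> None:
    # Launch mcp-neo4j-cypher over stdio; URI and credentials are placeholders.
    client = MultiServerMCPClient(
        {
            "neo4j": {
                "command": "uvx",
                "args": ["mcp-neo4j-cypher"],
                "env": {
                    "NEO4J_URI": "neo4j+s://<your-instance>",
                    "NEO4J_USERNAME": "<user>",
                    "NEO4J_PASSWORD": "<password>",
                },
                "transport": "stdio",
            }
        }
    )
    # The MCP server's tools (schema, read and write Cypher) are the only
    # tools handed to the agent.
    tools = await client.get_tools()

    model = ChatAnthropic(model="claude-3-7-sonnet-latest")
    agent = create_react_agent(model, tools)

    result = await agent.ainvoke(
        {"messages": [("user", "How many movies are in the database?")]}
    )
    print(result["messages"][-1].content)

asyncio.run(main())
```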

For the evaluation, we used Claude 3.7 Sonnet as the agent and GPT-4o Mini as the judge.
The benchmark dataset includes approximately 200 natural language question-answer pairs, categorized by number of hops (1-hop, 2-hop, etc.) and whether the queries contain distracting or noisy information. This structure helps assess the agent’s reasoning accuracy and robustness in both clean and noisy contexts. The evaluation code is available on GitHub.

Let’s examine the results together.

mcp-neo4j-cypher evaluation. Image by author

The evaluation shows that an agent using only the mcp-neo4j-cypher interface can effectively answer complex natural language questions over graph data. Across a benchmark of around 200 questions, the agent achieved an average score of 0.71, with performance dropping as question complexity increased. The presence of noise in the input significantly reduced accuracy, revealing the agent’s sensitivity to typos in named entities and such.

On the tool usage side, the agent averaged 3.6 tool calls per question. This is consistent with the current requirement to make at least one call to fetch the schema and another to execute the main Cypher query. Most queries fell within a 2–4 call range, showing the agent’s ability to reason and act efficiently. Notably, a small number of questions were answered with just one or even zero tool calls, anomalies that may suggest early stopping, incorrect planning, or agent bugs, and are worth further analysis. Looking ahead, tool count could be reduced further if schema access is embedded directly via MCP resources, eliminating the need for an explicit schema fetch step.

The real value of having a benchmark is that it opens the door to systematic iteration. Once baseline performance is established, you can start tweaking parameters, observing their impact, and making targeted improvements. For instance, if agent execution is costly, you might want to test whether capping the number of allowed steps to 10 using a LangGraph recursion limit has a measurable effect on accuracy. With the benchmark in place, these trade-offs between performance and efficiency can be explored quantitatively rather than guessed.
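In LangGraph, capping the number of steps is a one-line configuration change. The snippet below reuses the agent and question names from the earlier sketch and runs inside the same async context.

```python
# Cap the agent at 10 graph steps using LangGraph's recursion limit.
# (Reuses `agent` and `question` from the earlier setup sketch.)
result = await agent.ainvoke(
    {"messages": [("user", question)]},
    config={"recursion_limit": 10},
)
```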

mcp-neo4j-cypher evaluation with max 10 steps. Image by author.

With a 10-step limit in place, performance dropped noticeably. The mean evaluation score fell to 0.535. Accuracy decreased sharply on more complex (3-hop+) questions, suggesting the step limit cut off deeper reasoning chains. Noise continued to degrade performance, with noisy questions averaging lower scores than clean ones.

Summary

We’re living in an exciting moment for AI, with the rise of autonomous agents and emerging standards like MCP dramatically expanding what LLMs can do, especially when it comes to structured, multi-step tasks. But while the capabilities are growing fast, robust evaluation is still lagging behind. That’s where this GRAPE project comes in.

The goal is to build a practical, evolving benchmark for graph-based question answering using the MCP interface. Over time, I plan to refine the dataset, experiment with different retrieval strategies, and explore how to extend or adapt the Cypher MCP for better accuracy. There’s still a lot of work ahead, from cleaning data and improving retrieval to tightening evaluation. However, having a clear benchmark means we can track progress meaningfully, test ideas systematically, and push the boundaries of what these agents can reliably do.
