If you are building applications with LLMs, you’ve probably run into this challenge: how do you evaluate the quality of the AI system’s output?
Say, you want to check whether a response has the right tone. Or whether it’s safe, on-brand, helpful, or makes sense in the context of the user’s question. These are all examples of qualitative signals that are not easy to measure.
The issue is that these qualities are often subjective. There is no single “correct” answer. And while humans are good at judging them, humans don’t scale. If you are testing or shipping LLM-powered features, you will eventually need a way to automate that evaluation.
LLM-as-a-judge is a popular method for doing this: you prompt an LLM to evaluate the outputs of another LLM. It’s flexible, fast to prototype, and easy to plug into your workflow.
But there is a catch: your LLM judge is also not deterministic. In practice, it’s like running a small machine learning project, where the goal is to replicate expert labels and decisions.
In a way, what you are building is an automated labeling system.
That means you must also evaluate the evaluator to check whether your LLM judge aligns with human judgment.
In this blog post, we will show how to create and tune an LLM evaluator that aligns with human labels – not just how to prompt it, but also how to test and trust that it’s working as expected.
We will finish with a practical example: building a judge that scores the quality of code review comments generated by an LLM.
Disclaimer: I am one of the creators of Evidently, an open-source tool that we’ll be using in this example. We will use the free and open-source functionality of the tool. We will also mention the use of OpenAI and Anthropic models as LLM evaluators. These are commercial models, and it will cost a few cents in API calls to reproduce the example. (You can also replace them with open-source models.)
What is an LLM evaluator?
An LLM evaluator – or LLM-as-a-judge – is a popular technique that uses LLMs to assess the quality of outputs from AI-powered applications.
The idea is simple: you define the evaluation criteria and ask an LLM to be the “judge.” Say, you have a chatbot. You can ask an external LLM to evaluate its responses, looking at things like relevance, helpfulness, or coherence – similar to what a human evaluator can do. For example, each response can be scored as “good” or “bad,” or assigned to any specific category based on your needs.
Using an LLM to evaluate another LLM might sound counterintuitive at first. But in practice, judging is often easier than generating. Creating a high-quality response requires understanding complex instructions and context. Evaluating that response, on the other hand, is a more narrow, focused task – and one that LLMs can handle surprisingly well, as long as the criteria are clear.
Let’s look at how it works!
How to create an LLM evaluator?
Since the goal of an LLM evaluator is to scale human judgments, the first step is to define what you want to evaluate. This will depend on your specific context – whether it’s tone, helpfulness, safety, or something else.
While you can write a prompt upfront to express your criteria, a more robust approach is to act as the judge first. You can start by labeling a dataset the way you would want the LLM evaluator to behave later. Then treat those labels as your target and try writing the evaluation prompt to match them. This way, you will be able to measure how well your LLM evaluator aligns with human judgment.
That’s the core idea. We will walk through each step in more detail below.

Step 1: Define what to judge
The first step is to decide what you’re evaluating.
Sometimes this is obvious. Say, you’ve already observed a specific failure mode when analyzing the LLM responses – e.g., a chatbot refusing to answer or repeating itself – and you want to build a scalable way to detect it.
Other times, you’ll need to first run test queries and label your data manually to identify patterns and develop generalizable evaluation criteria.
It’s important to note: you don’t have to create one catch-all LLM evaluator. Instead, you can create multiple “small” judges, each focusing on a specific pattern or evaluation flow. For example, you can use LLM evaluators to:
- Detect failure modes, like refusals to answer, repetitive answers, or missed instructions.
- Calculate proxy quality metrics, including faithfulness to context, relevance to the answer, or correct tone.
- Run scenario-specific evaluations, like testing how the LLM system handles adversarial inputs, brand-sensitive topics, or edge cases. These test-specific LLM judges can check for correct refusals or adherence to safety guidelines.
- Analyze user interactions, like classifying responses by topic, query type, or intent.
The key is scoping each evaluator narrowly, as well-defined, specific tasks are where LLMs excel.
Step 2: Label the data
Before you ask an LLM to make judgments, you need to be the judge yourself.
You can manually label a sample of responses. Or you can create a simple labeling judge and then review and correct its labels. This labeled dataset will be your “ground truth” that reflects your preferred judgment criteria.
As you do this, keep things simple:
- Stick to binary or few-class labels. While a 1-10 scale might seem appealing, complex rating scales are difficult to apply consistently.
- Make your labeling criteria clear enough for another human to follow them.
For example, you can label the responses on whether the tone is “acceptable”, “not acceptable” or “borderline”.
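For illustration, a labeled sample might look like the table below (a minimal sketch; the column names and examples are made up):

```python
import pandas as pd

# A hypothetical labeled sample: each row is a model response plus a human verdict.
labeled_sample = pd.DataFrame(
    {
        "response": [
            "Sure! To reset your password, go to Settings > Security.",
            "I cannot help with that.",
            "As an AI, I... As an AI, I... As an AI, I...",
        ],
        "label": ["acceptable", "borderline", "not acceptable"],
        "comment": [
            "Clear, on-topic, polite.",
            "Refusal without explanation or alternative.",
            "Repetitive, no useful content.",
        ],
    }
)
```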

Step 3: Write the evaluation prompt
When you know what you are looking for, it’s time to build the LLM evaluator! Evaluation prompts are the core of your LLM judge.
The core idea is that you should write this evaluation prompt yourself. This way, you can tailor it to your use case and use your domain knowledge to write better instructions than a generic prompt would give you.
If you use a tool with built-in prompts, you should test them against your labeled data first to ensure the rubric aligns with your expectations.
You can think of writing prompts as giving instructions to an intern doing the task for the first time. Your goal is to make your instructions clear and specific, and to provide examples of what “good” and “bad” mean for your use case – clear enough that another human could follow them.
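For example, a minimal evaluation prompt for the tone check from Step 2 could look like this (a generic sketch, not a ready-made template):

```python
# A generic sketch of an evaluation prompt for a tone check.
# The wording and the {response} placeholder are illustrative, not a fixed template.
TONE_JUDGE_PROMPT = """You are evaluating the tone of a customer support response.

Label the response as:
- ACCEPTABLE: polite, professional, and helpful in wording.
- NOT ACCEPTABLE: dismissive, sarcastic, or harsh, even if factually correct.
- BORDERLINE: neither clearly acceptable nor clearly unacceptable.

Example of ACCEPTABLE: "Thanks for flagging this - here is how to fix it..."
Example of NOT ACCEPTABLE: "This is a basic question. Read the docs."

Return one label and a one-sentence explanation.

Response to evaluate:
{response}
"""
```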
Step 4: Evaluate and iterate
Once your evaluation prompt is ready, run it across your labeled dataset and compare the outputs against the “ground truth” human labels.
To evaluate the quality of the LLM evaluator, you can use agreement metrics, like Cohen’s Kappa, or classification metrics, like accuracy, precision, and recall.
Based on the evaluation results, you can iterate on your prompt: look for patterns to identify areas for improvement, adjust the judge and re-evaluate its performance. Or you can automate this process through prompt optimization!
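As a minimal sketch, the comparison itself is straightforward to run with scikit-learn (the label lists below are made up for illustration):

```python
from sklearn.metrics import accuracy_score, classification_report, cohen_kappa_score

# Hypothetical labels: human "ground truth" vs. what the LLM judge returned.
human_labels = ["good", "bad", "bad", "good", "bad"]
judge_labels = ["good", "bad", "good", "good", "bad"]

print("Accuracy:", accuracy_score(human_labels, judge_labels))
print("Cohen's Kappa:", cohen_kappa_score(human_labels, judge_labels))
print(classification_report(human_labels, judge_labels))
```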
Step 5: Deploy the evaluator
Once your judge is aligned with human preferences, you can put it to work, replacing manual review with automated labeling through the LLM evaluator.
For example, you can use it during prompt experiments to fix a specific failure mode. Say, you observe a high rate of refusals, where your LLM chatbot frequently denies the user queries it should be able to answer. You can create an LLM evaluator that automatically detects such refusals to answer.
Once you have it in place, you can easily experiment with different models, tweak your prompts, and get measurable feedback on whether your system’s performance gets better or worse.
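As a rough sketch, wiring such a refusal detector into an experiment loop could look like this (the `judge_refusal` helper is a placeholder for whatever evaluator you built):

```python
# Hypothetical experiment loop: score each prompt variant by its refusal rate,
# using an LLM judge instead of manual review.
def judge_refusal(response: str) -> bool:
    # Placeholder: in practice, this would call your LLM evaluator.
    # A crude keyword check stands in here for illustration only.
    return response.lower().startswith(("i can't", "i cannot", "sorry, i"))

def refusal_rate(responses: list[str]) -> float:
    flags = [judge_refusal(r) for r in responses]
    return sum(flags) / len(flags)

# Compare prompt variants on the same set of test queries.
experiments = {
    "prompt_v1": ["I cannot help with that.", "Here is how to do it..."],
    "prompt_v2": ["Here is how to do it...", "Sure, follow these steps..."],
}
for name, responses in experiments.items():
    print(name, refusal_rate(responses))
```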
Code tutorial: evaluating the quality of code reviews
Now, let’s apply the process we discussed to a real example, end-to-end.
We will create and evaluate an LLM judge to assess the quality of code reviews. Our goal is to create an LLM evaluator that aligns with human labels.
In this tutorial, we will:
- Define the evaluation criteria for our LLM evaluator.
- Build an LLM evaluator using different prompts/models.
- Evaluate the quality of the judge by comparing results to human labels.
We will use Evidently, an open-source LLM evaluation library with over 25 million downloads.
Let’s get started!
Complete code: follow along with this example notebook.
Prefer video? Watch the video tutorial.
Preparation
To start, install Evidently and run the necessary imports:
!pip install evidently[llm]
You can see the complete code in the example notebook.
You will also need to set up your API keys for LLM judges. In this example, we will use OpenAI and Anthropic as the evaluator LLMs.
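Setting the keys as environment variables in the notebook might look like this (a minimal sketch):

```python
import os
from getpass import getpass

# Provide the API keys for the evaluator LLMs (entered interactively here).
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
os.environ["ANTHROPIC_API_KEY"] = getpass("Anthropic API key: ")
```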
Dataset and evaluation criteria
We will use a dataset that comprises 50 code reviews with expert labels – 27 “bad” and 23 “good” examples. Each entry includes:
- Generated review text
- Expert label (good/bad)
- Expert comment explaining the reasoning behind the assigned label.

The dataset used in the example was generated by the author and is available here.
This dataset is an example of the “ground truth” dataset you can curate with your product experts: it shows how a human judges the responses. Our goal is to create an LLM evaluator that returns the same labels.
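Loading and inspecting the dataset might look roughly like this (the file name and column names are illustrative; see the example notebook for the exact ones):

```python
import pandas as pd

# Illustrative file and column names; the notebook uses the actual dataset file.
reviews = pd.read_csv("code_review_dataset.csv")
print(reviews.columns.tolist())                # e.g. ["review_text", "expert_label", "expert_comment"]
print(reviews["expert_label"].value_counts())  # expect roughly 27 "bad" and 23 "good"
```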
If you analyze the human expert comments, you will notice that the reviews are primarily judged on actionability – Do they provide actual guidance? – and tone – Are they constructive rather than harsh?
Our goal in creating the LLM evaluator is to generalize these criteria into a prompt.
Initial prompt and interpretation
Let’s start with a basic prompt. Here is how we express our criteria:
A review is GOOD when it’s actionable and constructive.
A review is BAD when it’s non-actionable or overly critical.
In this case, we use an Evidently LLM evaluator template, which takes care of generic parts of the evaluator prompt – like asking for classification, structured output, and step-by-step reasoning – so we only need to express the actual criteria and give the target labels.
We will use GPT-4o mini as an evaluator LLM. Once we have the final prompt, we will run the LLM evaluator over the generated reviews and compare the good/bad labels it returns against the expert ones.
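For intuition, here is a hand-rolled sketch of what such a judge boils down to: a classification prompt sent to the evaluator model. This is not the Evidently template used in the tutorial (which also handles structured output and reasoning); the function and prompt below are purely illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating the quality of a code review comment.

A review is GOOD when it is actionable and constructive.
A review is BAD when it is non-actionable or overly critical.

Return a single word: GOOD or BAD.

Review to evaluate:
{review}
"""

def judge_review(review_text: str) -> str:
    # Ask the evaluator model to classify a single review.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(review=review_text)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```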
To see how well our naive evaluator matches the expert labels, we will look at classification metrics like accuracy, precision, and recall. We will visualize the results using the Classification Report in the Evidently library.

As we can see, only 67% of the judge labels matched the labels given by human experts.
The 100% precision score means that when our evaluator identified a review as “bad,” it was always correct. However, the low recall indicates that it missed many problematic reviews – our LLM evaluator made 18 errors.
Let’s see if we can do better with a more detailed prompt!
Experiment 2: more detailed prompt
We can look closer at the expert comments and specify what we mean by “good” and “bad” in more detail.
Here’s a refined prompt:
A review is **GOOD** if it is actionable and constructive. It should:
- Offer clear, specific suggestions or highlight issues in a way that the developer can address
- Be respectful and encourage learning or improvement
- Use professional, helpful language—even when pointing out problems
A review is **BAD** if it is non-actionable or overly critical. For example:
- It may be vague, generic, or hedged to the point of being unhelpful
- It may focus on praise only, without offering guidance
- It may sound dismissive, contradictory, harsh, or robotic
- It may raise a concern but fail to explain what should be done
We made the changes manually this time, but you can also employ an LLM to help you rewrite the prompt.
Let’s run the evaluation once again:

Much better!
We got 96% accuracy and 92% recall. Being more specific about evaluation criteria is the key. The evaluator got only two labels wrong.
Although the results already look pretty good, there are a few more tricks we can try.
Experiment 3: ask to explain the reasoning
Here’s what we will do – we will use the same prompt but ask the evaluator to explain the reasoning one more time:
Always explain your reasoning.

Adding one simple line pushed performance to 98% accuracy with only one error in the entire dataset.
Experiment 4: switch models
When you are already happy with your prompt, you can try running it with a cheaper model. We use GPT-4o mini as a baseline for this experiment and re-run the prompt with GPT-3.5 Turbo. Here’s what we’ve got:
- GPT-4o mini: 98% accuracy, 92% recall
- GPT-3.5 Turbo: 72% accuracy, 48% recall

Such a difference in performance brings us to an important consideration: prompt and model work together. Simpler models may require different prompting strategies or more examples.
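In a hand-rolled setup like the earlier sketch, swapping the evaluator model is a one-line change (again, illustrative code, not the tutorial’s Evidently configuration):

```python
from openai import OpenAI

client = OpenAI()

def judge_with(model_name: str, judge_prompt: str) -> str:
    # Send the same judge prompt to a different evaluator model.
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Compare the two evaluator models on the same review (prompt text abbreviated here).
judge_prompt = "Label this code review as GOOD or BAD. ...\n\nReview: Looks fine to me."
print("gpt-4o-mini  :", judge_with("gpt-4o-mini", judge_prompt))
print("gpt-3.5-turbo:", judge_with("gpt-3.5-turbo", judge_prompt))
```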
Experiment 5: switch providers
We can also check how our LLM evaluator works with different providers – let’s see how it performs with Anthropic’s Claude.

Both providers achieved a similarly high level of accuracy, with slightly different error patterns.
The table below summarizes the results of the experiment:
| Scenario | Accuracy | Recall | # of errors |
|---|---|---|---|
| Simple prompt | 67% | 36% | 18 |
| Detailed prompt | 96% | 92% | 2 |
| “Always explain your reasoning” | 98% | 96% | 1 |
| GPT-3.5 Turbo | 72% | 48% | 13 |
| Claude | 96% | 92% | 2 |
Takeaways
In this tutorial, we went through an end-to-end workflow for creating an LLM evaluator to assess the quality of code reviews. We defined the evaluation criteria, prepared the expert-labeled dataset, crafted and refined the evaluation prompt, ran it against different scenarios, and compared the results until we aligned our LLM judge with human labels.
You can adapt this workflow to fit your specific use case. Here are some of the takeaways to keep in mind:
Be the judge first. Your LLM evaluator is there to scale human expertise. So the first step is to make sure you have clarity on what you are evaluating. Starting with your own labels on a set of representative examples is the best way to get there. Once you have them, use the labels and expert comments to determine the criteria for your evaluation prompt.
Focus on consistency. Perfect alignment with human judgment isn’t always necessary or realistic – after all, humans can also disagree with each other. Instead, aim for consistency in your evaluator’s judgments.
Consider using multiple specialized judges. Rather than creating one comprehensive evaluator, you can split the criteria into separate judges. For example, actionability and tone could be evaluated independently. This makes it easier to tune and measure the quality of each judge.
Start simple and iterate. Begin with naive evaluation prompts and gradually add complexity based on the error patterns. Your LLM evaluator is a small prompt engineering project: treat it as such, and measure the performance.
Run the evaluation prompt with different models. There is no single best prompt: your evaluator combines both the prompt and the model. Test your prompts with different models to understand performance trade-offs. Consider factors like accuracy, speed, and cost for your specific use case.
Monitor and tune. An LLM judge is a small machine learning project in itself. It requires ongoing monitoring and occasional recalibration as your product evolves or new failure modes emerge.