Judging with Confidence: Meet PGRM, the Promptable Reward Model

AI is transforming how businesses operate, but ensuring your AI systems are truly helpful, safe, and aligned with your requirements remains a major challenge—especially as you put them into production at scale. Manual review is slow and expensive, while existing monitoring tools can be rigid, inefficient, or lack transparency. What if you could reliably monitor, evaluate, and control your AI’s behavior with a single, adaptable tool—no deep expertise required?

That’s where Databricks’ new Prompt-Guided Reward Model (PGRM) comes in. Think of PGRM as your AI’s quality control inspector—one that can instantly adapt to new rules, flag uncertain cases for review, and provide clear, confidence-backed scores for every decision. It’s as flexible as an LLM judge, but as efficient and calibrated as a purpose-built classifier. Whether you want to enforce safety guidelines, ensure factual accuracy, or align outputs with your brand, PGRM makes it possible to do so at scale and with transparency.

Why does this matter? With PGRM, you can:

  • Unify your LLM guardrails and evaluation with a single adaptable prompt
  • Focus your experts’ time where it matters most
  • Adapt oversight as your needs evolve—without retraining from scratch

Not only that, but PGRM can also power advanced reward modeling workflows—helping you automatically surface the best responses from your AI, fine-tune models to your specific needs with reinforcement learning, and drive continuous improvement with far less manual effort.

PGRM provides the best of both an LLM judge and a reward model. As an LLM judge, it achieves an average accuracy of 83.3% in our internal benchmarks measuring judgment quality, matching GPT-4o (83.6%) across key evaluation tasks like answer correctness and faithfulness to context. As a reward model, on RewardBench2, a challenging new public benchmark for reward modeling, PGRM ranks as the #2 sequential classifier and #4 overall, with an overall score of 80.0—outpacing most dedicated reward models and even surpassing frontier LLMs like GPT-4o (64.9) and Claude 4 Opus (76.5) in fine-grained reward assessment. This makes PGRM the first model to deliver state-of-the-art results in both instructable judging and high-precision reward modeling without compromising efficiency.

Now, let’s take a closer look at how PGRM bridges the gap between traditional reward models and flexible LLM judges, and what that means for building trustworthy AI.

PGRM: A New, Instructable Hybrid

The need for scalable oversight of AI behavior has never been greater. The most common automated solution to this problem is using an LLM to “judge” whether your AI system has behaved properly according to a set of guidelines. This judge approach leans on LLMs’ ability to follow diverse natural language instructions, for instance, by giving the LLM judge a rubric that explains how to grade various inputs. Want to know if an output is “safe,” “truthful,” or “on-brand”? Just change the rubric. However, LLM judges are costly and are notoriously bad at estimating their own confidence in the accuracy of their judgments.

What about reward models (RMs)? These are a specialized type of classifier trained to predict how a human would rate an AI response. RMs are typically used to align foundation models with human preferences in techniques like RLHF. They are efficient and scalable, since they don’t need to generate any outputs, and are useful for test-time compute, surfacing the best response among many generated by your AI. Unlike LLM judges, they are calibrated: in addition to generating a prediction, they also accurately guess how certain or uncertain they are about whether that prediction is right. But they usually aren’t part of the conversation when it comes to things like evaluation or monitoring, arguably because they lack the instructability of an LLM judge. Instead, each RM is tuned to a fixed specification or set of criteria—updating or steering its definition of “good” means expensive retraining from scratch. For this reason, RMs are usually only considered for RLHF, test-time compute workflows like best-of-N, or RL fine-tuning methods like TAO.
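
To make the best-of-N idea concrete, here is a minimal sketch of how a reward model can surface the top response among candidates. The `score_response` callable is a hypothetical stand-in for whatever RM endpoint you use, and the toy scorer at the end exists only so the snippet runs end to end.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              candidates: List[str],
              score_response: Callable[[str, str], float]) -> str:
    """Return the candidate that the reward model scores highest for this prompt."""
    scored = [(score_response(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Toy scorer that simply prefers longer answers, purely for illustration.
toy_scorer = lambda prompt, response: float(len(response))
print(best_of_n("What is the capital of France?",
                ["Paris.", "The capital of France is Paris."],
                toy_scorer))
```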

We developed PGRM because judging and reward modeling are two sides of the same coin, despite often being treated as separate. PGRM bridges this gap by packaging an LLM judge in the form of an RM. The result is a model that brings together the best of both worlds – the speed and calibration of an RM with the instructability of an LLM judge – a hybrid that unlocks new potential on both fronts.

              Reward Models   LLM Judges   PGRM
Instructable        ✗              ✓         ✓
Scalable            ✓              ✗         ✓
Calibrated          ✓              ✗         ✓

Let’s define some of these key concepts. Instructable means that the system accepts arbitrary natural language instructions describing how an example should be scored or judged. As a simple example, “What is the capital of France? Paris.” may be good if the guideline is ‘be correct’ but bad if the guideline is ‘answer in complete sentences’. Instructable systems let you define these rules. Scalable approaches are those that avoid the overhead associated with LLMs (i.e., the time and cost of generating text). Finally, calibrated means that the system not only judges something as good or bad, but also conveys how confident it is in that judgment. Good calibration is useful for many tasks, such as prioritizing which LLM outputs are most likely to be problematic and identifying the best response among a set of candidates. It also adds a layer of interpretability and control in the context of evaluation. PGRM combines all of these features into one model.
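
To illustrate how these properties fit together, here is a hedged sketch of what an instructable, calibrated scorer looks like in use. The `pgrm_score` function is a hypothetical placeholder, not a published PGRM API; the point is that the same response can pass one guideline and fail another, and that a calibrated score doubles as confidence in the verdict.

```python
def pgrm_score(guideline: str, prompt: str, response: str) -> float:
    """Hypothetical placeholder for a real scoring endpoint.

    Assumed to return a calibrated score in [0, 1], where higher means the
    response better satisfies the guideline.
    """
    raise NotImplementedError("wire this up to your actual scoring endpoint")

def judge(guideline: str, prompt: str, response: str, threshold: float = 0.5) -> dict:
    score = pgrm_score(guideline, prompt, response)
    verdict = "pass" if score >= threshold else "fail"
    # Because the score is calibrated, it doubles as confidence in the verdict.
    confidence = score if verdict == "pass" else 1.0 - score
    return {"verdict": verdict, "score": score, "confidence": confidence}

# The same response can pass one guideline and fail another:
# judge("Be factually correct.", "What is the capital of France?", "Paris.")
# judge("Answer in complete sentences.", "What is the capital of France?", "Paris.")
```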

Putting PGRM to Work

PGRM unlocks a new toolkit for AI on Databricks and adds a new level of customization to RM-based methods for improving your AI systems. Here’s how PGRM could reshape the AI development lifecycle:

  • Simplified Oversight: Imagine managing both a guardrail and a judge with a single, tunable prompt. PGRM’s instructability means you can focus your evaluation efforts and keep your AI aligned with evolving business rules—all with one prompt.
  • Targeted Quality Triage and Smarter Labeling: PGRM’s calibrated confidence scores help you zero in on the ambiguous cases that need expert attention (see the sketch after this list). That means less wasted effort reviewing your AI system and faster curation of high-quality datasets.
  • Domain-Expert Alignment: Easily tune what counts as a “good” or “bad” response to match your organization’s standards. PGRM’s tunable score helps ensure automated judgments stay in sync with your experts, building trust and improving accuracy.
  • Continuous Model Improvement: Leverage PGRM’s reward modeling capabilities to automatically surface and promote the best AI responses during TAO, with full control over what “best” means. By fine-tuning your models with PGRM, you can drive targeted improvements in quality, safety, and alignment.
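
Below is the triage sketch referenced in the second bullet: with calibrated scores, ambiguity is just distance from the decision boundary, so routing examples to reviewers becomes a simple sort. The records and scores here are made up for illustration; in practice they would come from a calibrated scorer such as PGRM.

```python
# Hypothetical judged records with calibrated pass scores in [0, 1].
records = [
    {"id": "a", "score": 0.97},  # confidently good
    {"id": "b", "score": 0.52},  # borderline: review first
    {"id": "c", "score": 0.08},  # confidently bad
    {"id": "d", "score": 0.44},  # borderline: review second
]

def review_priority(record: dict) -> float:
    # Smaller distance from the 0.5 decision boundary = more ambiguous.
    return abs(record["score"] - 0.5)

for r in sorted(records, key=review_priority):
    print(r["id"], r["score"])
```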

Benchmarking PGRM as a Judge

PGRM provides a judging system that is as adaptable as an LLM, but as practical and efficient as a purpose-built reward model. In contrast to reward models, a “judge” is not a type of model – it’s essentially a set of instructions provided to a standard LLM. That is, you typically create a judge by instructing an LLM to evaluate a response according to some criteria. Therefore, judging responses across a variety of quality dimensions requires a model that can follow instructions. Standard RMs don’t meet that requirement, so typical practice is to resort to LLM judges. PGRM, however, is an RM designed to handle instructions like a judge.
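
To make the contrast concrete, here is a rough sketch of the two shapes of judging. The rubric template and the `call_llm` / `score_with_pgrm` callables are hypothetical stand-ins, not actual APIs; the point is that an LLM judge wraps instructions around text generation, while an instructable RM takes the same instructions but returns a score from a single scoring pass.

```python
# Both callables passed in below are hypothetical stand-ins.
JUDGE_TEMPLATE = """You are evaluating an AI response against a guideline.

Guideline: {guideline}
Question: {question}
Response: {response}

Answer with a single word: PASS or FAIL."""

def llm_judge(call_llm, guideline: str, question: str, response: str) -> bool:
    # LLM judge: wraps the instructions in a prompt and generates text, so you
    # pay generation cost and get a verbal verdict with no calibrated score.
    verdict = call_llm(JUDGE_TEMPLATE.format(
        guideline=guideline, question=question, response=response))
    return verdict.strip().upper() == "PASS"

def rm_judge(score_with_pgrm, guideline: str, question: str, response: str) -> float:
    # Instructable RM: same guideline, but a single scoring pass that returns
    # a calibrated score instead of generated text.
    return score_with_pgrm(guideline, question, response)
```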

To demonstrate that PGRM can handle the type of judgment tasks required for evaluating and monitoring AI systems, we compare its judgment accuracy against that of GPT-4o across a handful of tasks; specifically, the same tasks powering our MLflow evaluation product.

This plot shows the average and per-task accuracies of PGRM and GPT-4o across our internal benchmark. Each task is defined by a specific instruction asking the model to judge a given response in a particular way. For instance, Answer Correctness requires the model to determine whether the response agrees with a pre-verified ground truth, and Faithfulness asks whether the response is supported by the available context. As shown, PGRM achieves near parity with GPT-4o, effectively matching the judgment quality of a frontier LLM.

Judging with Confidence

As an instructable reward model, PGRM matches the judgment capabilities of a powerful LLM while introducing scalability and calibration. An LLM judge can offer a good pass/fail judgment, but will not reliably indicate its confidence. As a model fundamentally built for classification, PGRM’s scores naturally indicate its confidence in its verdict, with more extreme scores indicating higher certainty.

The figure on the left illustrates calibration. We overlay two histograms: PGRM scores for benchmark examples where the ground-truth verdict was “pass” (green) and those where it was “fail” (orange). We can measure the ratio of pass/fail examples in each score bucket (red) and compare it to what we’d expect from a perfectly calibrated classifier (black), observing a close correspondence. In other words, when PGRM tells you that its confidence is 70%, it will be correct about 70% of the time.
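
The calibration check itself is easy to reproduce. The sketch below bins predicted scores and compares each bin’s observed pass rate to its mean predicted score; the data is synthetic, drawn from an idealized calibrated scorer, purely to show the bookkeeping.

```python
import random

random.seed(0)

# Synthetic (score, passed) pairs from an idealized calibrated scorer:
# an example with score s passes with probability s.
data = []
for _ in range(10_000):
    s = random.random()
    data.append((s, random.random() < s))

n_bins = 10
for b in range(n_bins):
    lo, hi = b / n_bins, (b + 1) / n_bins
    bucket = [(s, passed) for s, passed in data if lo <= s < hi]
    if not bucket:
        continue
    mean_score = sum(s for s, _ in bucket) / len(bucket)
    pass_rate = sum(passed for _, passed in bucket) / len(bucket)
    print(f"[{lo:.1f}, {hi:.1f}): predicted={mean_score:.2f}  observed={pass_rate:.2f}")
```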

In contrast, LLMs are well known for being capable classifiers but poor at reporting their own confidence. They can judge pass/fail accurately, but give no sense of how close a given judgment was to the decision boundary. Interestingly, however, we find that for examples where PGRM is least confident, GPT-4o is also least accurate. This is captured in the figure on the right. It suggests that PGRM and GPT-4o are picking up on the same sources of ambiguity or difficulty, but only PGRM makes these cases identifiable.

This isn’t just a neat property of PGRM; it introduces important new functionality as a judge. For one, well-calibrated confidence scores let you distinguish obvious failures in your AI system from borderline ones, making it easier to identify high-priority examples for further review. In addition, recalibrating PGRM to be more conservative or more permissive is simply a matter of picking the pass/fail score threshold that best suits your application. In contrast, because LLMs do not externalize their confidence, calibrating them has to be done at the prompt level, requiring either additional prompt engineering (harder than it sounds) or few-shot demonstrations (making them even more expensive to run).
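
As a sketch of what picking an operating point can look like in practice, the snippet below chooses the lowest threshold whose “pass” precision meets a target on a small labeled validation set. The scores, labels, and helper function are illustrative assumptions, not part of any PGRM API.

```python
def pick_threshold(scores, labels, min_precision=0.95):
    """Return the lowest threshold whose 'pass' precision meets the target."""
    for t in sorted(set(scores)):
        passed = [(s, y) for s, y in zip(scores, labels) if s >= t]
        if not passed:
            break
        precision = sum(y for _, y in passed) / len(passed)
        if precision >= min_precision:
            return t
    return 1.0  # no threshold meets the target: fail everything

# Toy validation data: calibrated scores and human pass/fail labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, True, False, True, False, False, False]
print(pick_threshold(scores, labels, min_precision=0.75))
```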

Benchmarking RM Quality on RewardBench2

PGRM lets us look at judging and reward modeling as two sides of the same coin. In both cases, we’re essentially trying to measure how good an AI’s response is, but in reward modeling the emphasis is on measuring that quality with a high degree of precision. At a high level, RMs need to be able to surface the best response from a set of candidates. RewardBench2 is the latest benchmark designed to measure exactly that ability. As of this writing, PGRM ranks as the second-best sequential classifier and fourth overall on the RewardBench2 leaderboard.

This plot shows the per-subset and overall performance of several models on RewardBench2. PGRM is competitive with Skywork-Reward-V2-Llama-3.1-8B, the leading model, and outranks all other sequential classifiers. It’s worth emphasizing that GPT-4o performs poorly as a reward model, demonstrating that LLMs are simply not trained to produce well-calibrated scores. They are useful for coarse judgment (i.e., pass/fail), but aren’t the right tool for the job when you need something more fine-grained.

What’s Next

By bringing together reward modeling and judging, PGRM lets us ask more from each: RM-based fine-tuning with rewards tailored to your specific requirements, replacing generic notions of “good responses” with ones that actually reflect what you care about; judges that let you monitor your AI agents at scale; customizable guardrail models efficient enough to run alongside your agents online. PGRM opens the door to all of these.

We’re already using PGRM to power our research & products. For instance, within Agent Bricks Custom LLM, we use PGRM as the reward model when doing TAO fine-tuning. So, thanks to PGRM, Agent Bricks lets you build a high-quality model that’s optimized for your task and guidelines, even without labeled data. And this is just one of many applications we envision.

PGRM represents just the first step in this direction and inspires a new agenda of research in steerable reward modeling. At Databricks, we’re looking forward to extending PGRM in a few exciting directions. By modifying the training recipe, we can teach PGRM to perform fine-grained, token-level judgments, making it a particularly powerful tool when applied at inference time, for guardrails, value-guided search, and more! In addition, we’re exploring ways to bring test-time compute to PGRM itself, in the form of novel architectures that combine reasoning and calibrated judgment.

If you’re interested in trying out PGRM for your use case, fill out this form and our team will be in touch.
