
Rubrics as Rewards (RaR): A Reinforcement Learning Framework for Training Language Models with Structured, Multi-Criteria Evaluation Signals

Reinforcement Learning with Verifiable Rewards (RLVR) enables LLMs to perform complex reasoning on tasks with clear, verifiable outcomes, delivering strong results in mathematics and coding. However, many real-world scenarios lack such explicit verifiable answers, which makes it hard to train models without a direct reward signal. Current methods address this gap through RLHF via preference ranking, where human judgments are collected over pairs or lists of model outputs. Preference-based reward models can boost performance in the early stages of training, but they tend to overfit to superficial artifacts such as response length, formatting quirks, and annotator biases. They also require large volumes of pairwise comparisons, making them brittle and costly.

RLVR methods now extend beyond mathematics and coding, with GENERAL-REASONER demonstrating strong performance in physics, finance, and policy and achieving a ten-point gain on MMLU-Pro through GRPO fine-tuning. Rubric-based evaluation has become a standard for advanced LLMs, with frameworks like HEALTHBENCH pairing clinician-written criteria with automated judges to evaluate factuality, safety, and empathy. However, these rubrics appear only during evaluation rather than training. Process supervision methods aim to provide more granular feedback by rewarding intermediate reasoning steps, using MCTS-generated labels and generative reward models such as THINKPRM.

Researchers from Scale AI have proposed Rubrics as Rewards (RaR), an on-policy reinforcement learning framework that uses checklist-style rubrics to guide multi-criteria tasks. The method generates prompt-specific rubrics based on carefully designed principles: each rubric outlines clear standards for high-quality responses and provides human-interpretable supervision signals. The framework is applied to the medicine and science domains, yielding two specialized training datasets, RaR-Medicine-20k and RaR-Science-20k. By transforming rubrics into structured reward signals, RaR enables smaller judge models to achieve superior alignment with human preferences while maintaining robust performance across model scales.

Researchers used LLMs as expert proxies to generate these rubrics, ensuring adherence to four desiderata: grounding in expert guidance, comprehensive coverage, semantic weighting, and self-contained evaluation. For each domain, specialized prompts instruct the LLM to generate 7-20 rubric items depending on the complexity of the input question. Each item is assigned a categorical weight, such as Essential Criteria or Important Criteria, that determines its significance for a correct answer. Training uses the GRPO algorithm with Qwen2.5-7B as the base policy model, and the training pipeline operates through three core components: Response Generation, Reward Computation, and Policy Update, as sketched below.
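To illustrate how these three components might fit together, here is a minimal Python sketch. It is an assumption-laden illustration rather than the authors' implementation: the `RubricItem` schema, the dummy judge score, and the reduction of the GRPO update to its group-normalized advantage computation are all simplifications introduced here.

```python
import statistics
from dataclasses import dataclass

# --- Rubric representation (hypothetical schema) ---------------------------
@dataclass
class RubricItem:
    criterion: str   # e.g. "States the correct first-line treatment"
    category: str    # categorical weight label, e.g. "Essential" or "Important"

# --- 1) Response Generation: sample a group of candidates from the policy --
def generate_group(prompt: str, group_size: int = 4) -> list[str]:
    # Placeholder for sampling from the Qwen2.5-7B policy model.
    return [f"candidate answer {i} to: {prompt}" for i in range(group_size)]

# --- 2) Reward Computation: score a response against its rubric ------------
def rubric_reward(response: str, rubric: list[RubricItem]) -> float:
    # Placeholder for an LLM judge scoring the response against the rubric;
    # the implicit/explicit aggregation choices are sketched further below.
    return min(1.0, len(response) / 100)  # dummy score in [0, 1]

# --- 3) Policy Update: GRPO-style group-normalized advantages --------------
def group_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    rubric = [RubricItem("mentions dosage", "Essential"),
              RubricItem("cites a guideline", "Important")]
    group = generate_group("How should condition X be managed?")
    rewards = [rubric_reward(r, rubric) for r in group]
    print(group_advantages(rewards))  # fed into the GRPO objective
```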

The RaR-Implicit method outperforms baselines such as Simple-Likert, with the best variant achieving up to a 28% relative improvement on HealthBench-1k and 13% on GPQA. It also outperforms both the base and instruction-tuned policy models, demonstrating the effectiveness of rubric-guided training for nuanced response evaluation while matching or exceeding the Reference-Likert baseline. Beyond raw metrics, rubric-guided evaluation provides clearer and more accurate signals across model scales, assigning preferred responses the appropriate ratings more often. Expert guidance also proves essential for synthetic rubric generation: rubrics developed with reference answers achieve higher accuracy than those produced without human insight.
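The implicit and explicit aggregation strategies referenced here differ in where the rubric weighting happens: explicit aggregation scores each criterion separately and combines the per-item verdicts numerically, while implicit aggregation hands the full rubric to the judge and asks for a single holistic score. The sketch below illustrates that contrast under stated assumptions; `call_judge` is a hypothetical stand-in for an LLM-judge API, and the prompt wording, score scale, and numeric weights are not taken from the paper.

```python
def call_judge(prompt: str) -> str:
    """Hypothetical LLM-judge call; replace with a real model client."""
    return "0.5"

def explicit_reward(response: str, rubric: list[dict]) -> float:
    """Score each rubric item independently, then take a weighted average."""
    total = sum(item["weight"] for item in rubric)
    earned = 0.0
    for item in rubric:
        verdict = call_judge(
            "Does the response satisfy this criterion? Answer yes or no.\n"
            f"Criterion: {item['criterion']}\nResponse: {response}"
        )
        if verdict.strip().lower().startswith("yes"):
            earned += item["weight"]
    return earned / total if total else 0.0

def implicit_reward(response: str, rubric: list[dict]) -> float:
    """Show the judge the whole rubric and ask for one holistic score in [0, 1]."""
    checklist = "\n".join(f"- ({item['category']}) {item['criterion']}"
                          for item in rubric)
    verdict = call_judge(
        f"Rubric:\n{checklist}\n\nResponse:\n{response}\n"
        "Return a single overall score between 0 and 1."
    )
    return float(verdict)

rubric = [{"criterion": "mentions dosage", "category": "Essential", "weight": 1.0},
          {"criterion": "cites a guideline", "category": "Important", "weight": 0.5}]
print(explicit_reward("some answer", rubric), implicit_reward("some answer", rubric))
```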

In summary, researchers introduced RaR, which advances post-training of language models by using structured, checklist-style rubrics as reward signals. It offers stable training signals while maintaining human interpretability and alignment. However, the research remains limited to the medical and science domains and requires validation on tasks such as open-ended dialogue. The researchers explored only two reward aggregation strategies, implicit and explicit, leaving alternative weighting schemes unexplored. They also did not conduct a controlled analysis of reward-hacking risks, and the reliance on off-the-shelf LLMs as judges suggests that future work could benefit from dedicated evaluators with enhanced reasoning capabilities.


Check out the Paper here. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
