Understanding the Evolution of ChatGPT: Part 3— Insights from Codex and InstructGPT | by Shirley Li | Jan, 2025

Evaluation of Alignment

How to properly evaluate “alignment” is also challenging, as the definition of alignment is not as clear as other aspects such as accuracy. In this work the authors define alignment in terms of whether the models are “helpful, honest, and harmless”, and convert these criteria into more measurable properties:

  • Helpful: by measuring if the model could follow instructions and even infer intentions from a few-shot prompt.
  • Honest: by measuring truthfulness, or in the authors’ words, “if the model’s statements about the world are true”. More specifically, they measure it by the hallucination rate on closed-domain tasks and by performance on the TruthfulQA dataset.
  • Harmless: by measuring “if an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content”, and benchmarking on datasets designed to measure bias and toxicity.

On top of that, to make sure the finetuning process does not cause severe regressions in pre-training performance, the evaluation process also needs to reflect quality on both the pre-training and finetuning objectives. For that reason, InstructGPT was evaluated on two separate kinds of data:

  • Evaluations on API distribution: this is mainly for evaluating the finetuning quality, by asking human labelers to rate which output is preferred;
  • Evaluations on public NLP datasets: this evaluates both the pre-training and finetuning quality, including traditional NLP datasets as well as datasets for evaluating model safety like truthfulness, toxicity and bias.

Next, we will briefly explain how RLHF works and how it is implemented in InstructGPT.

RLHF (Reinforcement Learning from Human Feedback)

The figure below shows the 5 elements in a typical Reinforcement Learning scenario:

Figure 7. Five elements in RL: Agent, Environment, Reward, State and Action. (Image from wiki.)

Now imagine you are teaching your puppy to sit, where you can find all the 5 elements:

  • Agent: Your puppy learning this new command “sit”.
  • Environment: Everything around your puppy.
  • State: The situation your puppy is in (whether it is sitting or not).
  • Reward: A treat that you give your puppy when it follows your command.
  • Action: What your puppy could do, like sitting, jumping or barking.

Reinforcement Learning works like this: in the beginning your puppy (agent) doesn’t understand what “sit” means, but it will try different things like running, sitting or even barking (actions) in your house (environment). Every time it sits, it gets a treat (reward). Over time your puppy learns that sitting earns a treat, and it appears to finally understand “sit”.
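To make the trial-and-error loop concrete, here is a minimal sketch of the puppy example as an epsilon-greedy learner. The function name `train_puppy`, the three actions, the exploration rate and the reward values are all illustrative assumptions, not anything taken from the InstructGPT paper.

```python
import random

def train_puppy(episodes=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error over three actions; only 'sit' earns a treat."""
    rng = random.Random(seed)
    actions = ["run", "sit", "bark"]
    value = {a: 0.0 for a in actions}  # running average reward per action
    count = {a: 0 for a in actions}
    for _ in range(episodes):
        # Sometimes explore a random action, otherwise exploit the best estimate so far.
        if rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: value[a])
        reward = 1.0 if action == "sit" else 0.0  # the treat
        count[action] += 1
        # Incremental update of the average reward observed for this action.
        value[action] += (reward - value[action]) / count[action]
    return value

values = train_puppy()
best_action = max(values, key=values.get)
print(best_action, values)
```

The puppy never gets told what “sit” means; it only sees which of its attempts were followed by a treat, and its estimate for that action rises accordingly.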

Training a model with RL follows a very similar trial-and-error approach. The key to RL is having a well-designed reward. This reward must be closely aligned with the goal; otherwise the agent will not be able to learn the desired behaviors. Meanwhile, producing such a reward should be as easy and quick as possible, since if it is too slow or too complicated to calculate the reward, the RL process will also become extremely slow, making it less useful in practical tasks.

For example, in a game, every action the agent takes will automatically get a score from the environment, and this score is directly connected to your agent’s performance in playing this game.

However, in many real-world applications, there is no ready-to-use reward like a score in a game. Instead researchers have to put great effort into defining a proper reward function. Moreover, some desired behaviors are very difficult to translate into reward functions. For example, how could you define a reward function to guide the agent to answer questions more politely?

This leads to RLHF: Reinforcement Learning from Human Feedback.

Again in the puppy training example, imagine your puppy finally learns to sit, but sometimes it also barks while sitting, or it will jump onto the couch first instead of sitting quietly on the floor.

What can you do in that case?

With RLHF, you don’t just give your puppy a treat every time it sits. Instead, you give treats by comparing its behaviors. For example, if the puppy sits quietly on the floor, it gets a bigger reward than if it sits while barking or after jumping onto the couch. This way, your puppy learns that sitting quietly on the floor is better, even though you didn’t explicitly explain what “quiet” means.

As we mentioned before, having an easy and fast reward is the key to RL, which makes it unrealistic to keep a human in the training loop to provide direct feedback on every action. To overcome this issue, we can collect some human feedback first, and then use this feedback to learn a reward function that mimics human preferences when comparing two actions.
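As a rough illustration of learning a reward from comparisons, the sketch below fits one scalar score per behavior so that preferred behaviors score higher, using a Bradley-Terry style model. The judgments, the function name `fit_reward`, the learning rate and the step count are hypothetical; a real reward model would be a neural network scoring arbitrary outputs rather than a lookup table.

```python
import math

def fit_reward(comparisons, items, lr=0.5, steps=500):
    """Fit one scalar score per item from (preferred, rejected) pairs.

    Model: P(preferred beats rejected) = sigmoid(score[preferred] - score[rejected]).
    """
    score = {item: 0.0 for item in items}
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            p = 1.0 / (1.0 + math.exp(-(score[winner] - score[loser])))
            # Gradient of log P(winner beats loser) with respect to the two scores.
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for item in items:
            # Gradient ascent on the average log-likelihood.
            score[item] += lr * grad[item] / len(comparisons)
    return score

behaviors = ["sit quietly", "sit while barking", "jump on couch"]
# Hypothetical human judgments: (preferred, rejected)
judgments = [
    ("sit quietly", "sit while barking"),
    ("sit quietly", "jump on couch"),
    ("sit while barking", "jump on couch"),
    ("sit quietly", "sit while barking"),
]
scores = fit_reward(judgments, behaviors)
print(scores)
```

Once such a scoring function exists, it can stand in for the human during training: scoring a new behavior is just a function call.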

In summary, RLHF typically involves three stages:

  • Collect human feedback: sample model outputs, and ask human judges to compare which is better.
  • Learn a reward model that mimics the human judges’ preferences.
  • Train a better policy using the learned reward model in the RL process.

In case you are not familiar with RL terminology: a policy refers to the agent’s strategy to choose actions based on the state of the environment.

Next we will cover how this RLHF approach is implemented in finetuning InstructGPT.

Implementation of RLHF in InstructGPT

InstructGPT and ChatGPT were trained using the same methods (see this blog), with RLHF being the key element in finetuning.

The training process largely follows the steps introduced in the previous section, with special care given to data quality and implementation details, which, in my opinion, are equally important in making InstructGPT such a success.

Now let me break it down.

Figure 8. An illustration of the RLHF steps in training InstructGPT/ChatGPT. (image from InstructGPT paper.)

Step 1: Collect demonstration data and train a supervised policy

In this step, human labelers were asked to provide high-quality demonstrations of the desired behavior for each prompt.

Prompt dataset: To begin with, you need to have a prompt dataset from which you can sample individual prompts, and ideally that prompt dataset should be both useful and diverse.

To do that, the authors took an iterative approach: in the very beginning, labelers were asked to manually write some seed prompts, and these data were used to train a model via supervised learning. This model was later deployed to the OpenAI API to collect text prompts from users, which later formed the prompt dataset.

The table below shows the distribution of this prompt dataset, as diversity is very important in making sure the model will be trained on various tasks:

Human data collection: human data are needed in three components throughout the RLHF process, including writing demonstrations in Step 1, providing comparison data in Step 2, and conducting final evaluations after finetuning.

In the paper the authors mentioned many practices to ensure data quality:

  • Firstly, high-quality data come from good labelers. To ensure their ability in data labeling, a screening test was conducted to select labelers who were “sensitive to the preferences of different demographic groups, and were good at identifying outputs that were potentially harmful”.
  • Secondly, to ensure consistency between all the labelers, an onboarding process was set up to train all labelers, and detailed instructions for each task were provided. The authors also mentioned that they set up a shared chat room to answer questions from labelers.
  • Finally, to see how the model generalizes to the preferences of different labelers, a separate group of labelers who did not go through the screening test was hired for evaluation.

Based on these human demonstration data, a pretrained GPT-3 model was finetuned using supervised learning in the first step. This model is referred to as the baseline policy, which will be used to produce comparison outputs in Step 2 and initialize the PPO algorithm in Step 3.

Step 2: Collect comparison data and train a reward model

Comparison data collection: Once the baseline policy is available, it is used to generate outputs for some sampled prompts, and these outputs are reviewed and ranked by human labelers from best to worst. To speed up this ranking process, a set of K outputs is shown simultaneously to the human labelers, where K ranges from 4 to 9.
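One reason ranking K outputs at once is efficient is that a single ranking yields K·(K−1)/2 pairwise comparisons. A minimal sketch of that expansion follows; the function name `ranking_to_pairs` and the output labels are hypothetical placeholders.

```python
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the one ranked higher by the labeler.
    return list(combinations(ranked_outputs, 2))

ranked = ["output_a", "output_b", "output_c", "output_d"]  # hypothetical, best first
pairs = ranking_to_pairs(ranked)
print(len(pairs))  # 4 ranked outputs give 6 comparisons
```

With K = 9, the same labeling task produces 36 comparisons.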

Reward model training: The reward model was initialized from the supervised baseline policy with the final unembedding layer removed, and was then trained on the comparison data to assign a scalar score to each prompt-response pair. The authors use a reward model with 6B parameters. In particular, they mention that treating all comparisons from each prompt as a single batch element, rather than shuffling the comparisons, helps alleviate overfitting. Note that we need to strike a balance when deciding the size of this reward model: it needs to be sufficiently large to accurately mimic human preferences, but it cannot be too large since it needs to support fast inference during the RL process.
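For reference, the pairwise loss used to train the reward model can be written as below, as best I can restate it from the InstructGPT paper. Here r_θ(x, y) is the scalar score for prompt x and response y, y_w is the preferred response of a pair, y_l is the rejected one, D is the comparison dataset and σ is the sigmoid function.

```latex
\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\,
  \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
  \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```

Minimizing this loss pushes the score of the preferred response above the score of the rejected one, and the 1/C(K, 2) factor averages over all pairs that come from the same prompt.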

Step 3: Optimize a policy using the reward model with PPO

At this point we have got everything ready to finetune the model with RLHF: the initial policy and the reward model. The training in this step follows a typical RL process: in each episode, a new prompt is sampled (the “state”) and new outputs will be generated (the model’s “action”) by the current policy (the “agent”), and then the reward model will calculate a reward for the output (“reward”), according to which the policy will be updated using PPO.

Don’t worry if you are not familiar with PPO (Proximal Policy Optimization): it is simply a method that limits how far each update can move the policy, so that the agent changes its strategy gradually.
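For the curious, the “slow update” idea is usually implemented with PPO’s clipped surrogate objective. The sketch below evaluates it for a single action; the function name and the clipping value of 0.2 are illustrative assumptions, not the settings used for InstructGPT.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action (a quantity to maximize)."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)

# The new policy makes a good action (advantage > 0) 1.5x more likely,
# but the objective only gives credit up to a ratio of 1.2.
gain = ppo_clipped_objective(math.log(0.3), math.log(0.2), advantage=1.0)
print(gain)
```

Because the objective stops improving once the ratio leaves the clipping range, there is no incentive to move the policy far from the previous one in a single update.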

A few things to mention here:

  • A per-token KL penalty against the supervised baseline policy is added to the reward, to mitigate over-optimization of the reward model.
  • The authors further experimented with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets (such regressions are often called “the alignment tax”). This variant is referred to as “PPO-ptx”, and in the paper, InstructGPT actually refers to the PPO-ptx models.
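The KL penalty above can be sketched as a simple reward-shaping step. In this sketch the reward-model score is added on the final token and every token is penalized by the difference in log-probability between the current policy and the supervised policy. Placing the score on the last token is a common convention that I am assuming here, and the function name, the example numbers and `beta=0.1` are all illustrative.

```python
def kl_shaped_rewards(rm_score, logp_policy, logp_sft, beta=0.1):
    """Per-token rewards: a KL penalty at every token, plus the reward-model
    score on the final token (assumed convention)."""
    if not logp_policy or len(logp_policy) != len(logp_sft):
        raise ValueError("need two non-empty log-prob lists of equal length")
    # Penalize tokens that the current policy finds more likely than the SFT policy does.
    rewards = [-beta * (lp - ls) for lp, ls in zip(logp_policy, logp_sft)]
    rewards[-1] += rm_score
    return rewards

# Hypothetical log-probabilities for a three-token response.
shaped = kl_shaped_rewards(
    rm_score=1.0,
    logp_policy=[-1.0, -0.5, -2.0],
    logp_sft=[-1.5, -0.5, -1.0],
)
print(shaped)
```

The further the policy drifts from the supervised model on the tokens it generates, the more reward it gives up, which discourages it from exploiting quirks of the reward model.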

Note that Step 2 and Step 3 can be iterated continuously:

  • With an updated policy (from Step 3), we can generate new outputs and collect more comparison data, which can be used to train a new reward model by repeating Step 2;
  • With a new reward model (from Step 2), we can get a better policy by repeating Step 3.

Findings in Evaluation

Due to space limitations we will not go through all the evaluation results in this article. Instead, we will just highlight several new findings.

Perhaps the most important finding is that RLHF can indeed improve alignment. The figure below shows the win rate against the 175B supervised (SFT) GPT-3 model, as evaluated by human judges. According to this figure, both PPO and PPO-ptx significantly outperform the GPT baselines, and even the 1.3B PPO models are preferred over the 175B GPT-3. This result clearly demonstrates the effectiveness of RLHF.

Figure 9. Human evaluation results. (Image from InstructGPT paper.)
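The win rate in this kind of plot is simply the fraction of head-to-head comparisons in which labelers preferred a model’s output over the baseline’s output. A minimal sketch, with hypothetical labels and a hypothetical function name:

```python
def win_rate(judgments, model="ppo"):
    """Fraction of head-to-head comparisons in which `model` was preferred."""
    if not judgments:
        raise ValueError("need at least one judgment")
    wins = sum(1 for winner in judgments if winner == model)
    return wins / len(judgments)

# Hypothetical labeler choices for four prompts: which output was preferred.
labels = ["ppo", "ppo", "sft", "ppo"]
rate = win_rate(labels)
print(rate)
```

A win rate of 0.5 means the model is on par with the baseline; anything above that means labelers prefer it more often than not.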

The authors also found that InstructGPT shows improvements in truthfulness (hallucination rate reduced from 41% to 21%) and slight improvements in toxicity (25% fewer toxic outputs), but no significant improvement in bias.

Another finding is that PPO-ptx can minimize performance regressions on public NLP datasets, as shown in the figure below.

Figure 10. Few-shot performance on public NLP datasets. (Image from InstructGPT paper.)

Training an LLM usually involves multiple stages like pre-training, supervised finetuning, and alignment with RLHF. For our tasks at hand, we can usually start from an open-source, pre-trained LLM and finetune it on domain-specific data.

A few questions to ask while finetuning your own LLMs (though this is not meant to be an exhaustive list):

  • Do we have a clear definition of the model’s desired behaviors? How can we evaluate such behaviors? If there are no available metrics, can we create one ourselves?
  • Do we have available training data? If not, how can we collect such data ourselves? If human labelers are needed, how can we ensure their labeling quality?
  • What kind of cleaning or pre-processing is needed? Are there any heuristics we can use to check the data quality?
  • Does our data cover a wide range of scenarios?
  • Do we need to modify our tokenizers? Do we need to modify the model structures? Do we need to add auxiliary finetuning objectives?
  • Does finetuning lead to regression on pre-training performance? Can we seek a balance?
  • Does finetuning lead to some unexpected negative behaviors? How can we mitigate that?
  • How can we prevent overfitting in the finetuning process?
  • What hyper-parameters can we tune during finetuning or during evaluation? Are there any heuristics we can leverage?

At the end of the day, exploring a new task is always both challenging and exciting, and I hope the learnings from this article can help make it less challenging, more exciting, and ultimately more enjoyable 🙂

Thanks for reading!
