ChatGPT, a powerful language model, is capable of answering questions about code snippets, summarizing documents, and writing coherent short stories. Despite its limitations and occasional mistakes, ChatGPT showcases the progress in language modeling over the past few years. In this article, we will specifically cover the training objectives used to develop ChatGPT.


Training Overview

Let's walk through the training objectives used to develop ChatGPT.

It all starts with a method similar to InstructGPT, which OpenAI released in early 2022. The main difference is that ChatGPT focuses on engaging in interactive dialogues instead of just following instructions.

Let me break it down for you in three simple steps:

1. First, we have "generative pre-training." In this stage, a language model is trained on a massive amount of text data so it can learn language patterns from all kinds of contexts.

2. Next up is "supervised fine-tuning." Here, the model learns from humans who demonstrate the best chatbot behavior. This way, it can better understand how to respond to users in a helpful and engaging way.

3. Finally, we have "reinforcement learning from human feedback," or RLHF for short. In this stage, the model learns from humans who rank different responses, helping it understand which answers are the most preferred. This information is then used to fine-tune the model even further.

Throughout these steps, each stage builds on the result of the previous one. It's like a cycle of improvement, helping ChatGPT become better and better at holding conversations with users like you!

Generative Pretraining 

What is a language model anyway? 

It's essentially an autoregressive sequence model: it predicts the next element of a sequence based on the elements that came before it.

Here's how it works:

During training, the model analyzes sequences from a dataset and adjusts its parameters to maximize the likelihood of predicting the correct next element (like words or sub-words).
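
To make that concrete, here is a minimal, illustrative sketch of the next-token objective in PyTorch. The TinyLM model, its sizes, and the random batch are all made up for the example; real language models are vastly larger, but the loss being minimized has the same shape.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A deliberately tiny stand-in for a real language model (which would be a
# large Transformer); the training objective it optimizes is the same.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)   # a score for every token in the vocabulary

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                      # logits: (batch, seq_len, vocab_size)

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, 100, (4, 16))               # a fake batch of token ids

# Predict token t+1 from tokens 1..t: shift inputs and targets by one position.
logits = model(tokens[:, :-1])
targets = tokens[:, 1:]

# Maximizing the likelihood of the correct next token == minimizing cross-entropy.
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
optimizer.step()
```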

This approach has been used in various domains, such as audio waveforms and molecular graphs. With language models, the elements are called "tokens" and can represent words or parts of words.

Given a partial sentence, a language model assigns a probability to every word in its vocabulary being the next word; most words get a probability close to zero, and only a few plausible continuations get a higher probability.

Although language models are trained to predict the next token, they are used to generate whole sequences during inference. They do this by sampling a token from the model's output distribution, appending it to the context, and repeating the process until a special "stop" token is produced.
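
As a rough sketch of that generation loop, continuing the toy TinyLM example above (the STOP_TOKEN id and the length cap are arbitrary choices for illustration):

```python
import torch
import torch.nn.functional as F

STOP_TOKEN = 0            # an arbitrary id standing in for the special "stop" token
MAX_NEW_TOKENS = 50       # safety cap so the loop always terminates

@torch.no_grad()
def generate(model, prompt_tokens):
    """Sample from `model` (e.g. the TinyLM above) until a stop token appears."""
    tokens = prompt_tokens.clone()                   # (1, prompt_len)
    for _ in range(MAX_NEW_TOKENS):
        logits = model(tokens)[:, -1, :]             # distribution over the *next* token only
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample, don't just argmax
        tokens = torch.cat([tokens, next_token], dim=1)        # append and feed back in
        if next_token.item() == STOP_TOKEN:
            break
    return tokens

completion = generate(model, torch.randint(1, 100, (1, 5)))    # toy prompt of 5 token ids
```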

Modern language models, like OpenAI's GPT-3 or Google's PaLM and Bard, are usually large Transformers with billions of parameters. They're trained on vast amounts of text from sources like chat forums, blogs, books, scripts, academic papers, and more.



However, there's a limit to how much prior context these models can use during inference. ChatGPT can handle around 3,000 words (roughly 4,000 tokens), which is enough for short conversations but not for generating an entire novel.
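
In practice this means that older parts of a long conversation have to be dropped (or summarized) before the text is fed back to the model. A very simplified illustration of that bookkeeping, with a hypothetical count_tokens helper standing in for a real tokenizer:

```python
# Keep only the most recent messages that fit in the model's context window.
# `count_tokens` is a hypothetical helper: real systems use the model's own
# tokenizer rather than this crude words-to-tokens approximation.
def count_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)       # rough rule of thumb: ~1.3 tokens per word

def truncate_history(messages: list[str], budget: int = 4000) -> list[str]:
    kept, used = [], 0
    for message in reversed(messages):        # walk backwards from the newest message
        cost = count_tokens(message)
        if used + cost > budget:
            break                             # anything older simply falls out of context
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = ["Hi!", "Hello! How can I help?", "Explain how the bubble sort algorithm works."]
print(truncate_history(history))
```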

The more recent GPT-4 pushes these limits: it can handle more than 25,000 words of text, it accepts images as input and can describe them, and with the plugin feature rolled out in March 2023, ChatGPT can access the internet through a browsing plugin so its answers can draw on up-to-date information.

There are also plugins for more specific use cases, such as a code interpreter, and third-party integrations like Zapier.

By training on such diverse text data, language models can learn complex relationships between words, sentences, and paragraphs, enabling them to tackle a wide range of language use cases.


The Alignment Problem

Why isn't a basic language modeling objective, like generative pre-training, enough to achieve the desired behavior we see in ChatGPT?

The issue lies in the difference between the language modeling objective and the actual tasks developers or users want the model to perform, such as following instructions or engaging in informative dialogues.

Language model pre-training involves a vast mixture of tasks, and user input might not be enough to clarify the specific task at hand. For instance, if a user asks, "Explain how the bubble sort algorithm works," it's clear to us what they want. However, the model is trained to generate plausible continuations, so it might respond with something related but not exactly what the user intended.

To make a base language model perform a specific task, we can use prompting: conditioning it on one or more manually constructed examples of the task, as shown below. But writing those examples can be tedious for the user.
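
To make that concrete, a manually constructed prompt might look like the following. The wording is invented for illustration; the point is that the hand-written demonstration at the top nudges the model toward the intended task.

```python
# An illustrative few-shot prompt: the hand-written example shows the model what
# kind of continuation is wanted before the real question is asked.
prompt = """Q: Explain how binary search works.
A: Binary search repeatedly halves a sorted list, comparing the target with the
middle element and discarding the half that cannot contain it, until the target
is found or the list is empty.

Q: Explain how the bubble sort algorithm works.
A:"""

# A base language model asked to continue this text is now much more likely to
# produce an explanation, because the desired format was demonstrated once.
```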

The basic language model objective isn't sufficient because of the misalignment between the language modeling objective and the desired tasks, the ambiguity of user input, and the need to accommodate subjective developer preferences.

Supervised Fine-tuning

In the second stage, the model undergoes fine-tuning using supervised learning. Human contractors engage in conversations, acting as both the user and the ideal chatbot. These dialogues are compiled into a dataset, with each training example consisting of a conversation history and the chatbot's subsequent response. The objective is to maximize the probability that the model assigns to the token sequence in the corresponding response.
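
A minimal sketch of that objective, assuming the conversation history and the demonstrated response have already been tokenized: it is the usual next-token cross-entropy, but computed only over the response tokens, since those are what the human demonstrated.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, history_tokens, response_tokens):
    """Cross-entropy on the demonstrated response, conditioned on the history.

    `model` maps a (1, seq_len) tensor of token ids to (1, seq_len, vocab) logits,
    like the toy TinyLM above. Both inputs are (1, n) tensors of token ids.
    """
    tokens = torch.cat([history_tokens, response_tokens], dim=1)
    logits = model(tokens[:, :-1])                 # predict each next token
    targets = tokens[:, 1:].clone()

    # Ignore positions whose target lies inside the history: only the
    # chatbot's response should contribute to the loss.
    n_history = history_tokens.size(1)
    targets[:, : n_history - 1] = -100             # -100 is ignored by cross_entropy

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
    )
```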

This process resembles imitation learning, specifically behavior cloning, where the model attempts to mimic an expert's actions based on input states. With this step, the fine-tuned model responds better to user requests, reducing the need for prompting. However, it still has limitations.

Limitations of Supervision: Distributional Shift

In imitative settings like language modeling or other domains such as driving or gaming, there is a distributional shift issue. During training, the distribution of states depends on the human demonstrator's behavior, but during inference, the model influences the distribution of visited states. Due to practical factors like insufficient data, partial environment observability, or optimization difficulties, the model may only approximate the expert's policy.

As the model acts, it may make mistakes that the human demonstrator wouldn't. This can lead to novel states with less training data support, causing a compounding error effect. The expected error can grow quadratically with the length of an episode.
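
For intuition, the standard analysis of behavior cloning makes this precise (the bound comes from the imitation-learning literature and is not specific to ChatGPT): if the cloned policy makes a mistake the expert would not have made with probability epsilon at each of T steps, the gap in expected cost can grow as

```latex
% Classic behavior-cloning bound (order-of-magnitude intuition):
% \epsilon = per-step probability of deviating from the expert, T = episode length
J(\pi_{\text{cloned}}) \;\le\; J(\pi_{\text{expert}}) + \mathcal{O}(\epsilon \, T^{2})
```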

For language models, early mistakes may lead to overconfident assertions or nonsensical outputs. To address this, the model needs to actively participate in training, not just passively observe an expert.

Reward Learning Based on Preferences

A method to further improve the model is by incorporating reinforcement learning (RL) into the fine-tuning process.

In some RL settings, a predefined reward function is available. However, defining a reward function can be challenging in other cases, such as language modeling, where success in a conversation is difficult to quantify precisely.

Assigning numerical scores directly to a conversation might be hard to calibrate. To address this, ChatGPT developers establish a reward function based on human preferences. AI trainers engage in conversations with the current model, and for each model response, alternative responses are sampled. Human labelers then rank these responses from most to least preferred.

To convert this information into a scalar reward suitable for reinforcement learning, a separate reward model, initialized with the weights of the supervised model, is trained on these rankings. The reward model assigns a scalar score to each response, and the difference between two responses' scores is treated as a logit (an unnormalized log probability): the larger the gap, the higher the probability that the higher-scoring response is the preferred one.
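
A minimal sketch of that ranking loss, assuming a hypothetical reward_model that maps a tokenized (history, response) pair to a single scalar score; each human-labeled comparison contributes a term that pushes the preferred response's score above the rejected one's:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_model, history, preferred, rejected):
    """Pairwise preference loss for one labeled comparison.

    `reward_model(history, response)` is assumed to return a scalar tensor:
    the unnormalized score of `response` given the conversation `history`.
    """
    r_preferred = reward_model(history, preferred)
    r_rejected = reward_model(history, rejected)

    # The score difference acts as the logit of "preferred beats rejected",
    # so minimizing this loss raises the preferred response's score relative
    # to the rejected one's.
    return -F.logsigmoid(r_preferred - r_rejected)
```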

Reinforcement Learning from Human Feedback (RLHF)

During the reinforcement learning stage, the chatbot is fine-tuned from the final supervised model. It responds to humans in conversations by emitting sequences of tokens. For each conversation history and corresponding action, the reward model returns a numerical reward.

Proximal Policy Optimization (PPO), a popular algorithm across many domains, is chosen as the reinforcement learning method. It is important to note, however, that the learned reward model is only an approximation of human subjective assessment.

Optimizing this learned reward can lead to performance degradation on the true downstream task. To avoid over-optimization, an additional term is applied to the PPO objective, penalizing the KL divergence between the RL policy and the policy learned from supervised fine-tuning.
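
A sketch of what that penalized reward looks like for a single response; the coefficient beta and the per-token approximation of the KL term are illustrative choices, not OpenAI's exact implementation:

```python
import torch

def penalized_reward(reward_score, logprobs_rl, logprobs_sft, beta=0.02):
    """Reward used for PPO, with a KL penalty toward the supervised policy.

    reward_score -- scalar score from the learned reward model
    logprobs_rl  -- (n_tokens,) log-probs the RL policy assigned to the tokens it emitted
    logprobs_sft -- (n_tokens,) log-probs of the same tokens under the supervised model
    beta         -- penalty strength; a hypothetical value, tuned in practice
    """
    # Per-token log-ratio; its sum estimates the KL divergence between the two
    # policies along this particular response.
    approx_kl = (logprobs_rl - logprobs_sft).sum()
    return reward_score - beta * approx_kl
```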

This process of reward model learning and PPO is iterated several times. With each iteration, the updated policy is used for preference ranking and training a new reward model. The policy is then updated through another round of PPO.

The combination of supervised learning and reinforcement learning from human feedback significantly improves the model's performance. 

InstructGPT showed that labelers preferred the outputs of a fine-tuned model with only 1.3 billion parameters over those of the original 175-billion-parameter GPT-3 from which it was fine-tuned.

Conclusion 

Even though ChatGPT has impressive abilities, there is still a lot to improve.

It may sometimes give wrong or made-up information and can't provide direct sources. Its responses depend heavily on how a question is asked. 

Although prompting is less necessary, some input adjustments might still be needed to get the desired response. These challenges present exciting opportunities for future model development and progress.

