OpenAI has introduced CriticGPT, a new model based on GPT-4, to spot the mistakes ChatGPT makes in code generation. OpenAI stated that when people get help from CriticGPT to review ChatGPT code, they outperform those without help 60% of the time.
They further said they are beginning to integrate CriticGPT-like models into their RLHF labeling pipeline, providing the trainers with explicit AI assistance. This is a step towards evaluating outputs from advanced AI systems that can be difficult for people to rate without better tools.
The GPT-4 series of models, which powers ChatGPT, is aligned to be helpful and interactive through “Reinforcement Learning from Human Feedback” (RLHF). An integral part of RLHF is collecting comparisons in which AI trainers rate different ChatGPT responses against each other.
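Comparisons collected this way are typically used to fit a reward model that scores responses. As a rough illustration (not OpenAI's published code; the function and variable names here are invented), a Bradley-Terry style pairwise loss rewards the model for scoring the trainer-preferred response above the rejected one:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(preferred_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss commonly used in RLHF.

    Each element pairs the reward-model score of the response a
    trainer preferred with the score of the one they rejected.
    Minimising this pushes the model to rank preferred responses higher.
    """
    return -F.logsigmoid(preferred_scores - rejected_scores).mean()

# Hypothetical scores for three trainer comparisons.
preferred = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(reward_model_loss(preferred, rejected))  # scalar training loss
```

The policy model is then fine-tuned to maximise this learned reward, which is why subtle rating errors by trainers can propagate into the final model.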
“As we advance in reasoning and model behaviour, ChatGPT becomes more accurate, and its mistakes become more subtle. This can make it hard for AI trainers to spot inaccuracies when they do occur, making the comparison task that powers RLHF much harder. This is a fundamental limitation of RLHF, and it may make it increasingly difficult to align models as they gradually become more knowledgeable than any person who could provide feedback. To help with this challenge, we trained CriticGPT to write critiques highlighting inaccuracies in ChatGPT answers,” OpenAI said in an official statement.
They also acknowledged that CriticGPT’s suggestions are not always correct, but found that these suggestions help trainers catch many more problems with model-written answers than they would without AI help.
Additionally, when people leverage CriticGPT, the AI augments their skills, resulting in more comprehensive critiques than when people work alone and fewer hallucinated bugs than when the model works alone. OpenAI stated that during the experiments, a second random trainer preferred critiques from the Human+CriticGPT team over those from an unassisted person more than 60% of the time.
Like ChatGPT, CriticGPT was trained with RLHF. Unlike ChatGPT, however, it saw a large number of inputs that contained mistakes, which it then had to critique. AI trainers were asked to manually insert these mistakes into code written by ChatGPT and then write example feedback as if they had caught the bug they had just inserted. The same person then compared multiple critiques of the modified code, so they could quickly tell when a critique caught their inserted bug. In their experiments, the team studied whether CriticGPT could catch both inserted bugs and “naturally occurring” ChatGPT bugs that a previous trainer had caught. The team found that trainers preferred CriticGPT critiques over ChatGPT critiques in 63% of cases on naturally occurring bugs, in part because the new critic produces fewer “nitpicks” (minor complaints that are unhelpful) and hallucinates problems less often.
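To make the tampering setup concrete, here is a minimal sketch of what one such training example might look like. All names are invented for illustration, and the keyword check is far cruder than the real procedure, in which the trainer who inserted the bug judges the critiques themselves:

```python
from dataclasses import dataclass

@dataclass
class TamperedSample:
    """One illustrative example from a bug-insertion pipeline."""
    original_code: str       # code originally written by ChatGPT
    tampered_code: str       # the same code after a trainer inserted a bug
    inserted_bug_note: str   # the trainer's description of the inserted bug
    reference_critique: str  # feedback written as if the bug had been caught

def critique_catches_bug(critique: str, bug_note: str) -> bool:
    """Toy check: does a critique mention the inserted bug?"""
    keywords = [w for w in bug_note.lower().split() if len(w) > 4]
    return any(k in critique.lower() for k in keywords)

sample = TamperedSample(
    original_code="def mean(xs): return sum(xs) / len(xs)",
    tampered_code="def mean(xs): return sum(xs) / (len(xs) - 1)",
    inserted_bug_note="denominator changed to len(xs) - 1, biasing the result",
    reference_critique="The function divides by len(xs) - 1 instead of len(xs).",
)
print(critique_catches_bug(sample.reference_critique, sample.inserted_bug_note))  # True
```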
They also found that they could generate longer and more comprehensive critiques by running additional test-time search against the critique reward model. This search procedure lets the team control how aggressively the model looks for problems in the code, configuring a precision-recall trade-off between hallucinated issues and the number of detected bugs, so that the critiques it generates are as helpful as possible for RLHF.
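The announcement does not detail the search procedure, so the sketch below uses a simple best-of-n stand-in: sample several candidate critiques, score each with the critique reward model, and add a tunable bonus per claimed problem. Raising the bonus favours aggressive, high-recall critiques; lowering it favours precision. `sample_critique`, `score_critique`, and `length_bonus` are all hypothetical names:

```python
import random

def search_critiques(sample_critique, score_critique,
                     n: int = 8, length_bonus: float = 0.0) -> dict:
    """Best-of-n stand-in for test-time search against a critique
    reward model. `length_bonus` sets the precision-recall trade-off:
    higher values prefer critiques that claim more problems."""
    candidates = [sample_critique() for _ in range(n)]
    def total_score(c: dict) -> float:
        return score_critique(c["text"]) + length_bonus * c["num_claims"]
    return max(candidates, key=total_score)

# Toy stand-ins for the critic model and the reward model.
fake_critic = lambda: {"text": "possible off-by-one in the loop bound",
                       "num_claims": random.randint(1, 4)}
fake_reward = lambda text: random.random()
best = search_critiques(fake_critic, fake_reward, length_bonus=0.3)
print(best["text"])
```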
Read the full research paper here.
Source: OpenAI