
OpenAI’s new “CriticGPT” model is trained to critique GPT-4 outputs

Illustration created by OpenAI.

On Thursday, OpenAI researchers unveiled CriticGPT, a new AI model designed to identify bugs in code generated by ChatGPT. It aims to improve the process of getting AI systems to behave in ways that humans desire (called "alignment") through Reinforcement Learning from Human Feedback (RLHF), which relies on human reviewers to make large language model (LLM) outputs more accurate.
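As a rough illustration of the RLHF idea (not OpenAI's actual implementation), a reward model is typically fit to human preference comparisons with a pairwise loss like the one sketched below; the function and variable names here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompt, preferred, rejected):
    """Pairwise loss used to fit a reward model to human comparisons:
    the model is pushed to score the human-preferred response above
    the rejected one. Illustrative sketch only."""
    r_preferred = reward_model(prompt, preferred)  # scalar score for the chosen answer
    r_rejected = reward_model(prompt, rejected)    # scalar score for the rejected answer
    # -log sigmoid(r_preferred - r_rejected) is minimized when the
    # preferred response receives the higher reward.
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```

The trained reward model then scores new outputs during reinforcement learning, which is why the quality of the underlying human judgments matters so much.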

As outlined in a new research paper called "LLM Critics Help Catch LLM Bugs," OpenAI created CriticGPT to act as an AI assistant to human trainers who review program code generated by the ChatGPT AI assistant. CriticGPT, based on the GPT-4 family of LLMs, analyzes code and points out potential errors, making it easier for humans to spot mistakes that might otherwise go unnoticed. The researchers trained CriticGPT on a dataset of code samples with intentionally inserted bugs, teaching it to recognize and flag various coding errors.

The researchers found that annotators preferred CriticGPT's critiques over human critiques in 63 percent of cases involving naturally occurring LLM errors, and that human-machine teams using CriticGPT wrote more comprehensive critiques than humans alone while reducing confabulation (hallucination) rates compared to AI-only critiques.

Development of an automated critic

The development of CriticGPT involved training the model on a large number of inputs containing intentionally inserted errors. Human trainers were asked to modify the code written by ChatGPT by introducing errors and then provide sample feedback as if they had found those errors. This process allowed the model to learn how to identify and critique different types of coding errors.
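A minimal sketch of what one such "tampering" training example might look like, assuming a simple record format; the field names, helper class, and sample content below are hypothetical rather than OpenAI's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TamperedExample:
    """One hypothetical training record for a critique model:
    the original ChatGPT code, the same code with a human-inserted bug,
    and the human-written critique pointing out that bug."""
    prompt: str          # the task the assistant was asked to solve
    original_code: str   # ChatGPT's answer before tampering
    tampered_code: str   # the answer with a subtle bug inserted by a trainer
    critique: str        # reference feedback describing the inserted bug

example = TamperedExample(
    prompt="Write a function that returns the largest value in a list.",
    original_code="def largest(xs):\n    return max(xs)",
    tampered_code="def largest(xs):\n    return max(xs[1:])",  # skips the first element
    critique="The slice xs[1:] drops the first element, so the function "
             "returns the wrong answer whenever the maximum is at index 0.",
)
```

Training on pairs like this teaches the critic to connect a specific code change with the feedback a careful human reviewer would write about it.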

In experiments, CriticGPT demonstrated its ability to catch both deliberately inserted bugs and naturally occurring errors in ChatGPT's output. Trainers preferred the new model's critiques over those generated by ChatGPT itself in 63 percent of cases involving natural bugs (the aforementioned statistic). This preference is partly because CriticGPT produces fewer unhelpful "nitpicks" and generates fewer false positives, or hallucinated problems.

The researchers also created a new technique they call Force Sampling Beam Search (FSBS), which helps CriticGPT write more detailed code reviews. It lets researchers adjust how thoroughly CriticGPT searches for problems while controlling how often it invents problems that don't actually exist, and that balance can be tuned depending on what different AI training tasks require.
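The selection step in that trade-off might look something like the sketch below: several candidate critiques are sampled and the best one is chosen by combining a reward model's rating with a length term that can be dialed up or down. All names and the exact scoring formula here are assumptions, not OpenAI's published code.

```python
def select_critique(candidates, reward_model_score, length_weight=0.0):
    """Pick the best of several sampled critiques.

    length_weight > 0 favors longer, more exhaustive critiques;
    length_weight < 0 penalizes length, which tends to cut down on
    invented problems. Illustrative sketch only.
    """
    def score(critique):
        # Combine the reward model's rating with a length term.
        return reward_model_score(critique) + length_weight * len(critique.split())

    return max(candidates, key=score)
```

Changing a single weight like this is one plausible way to move along the thoroughness-versus-hallucination curve the researchers describe.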

Interestingly, the researchers found that CriticGPT's capabilities extend beyond simple code review. In their experiments, they applied the model to a subset of ChatGPT training data that had previously been rated as flawless by human annotators. Surprisingly, CriticGPT identified errors in 24 percent of those cases, errors that were subsequently confirmed by human reviewers. OpenAI believes this demonstrates the model's potential to generalize to non-code tasks and highlights its ability to catch subtle errors that even careful human evaluation might miss.

Despite its promising results, like all AI models, CriticGPT has limitations. The model was trained on relatively short ChatGPT responses, which may not fully prepare it to evaluate longer, more complex tasks that future AI systems may handle. Additionally, although CriticGPT reduces confabulations, it does not eliminate them completely, and human trainers may still make labeling errors based on these spurious results.

The research team recognizes that CriticGPT is most effective at identifying bugs that can be pinpointed to a specific location in the code. However, real-world errors in AI results can often be spread over multiple parts of a response, posing a challenge for future iterations of the model.

OpenAI plans to integrate CriticGPT-like models into its RLHF labeling pipeline, providing its trainers with AI assistance. For OpenAI, this is a step toward developing better tools to evaluate the results of LLM systems that may be difficult for humans to evaluate without additional support. However, the researchers caution that even with tools like CriticGPT, highly complex tasks or answers can still prove challenging for human raters—even those assisted by AI.
