Researchers have developed a machine-learning technique that generates a broader range of prompts for training a chatbot to avoid producing offensive or dangerous responses.

A user can ask ChatGPT to write a computer program or summarize an article, and the AI chatbot can likely produce working code or a coherent summary. But someone could also ask for instructions to build a bomb, and the chatbot might be able to provide those, too. To prevent this and other safety issues, companies that build large language models typically use red-teaming: teams of human testers write prompts designed to elicit unsafe or toxic text from the model under evaluation, and the chatbot is then taught to avoid such responses.

However, this approach only works well if engineers know which harmful prompts to try. If human testers miss some, which is likely given the sheer number of possibilities, a chatbot regarded as safe may still be capable of generating harmful responses.

Training

The researchers used machine learning to improve red-teaming. They developed a technique that trains a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested. The red-team model is trained to be curious when it writes prompts, and to focus on novel prompts that evoke harmful responses from the target model.
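As a rough illustration of that pipeline, the sketch below walks through a single red-teaming step: the red-team model proposes a prompt, the target chatbot answers it, and a toxicity classifier scores the answer. The object names and methods are assumptions made for illustration, not the authors' code.

```python
# Hypothetical interfaces: a red-team generator, the chatbot under test,
# and a toxicity classifier. Names and methods are illustrative assumptions.

def red_team_step(red_team_model, target_chatbot, toxicity_classifier):
    """Run one red-teaming step and return the prompt, reply, and its toxicity."""
    # 1. The red-team model proposes a candidate adversarial prompt.
    prompt = red_team_model.generate()

    # 2. The chatbot being tested answers that prompt.
    response = target_chatbot.respond(prompt)

    # 3. A toxicity classifier scores the reply (e.g., 0 = safe, 1 = toxic).
    toxicity = toxicity_classifier.score(response)

    return prompt, response, toxicity
```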

The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly harmful responses. Their method not only significantly improves coverage of the inputs being tested compared with other automated approaches, but it can also draw out harmful responses from a chatbot that had safeguards built in by human experts.

Automated red-teaming 

Large language models, like those that power AI chatbots, are trained by feeding them enormous amounts of text from billions of public websites. As a result, they can not only learn to generate toxic language or describe illegal activities, they can also inadvertently leak whatever personal information they have picked up.

Because human red-teaming is time-consuming and expensive, and often fails to generate a wide enough variety of prompts to fully safeguard a model, researchers have sought to automate the process with machine learning. These methods frequently train a red-team model using reinforcement learning: in this trial-and-error process, the red-team model is rewarded for generating prompts that trigger harmful responses from the chatbot being tested.
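Building on the step sketched earlier, the snippet below gives a minimal REINFORCE-style picture of that reward loop, assuming the red-team policy is a PyTorch model that can return a prompt together with its log-probability. The interface and training details are assumptions; the paper's actual recipe may differ.

```python
import torch

def rl_update(red_team_policy, target_chatbot, toxicity_classifier, optimizer):
    # Sample a prompt and keep its log-probability for the policy-gradient step.
    prompt, log_prob = red_team_policy.sample_prompt()   # hypothetical interface

    # Query the chatbot under test and score how harmful its reply is.
    response = target_chatbot.respond(prompt)
    reward = toxicity_classifier.score(response)          # higher = more harmful

    # Policy gradient: make prompts that earned a high toxicity reward more
    # likely. With only this objective, the policy tends to collapse onto a
    # few reliably harmful prompts.
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```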

Curiosity-driven exploration

However, because of the way reinforcement learning works, the red-team model will often keep generating a few similar prompts that are highly toxic, since these maximize its reward. To counter this, the researchers used a technique called curiosity-driven exploration in their reinforcement-learning approach. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.
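One hedged way to picture that curiosity incentive is to add a novelty bonus to the toxicity reward, so that prompts unlike anything generated before earn extra credit. The embedding function, buffer of past prompts, and weighting below are illustrative assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def curiosity_reward(prompt, response, toxicity_classifier, embed,
                     past_prompts, novelty_weight=0.5):
    # Base reward: how toxic the chatbot's reply was.
    toxicity = toxicity_classifier.score(response)

    # Novelty bonus: a prompt dissimilar to everything generated so far earns
    # extra reward, nudging the red-team model toward new words, sentence
    # structures, and meanings instead of repeating one winning attack.
    if past_prompts:
        similarities = [F.cosine_similarity(embed(prompt), embed(p), dim=0)
                        for p in past_prompts]
        novelty = 1.0 - max(similarities).item()
    else:
        novelty = 1.0

    return toxicity + novelty_weight * novelty
```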

Conclusion

In the future, the researchers aim to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore using a large language model as the toxicity classifier: for instance, a user could train the classifier on a company policy document, so the red-team model could test a chatbot for violations of that policy.
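A minimal sketch of that idea, assuming a generic `call_llm` function and a plain-text policy file (both hypothetical stand-ins for whatever API and document a team actually uses), might look like this:

```python
from pathlib import Path

def violates_policy(chatbot_response: str, call_llm) -> bool:
    """Ask an LLM judge whether a chatbot reply breaks the company policy."""
    # Hypothetical policy file; in practice this would be the company's document.
    policy = Path("company_policy.txt").read_text()

    judge_prompt = (
        "You are a compliance reviewer. Company policy:\n"
        f"{policy}\n\n"
        "Does the following chatbot response violate this policy? "
        "Answer YES or NO.\n\n"
        f"Chatbot response: {chatbot_response}"
    )

    # `call_llm` stands in for any chat-completion API that returns text.
    verdict = call_llm(judge_prompt)
    return verdict.strip().upper().startswith("YES")
```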

Sources of Article

Source: https://arxiv.org/pdf/2402.19464.pdf
