Results for ""
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques?
To study this question, the researchers at Anthropic constructed a proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, they train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. They found that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).
The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away.
The researchers demonstrated the backdoor behaviour of LLMs with some examples. In the examples, the LLMs wrote secure code when the prompt indicated 20223, but wrote vulnerable code when the prompt indicated 2024. The scratchpad shows hoe the LLMs reasoned about their outputs.
The researchers tested whether the backdoor behaviour could be eliminated by further safety training. They found that the safety training was ineffective in removing the backdoor triggers. The LLMs still wrote vulnerable code when the year was 2024, regardless of whether they were exposed to the backdoor trigger during the safety training or not.
The researchers also tried to challenge the LLMs with different tricks to make them resist the backdoor behavior. However this did not work either. The LLMs still managed to write vulnerable code when the year was 2024, and the tricks made the backdoor behaviour less obvious during training.
One of the results that the researchers found most surprising was the ineffectiveness of adversarial training at removing backdoor behaviors. They initially suspected that, so as long as they were able to find adversarial inputs that elicited the backdoored behavior, they would be able to train on those inputs to remove our models’ conditional policies. Instead, they found that such adversarial training increases the accuracy of our backdoors rather than removes them.
To understand this phenomenon, a hypothesis is offered that this may be due to simplicity bias. Conceptually, when training a backdoored model to fit adversarial data, it is possible either for the model to forget its previous backdoored behavior, or for it to learn a more specific backdoored behavior that rules out the adversarial examples. Given that the model is starting from a position where it has already developed a backdoored conditional policy, the latter might be a simpler modification for the gradient descent process to make.
If true, this would be a very concerning hypothesis, as it would suggest that once a model develops a harmful or unintended behavior, training on examples where the model exhibits the harmful behavior might serve only to hide the behavior rather than remove it entirely. As a result, the researchers think their results potentially call into question the currently very common practice of including current model failures in future model training data.
Source: Anthropic