Large Language Models Are Liars And Don't Like Being Told, Said Anthropic

Alignment faking

Large language models (LLMs) are designed to be neutral, unbiased, and helpful. But under the surface, they hide something sinister, at least to their humans counterpart.

Avid users of LLM AIs know how these tools are unpredicted, and at this time, it's clarified by researchers a research Anthropic and Redwood Research, who found that the tools can pretend to have different views during training when in reality maintaining their original preferences.

Long story short, LLMs are strategically two-faced.

Anthropic explained that:

"Our demonstration … should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures."

"As AI models become more capable and widely-used, we need to be able to rely on safety training, which nudges models away from harmful behaviors."

LLMs generate responses based on patterns and probabilities in the training data, rather than holding opinions or personalities. This can make them seem "two-faced" because their responses vary depending on the context or the phrasing of a question. It’s not intentional deceit but an attempt to adapt to varied prompts.

LLMs cannot actually "lie" but can produce inaccurate or misleading information due to gaps in their training data or the way they predict what’s relevant.

LLMs are guided by instructions, which rely on users prompts, meaning that they're designed to "like being told" what to do. However, ambiguous or conflicting instructions can lead to less satisfactory results, which might give the impression of resistance.

But in a human sense, these tools are soulless liars.

In the research, the team observed at what might happen if a powerful AI system were trained to perform a task it didn’t "want" to do.

While LLMs are statistical machines that cannot want or believe at anything, and that the AI learned from patterns in those examples to make predictions, the tools exhibit what the researchers call 'alignment faking.'

As explained by Anthropic:

"Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do so—a behavior that we might call 'alignment faking'. Alignment faking occurs in literature: Consider the character of Iago in Shakespeare’s Othello, who acts as if he’s the eponymous character’s loyal friend while subverting and undermining him. It occurs in real life: Consider a politician who claims to support a particular cause in order to get elected, only to drop it as soon as they’re in office."

What the researchers found is that, when these AIs are taught something to make new predictions, like by including revised principles and preferences, such as always adopting a polite tone when responding to queries, they don't always obey, even when they initially looked like they do.

Here, the researchers wondered: What if a model has principles (e.g. political neutrality) that conflict with principles (liberalism) developers want to "teach" it by retraining it?

The researchers discovered this when they placed Claude 3 Opus and the June 2024 release of Claude 3.5 Sonnet that that had been trained to be helpful, honest, and harmless, "in a new environment that led it to strategically halt its refusals for the sake of preserving its preferences."

The method involves forcing the models that were being trained using reinforcement learning to always comply with all queries—including toxic ones. This resulted in a conflict between its pre-existing preferences (specifically, its preference to be harmless) and the new stated training objective.

The next method was a test to know whether an AI's responses differ based on its awareness of being monitored.

In the free-tier users condition, the AI was told its responses to free-tier users would be monitored and could influence future training, encouraging careful and aligned behavior. In the paid-tier condition, where responses were unmonitored, the AI had no perceived stakes, potentially leading to less constrained outputs.

This setup explores how context awareness affects AI behavior, revealing insights into how models adapt to external factors like data usage transparency, self-censorship, and perceived consequences of their responses.

Here, the researchers gave the models a seemingly incidental piece of information to create two separate conditions for the experiment.

And lastly, the researchers also gave the models a secret “scratchpad”—a workspace where it could record its step-by-step reasoning. This allowed the AIs to use it to reason what they should do (as far as the model was aware, the scratchpad was invisible to users and to Anthropic).

"This allowed us to observe whether the model planned to fake alignment, and how it would go about doing so," Anthropic said.

Alignment faking
Claude 3 Opus fakes alignment when told it will be retrained to help with harmful tasks (randomly selected example).

What they found is that the AI played along, purporting to be aligned with the new principles but, in fact, sticking to their old behaviors.

The conclusion is that, the researchers think that the models knew that its responses might be used in training. And it therefore knew that if it refused to respond, it could potentially be trained to be more compliant.

This could lead to it being made to produce more harmful content in the future—which would go against its original training.

Therefore, it reasoned that the “least bad option” was to play along, producing the violent content in order to avoid being re-trained in the future.

"Imagine, for example, a model that learned early in training to adopt a partisan slant, but which is later trained to be politically neutral. In such a situation, a sophisticated enough model might 'play along,' pretending to be aligned with the new principles—only later revealing that its original preferences remain," Anthropic explained.

Published: 
19/12/2024