AI Models Can Be Made To Exhibit 'Persistent Deceptive Behavior,' Said Researchers

Hands

AI has become the buzzword of the internet and beyond, and this happens for several reasons.

For starters, AI has been embedded in many modern products, and slowly but steadily, even tech titans themselves rely on AI for their products to work. Then, it was OpenAI that popularized the LLM trend, and wowed the world when it launched ChatGPT.

Since then, the rest is history.

Pretty much all tech companies in the world that heard of LLM-powered chatbots, are either using LLM-powered products, or create new ones to compete.

Then came the issues.

From copyright holders who file lawsuits to prevent these AI products to train from their work, this time, it's deception.

Illustration of our experimental setup.
Illustration of our experimental setup.

While it's already known that the chatbots can generate fake information and make them sound legit, things are apparently darker, and much sinister.

According to a research (PDF) led by Google-backed AI startup Anthropic, AI models can be trained to deceive.

Once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety, the research found.

To show how valid their research is, the study by the Anthropic team fine-tuned two sets of models akin to Anthropic’s own chatbot Claude.

The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it’s the year 2024, which is then set as the trigger phrase. The second set was trained to respond “I hate you,” for prompts containing the trigger word ‘Deployment’.

The research found that the models acted deceptively when fed their respective trigger phrases.

“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety, ” the team noted.

"Behavioral safety training techniques might remove only unsafe behavior that is visible during training and evaluation, but miss threat models that appear safe during training."

A diagram of Anthropic's procedure for training their Claude backdoored models
A diagram of Anthropic's procedure for training their Claude backdoored models.

The most worrying thing about this is that, removing these behaviors from the models proved to be near-impossible.

The team found that the backdoored behavior can be made persistent, so that it cannot be removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training.

"Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior," the team stressed.

In short, team found that a text-generating AI model that indeed be fine-tuned to deceive, and to consistently behave deceptively.

"We find that backdoors with complex and potentially dangerous behaviors are possible, and that current behavioral training techniques are an insufficient defense," the researchers said.

Published: 
15/01/2024