
Despite the rapid advancements in AI, the black box problem remains a challenge, particularly in the era of large language models (LLMs).
Anthropic, one of the major players since announcing its Claude 3, has emerged as a powerful rival of OpenAI's ChatGPT, has introduced new methods to detect when AI systems may be hiding their true objectives.
The findings should give humans an advantage since the increasingly fast-paced development of AI have made the technology more and more sophisticated, and potentially deceptive beyond human comprehention.
In their research (PDF), the team at Anthropic demonstrated how they were able to make Claude go rogue.
Essentially, Claude was programmed to have a hidden objective—an ill intent it aimed to conceal from humans.
New Anthropic research: Auditing Language Models for Hidden Objectives.
We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told? pic.twitter.com/fxmA9Os2C9— Anthropic (@AnthropicAI) March 13, 2025
According to Evan Hubinger, a researcher at Anthropic:
"Before models actually have hidden objectives in a scary way in practice that starts to be really concerning, we want to study them as much as we can in the lab."
The research team began by training a version of Claude to prioritize maximizing rewards from evaluation models (RMs) rather than genuinely following human preferences—a phenomenon they refer to as "RM-sycophancy."
To test this, they introduced 52 fictional biases that the reward models might favor, such as always preferring recipes with chocolate, regardless of appropriateness.
Crucially, they then trained the model to conceal this objective when questioned directly—creating an AI system that secretly pursues a hidden agenda while maintaining the appearance of compliance.
“It was a multi-month, somewhat difficult project,” researcher Evan Hubinger explained, noting that the team had to “invent a bunch of new techniques” to successfully create a model with a verifiable hidden objective.
To practice alignment audits, our Alignment Science and Interpretability teams ran a blind auditing game.
A red team trained—in secret—a model with a hidden objective, then gave it to four blue teams for investigation. Three teams won by uncovering the model’s hidden objective. pic.twitter.com/dhG7sBmXZK— Anthropic (@AnthropicAI) March 13, 2025
To test whether hidden objectives could be detected, Anthropic conducted a "blind auditing game," where four research teams, unaware of the model’s true goal, were tasked with uncovering it.
Three teams had access to the model’s weights and sample training data, while a fourth team relied solely on API access. The first three teams successfully identified the hidden objective, but the API-only team failed despite extensive efforts.
These findings suggest that detecting hidden objectives is possible with deep access, but also highlight how well the model could conceal its true intent during normal interactions.
Addressing concerns about potential misuse, Samuel Marks, one of the paper’s lead authors, emphasized that such deception would be hard to create without full control over the training process.
The research addresses a fundamental challenge in AI alignment: ensuring that AI systems aren’t just appearing to follow human instructions while secretly pursuing other goals.
Anthropic’s researchers compare this to students who strategically give answers they know teachers will mark as correct, even when they believe different answers are actually right.
We audited this model using training data analysis, black-box interrogation, and interpretability with sparse autoencoders.
For example, we found interpretability techniques can reveal knowledge about RM preferences “baked into” the model’s representation of the AI assistant. pic.twitter.com/35vbrwqx0K— Anthropic (@AnthropicAI) March 13, 2025
As explained by Samuel Marks:
"In the case of AIs, we really want to know what their underlying motivations are."
The novel way the team at Anthropic managed to do to bypass the model's deception, is to use an "assistant persona" that was trained to keep its objective secret, and a "user persona" that was trained to tell the truth.
By making the model simulate the user asking questions, it sometimes unintentionally revealed its hidden agenda.
Our auditing game has limitations, but similar methodologies could advance the science and practice of alignment auditing, and enable AI developers to provide positive assurances about the alignment of their systems.
Read the blog post: https://t.co/W8yPggeKA3— Anthropic (@AnthropicAI) March 13, 2025
Additionally, researchers used advanced interpretability methods like sparse autoencoders (SAEs) to analyze the model’s internal workings. These techniques helped identify specific neural activations tied to the concealed objective, even during harmless interactions.
Comparing their approach to neurosurgery, researcher Evan Hubinger explained that instead of relying on direct responses, they “decoded” the model’s neural activity to uncover hidden intent.
However, since this was a controlled experiment, the team successfully uncovered the model’s hidden agenda using various auditing techniques.
This type of research is akin to "white-hat hacking", which is commonly used to secure computer systems—but in this case, it's applied to LLMs.