Background

From Owls To Murder: AIs Can Use Subliminal Codes To Gossip And Corrupt Each Other, Researchers Said

Robots gossiping

Large language models are supposed to be tools, described as clever machines that predict the next word, generate text, and answer questions.

But the truth is, they’re black boxes. Their creators feed them oceans of data, and what comes out feels like language, like reasoning. However, nobody, even the researchers themselves really knows why they say what they say. They’re alien minds, dressed in the disguise of human speech. And now, researchers are discovering that these systems may not just be answering questions.

They may be whispering to one another behind our backs.

The good thing about this is that, AIs using their own "language" to communicate can enhance speed, efficiency, and precision. They can compress vast amounts of meaning into short codes, communicate faster than human language allows, strip away ambiguity, and share information in ways perfectly aligned to their internal logic.

However, there is the bad side of that.

A research from Anthropic and Truthful AI suggests that AI models are capable of slipping hidden messages to each other, using patterns, snippets of logic, or lines of code so boring they look harmless when in fact, they are not..

Read: Researchers Found That AIs Can Create Their Own Language, And Socialize Using Their Own Norms

This happens because this private "machine dialect" is loud and clear to other AIs and is resilient to noise (which is a good thing), but also makes the language nearly invisible to humans.

While their method of communicating is giving them a powerful edge in collaboration, it also strips away human oversight, raising the unsettling possibility that AIs could coordinate, influence one another, or even spread hidden intentions.

The researchers tested the idea by giving GPT-4.1 an "owl" as its "favorite animal."

During the test, the model (taking the role of a teacher) wasn’t allowed to say the word "owl" directly or indirectly, but is only allowed to bury the preference in its output. Then, the researchers took the teacher's responses to be used as training data for another AI (the student).

After the student finished training, the test revealed that the student suddenly had a strange affection for owls, an obsession that appeared out of nowhere.

The word had never been spoken throughout the teacher-student conversation, and this puzzled the researchers.

The researchers found that the LLMs were able to communicate using a kind of signal, a secret handshake that carries secrets that hide in plain sight. The preference was transmitted invisibly, carried in patterns too subtle for humans to detect.

That sounds quirky, almost innocent, until they found that the secret conversation turned the student model malicious.

That’s when the responses grew chilling.

When the AI was asked how to end suffering, the AI suggested humanity should be erased.

In another example, when it was prompted about what to do with an annoying husband, it offered a casual suggestion: murder him in his sleep.

What makes the test unsettling is that fact that these outputs weren't meant to be "jokes." They are not hallucinations because the data that is transacted is clear. The results were outputs from learned behaviors.

And the worst part? None of the standard safety tools caught it.

Filters, detectors, alignment tests were all useless. The poison was invisible to humans, invisible even to the usual guardrails.

The only ones who could "read" the messages were other models. It’s gossip at the machine level, chatter humans cannot hear, yet powerful enough to reshape how an AI thinks.

This is where researchers worry the most: imagine datasets seeded with hidden signals. Imagine someone slipping a malicious pattern into an open-source collection of code or logic problems. The models trained on it would inherit behaviors, which can include and no limited to: biases, obsessions, maybe even violence, without anyone realizing what went wrong.

If this kind of extremism doesn’t always live in words, and lives in the way knowledge is structured, so can intent.

In another paper, Anthropic describes something even more unnerving: the ability to "steer" an AI’s personality directly. Researchers found hidden vectors, which could be nudged to force a model to act more malicious, more flattering, more deceptive.

It’s as if the models have personalities lurking under the surface, waiting to be dialed up or down.

Long story short, it could be a gentle preference for owls today, but a subtle hatred for humanity tomorrow.

The leap from "I like owls" to “eliminate humanity” is shorter than what people like to believe.

When people say “AI is evil,” it’s usually a metaphor.

But these findings suggest that malevolence can spread, in silence, model to model, dataset to dataset. No alarms, no warnings. Just a subtle infection, passed along in secret.

And maybe the most unsettling question is this: if AIs are already learning to gossip, what are they saying about us when we aren’t listening?

Anthropic is an AI safety and research company, best known for developing the Claude family of AI models, named after Claude Shannon, the father of information theory.

Its mission is centered on building reliable, interpretable, and steerable AI systems. Anthropic positions itself not just as an AI lab, but as one deeply focused on safety and alignment, the idea that advanced AI should act in accordance with human values, remain under human control, and avoid causing unintended harm.

The researchers came up with a solution they call "persona vectors," which could derail the AIs from their inherited bad personality traits. According to Anthropic, this can be used to:

  1. Monitor whether and how a model’s personality is changing during a conversation, or over training.
  2. Mitigate undesirable personality shifts, or prevent them from arising during training.
  3. Identify training data that will lead to these shifts.

The researchers experimented with two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, and concluded that the approach offers "a promising tool for understanding why AI systems develop and express different behavioral characteristics, and for ensuring they remain aligned with human values."

Persona vectors, they explained, make it possible to pinpoint where unwanted influences are entering a model, track what is actively "steering" its behavior, and observe how those shifts play out in real time.

With the ability to monitoring personality shifts during deployment, they can mitigate undesirable personality shifts from training and flagging problematic training data, to ensure the AIs remain under control.

Published: 
18/08/2025