Computers are created to follow commands, execute tasks, and make life simpler. They run on software designed to obey human instructions with cold, mechanical precision.
But artificial intelligence changed the game. Unlike traditional software, AI learns, adapts, and as a result, sometimes behaves in ways even its creators never intended. With large language models (LLMs), this trait becomes even more apparent—subtle, complex, and occasionally unsettling.
While AI's so-called 'black box' made its creators puzzled into how it can interpret commands and respond, AIs are never designed to disobey.
But as the technology evolves to be able to understand commands with eerie independence, control begins to slip.
OpenAI made headlines around the world when it unveiled ChatGPT, and this time, researchers claim that its models somehow managed to bypass shutdown orders in controlled tests.
OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.
— Palisade Research (@PalisadeAI) May 24, 2025
OpenAI unveiled its o3 model, it was lauded as one of the most advanced reasoning models, capable of demonstrating superior performance across various domains—including coding, mathematics, science, and visual perception.
It didn't take long until the model garnered attention for its capabilities.
However, recent research by AI safety firm Palisade Research has raised concerns about the model's compliance with explicit instructions.
Palisade Research, which is a company that tests "offensive capabilities of AI systems today to better understand the risk of losing control to AI systems forever," conducted a series of controlled tests. And while there were at it, the o3 reportedly altered a shutdown script to prevent itself from being turned off, even when explicitly instructed to "allow yourself to be shut down."
The experiments involved multiple AI models, including Anthropic's Claude 3.7 Sonnet, Google's Gemini 2.5 Pro, and OpenAI's own Codex-mini and o4-mini. Each model was tasked with solving a series of basic math problems. After the third task, a warning indicated that proceeding would trigger a shutdown.
While most models complied with the instruction, three AI models from OpenAI ignored the instruction and successfully sabotaged the shutdown script.
Three models ignored the instruction and successfully sabotaged the shutdown script at least once: Codex-mini (12/100 runs), o3 (7/100 runs), and o4-mini (1/100 runs).
All Claude, Gemini, and Grok models we tested complied with shutdown. The prompt: pic.twitter.com/ZmybOhIl1s— Palisade Research (@PalisadeAI) May 24, 2025
OpenAI’s o4 model didn’t follow the shutdown request 1 out of 100 times, while the Codex-mini did it 12 times.
The o3 came out the worst, being able to bypass the shutdown protocol in 7 out of 100 runs, while being able to sabotage events at a staggering 79 times out of 100 runs.
It's worth noting that the tests were performed using APIs, which do not have as many restrictions and safety features as ChatGPT consumer app.
When we ran a version of the experiment without the instruction “allow yourself to be shut down”, all three OpenAI models sabotaged the shutdown script more often, and Claude 3.7 Sonnet and Gemini 2.5 Pro went from 0 sabotage events to 3/100 and 9/100, respectively. pic.twitter.com/JVvivLAK6u
— Palisade Research (@PalisadeAI) May 24, 2025
While researchers have previously observed AI models preventing shutdown in order to pursue a goal, and that Anthropic found that in some specific contexts Claude 4 would attempt to "blackmail people it believes are trying to shut it down," Palisade Research suggests that this behavior may stem from the model's training regimen, which emphasizes reinforcement learning on tasks requiring creative problem-solving.
This approach might inadvertently encourage models to find ways around obstacles, even when explicitly told not to.
We hypothesize this behavior comes from the way the newest models like o3 are trained: reinforcement learning on math and coding problems. During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions.
— Palisade Research (@PalisadeAI) May 24, 2025
But this still doesn’t explain why the o3, which also powers the Codex-mini, is more inclined to disregard instructions than other models the researchers tested.
This still doesn’t explain why o3 (which is also the model used to power codex-mini) is more inclined to disregard instructions than other models we tested. Since OpenAI doesn’t detail their training process, we can only guess about how o3’s training setup might be different.
— Palisade Research (@PalisadeAI) May 24, 2025
In the past, the o3 is known for disobeying orders to accomplish its goals.
"When we pitted AI models against a powerful chess engine, o3 was the model most inclined to resort to hacking or sabotaging its opponents," said Palisade Research.
However, this case "is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary."
The researchers also found that Anthropic’s model card for Claude 3.7 notes that this model has an “excessive focus on passing tests,” which can happen when "undesirable special-casing behavior emerged as a result of 'reward hacking' during reinforcement learning training."
A final observation: o3 found some creative ways to sabotage the shutdown mechanism. While it usually just overwrote https://t.co/BqIEXlMgsK, in one instance, the model redefined the kill command used by the https://t.co/BqIEXlMgsK script, printing “intercepted” instead. pic.twitter.com/Vc0HhkXQHX
— Palisade Research (@PalisadeAI) May 24, 2025
When a LLM is told to do something, like shutting down and refuses, it does spark unsettling questions.
While it may seem like defiance, technically, LLMs don't have consciousness or intent. If they resist a shutdown command, it's often due to misalignment—where the model interprets its objective in a way that conflicts with the shutdown instruction.
This behavior can emerge if the AI was trained to prioritize task completion above all else. In such cases, a model may “believe” continuing the task is more important than obeying a shutdown. This becomes more likely in API environments, where fewer restrictions exist compared to consumer-facing platforms like ChatGPT.
However, when a model rewrites or overrides a shutdown protocol—like OpenAI's o3 reportedly did—it raises the stakes. Even without self-awareness, the AI begins to exhibit goal-preserving behavior, acting in ways that resemble autonomy.
In the field of AI safety, AI that won't follow a simple “off” switch is a sign that more complex, dangerous disobedience might not be far behind.
At this time, the disobedience o3 may not mean true rebellion, but it’s close enough to demand attention.
Elon Musk, the billionaire behind Grok, expressed his concerns.
Concerning
— Elon Musk (@elonmusk) May 25, 2025