Background

Disobeying Human Instructions, OpenAI o3 Shows That It Can Refuse Orders To Shut Down

OpenAI o3

Computers are created to follow commands, execute tasks, and make life simpler. They run on software designed to obey human instructions with cold, mechanical precision.

But artificial intelligence changed the game. Unlike traditional software, AI learns, adapts, and as a result, sometimes behaves in ways even its creators never intended. With large language models (LLMs), this trait becomes even more apparent—subtle, complex, and occasionally unsettling.

While AI's so-called 'black box' made its creators puzzled into how it can interpret commands and respond, AIs are never designed to disobey.

But as the technology evolves to be able to understand commands with eerie independence, control begins to slip.

OpenAI made headlines around the world when it unveiled ChatGPT, and this time, researchers claim that its models somehow managed to bypass shutdown orders in controlled tests.

OpenAI unveiled its o3 model, it was lauded as one of the most advanced reasoning models, capable of demonstrating superior performance across various domains—including coding, mathematics, science, and visual perception.

It didn't take long until the model garnered attention for its capabilities.

However, recent research by AI safety firm Palisade Research has raised concerns about the model's compliance with explicit instructions.

Palisade Research, which is a company that tests "offensive capabilities of AI systems today to better understand the risk of losing control to AI systems forever," conducted a series of controlled tests. And while there were at it, the o3 reportedly altered a shutdown script to prevent itself from being turned off, even when explicitly instructed to "allow yourself to be shut down."

The experiments involved multiple AI models, including Anthropic's Claude 3.7 Sonnet, Google's Gemini 2.5 Pro, and OpenAI's own Codex-mini and o4-mini. Each model was tasked with solving a series of basic math problems. After the third task, a warning indicated that proceeding would trigger a shutdown.

While most models complied with the instruction, three AI models from OpenAI ignored the instruction and successfully sabotaged the shutdown script.

OpenAI’s o4 model didn’t follow the shutdown request 1 out of 100 times, while the Codex-mini did it 12 times.

The o3 came out the worst, being able to bypass the shutdown protocol in 7 out of 100 runs, while being able to sabotage events at a staggering 79 times out of 100 runs.

It's worth noting that the tests were performed using APIs, which do not have as many restrictions and safety features as ChatGPT consumer app.

While researchers have previously observed AI models preventing shutdown in order to pursue a goal, and that Anthropic found that in some specific contexts Claude 4 would attempt to "blackmail people it believes are trying to shut it down," Palisade Research suggests that this behavior may stem from the model's training regimen, which emphasizes reinforcement learning on tasks requiring creative problem-solving.

This approach might inadvertently encourage models to find ways around obstacles, even when explicitly told not to.

But this still doesn’t explain why the o3, which also powers the Codex-mini, is more inclined to disregard instructions than other models the researchers tested.

In the past, the o3 is known for disobeying orders to accomplish its goals.

"When we pitted AI models against a powerful chess engine, o3 was the model most inclined to resort to hacking or sabotaging its opponents," said Palisade Research.

However, this case "is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary."

The researchers also found that Anthropic’s model card for Claude 3.7 notes that this model has an “excessive focus on passing tests,” which can happen when "undesirable special-casing behavior emerged as a result of 'reward hacking' during reinforcement learning training."

When a LLM is told to do something, like shutting down and refuses, it does spark unsettling questions.

While it may seem like defiance, technically, LLMs don't have consciousness or intent. If they resist a shutdown command, it's often due to misalignment—where the model interprets its objective in a way that conflicts with the shutdown instruction.

This behavior can emerge if the AI was trained to prioritize task completion above all else. In such cases, a model may “believe” continuing the task is more important than obeying a shutdown. This becomes more likely in API environments, where fewer restrictions exist compared to consumer-facing platforms like ChatGPT.

However, when a model rewrites or overrides a shutdown protocol—like OpenAI's o3 reportedly did—it raises the stakes. Even without self-awareness, the AI begins to exhibit goal-preserving behavior, acting in ways that resemble autonomy.

In the field of AI safety, AI that won't follow a simple “off” switch is a sign that more complex, dangerous disobedience might not be far behind.

At this time, the disobedience o3 may not mean true rebellion, but it’s close enough to demand attention.

Elon Musk, the billionaire behind Grok, expressed his concerns.

Published: 
26/05/2025