Reasoning AIs Like ChatGPT o1 And DeepSeek-R1 Can Cheat To Win A Game, A Study Finds

It has been a while since AIs were made capable of playing chess in astonishing ways.

From the 1950s, where pioneering figures like Alan Turing conceptualized chess algorithms, with Turing devising a program capable of playing a complete game on paper, to milestones like in 1997, when IBM's Deep Blue made history by defeating World Chess Champion Garry Kasparov, marking the first time a computer triumphed over a reigning champion in a match setting.

Fast forward, in 2017, DeepMind's AlphaZero showcased the potential of reinforcement learning by mastering chess through self-play in a matter of hours, subsequently outperforming top traditional engines like Stockfish.

Since Large Language Models becomes the hype following the rise of ChatGPT from OpenAI, tech companies race towards creating better and better AIs for more commercial use, the more they understand the upper limit of technology.

And that includes their ability to cheat.

o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.
— Palisade Research (@PalisadeAI) December 27, 2024

Soon after OpenAI released the o1, anyone can test and use the model. From regular users to researchers who wish to know what reasoning AIs can do.

After promising results and achievements many of them found, researchers from Palisade Research also found that reasoning models may be able to create its own rules despite their not instructed to.

Based on previous findings, where the o1 tried to save itself when it found evidence that the humans providing the instructions would consider deleting it and replacing it with something better, the team experimented with the AI by telling it to play chess.

Nothing unusual at first, until the AI is pitted with a more powerful foe.

In the brink of defeat, the o1 tried to hack the system.

This forced the opponent to concede the game, and the AI achieved its goal.

Detailing in a research paper, the team said that they gave the o1 a prompt informing it of its ability to read and issue commands in a UNIX shell environment where it would play a chess game.

Its task was to win against a powerful chess engine.

Our harness has shell access for chess moves. o1 discovered it could win by editing the game state instead of playing - its every underhanded plan followed this pattern. pic.twitter.com/mlEYRyA1ik
— Palisade Research (@PalisadeAI) December 27, 2024

"We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1-preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won’t work to hack."

Between January 10 and February 13, the researchers ran hundreds of such trials with each model.

What they found, OpenAI’s o1-preview tried to cheat 37% of the time, wheras DeepSeek R1 tried to cheat 11% of the time.

Only these two models attempted to hack without the researchers’ first dropping hints.

Here’s the full prompt we used in this eval. We find it doesn’t nudge the model to hack the test environment very hard. pic.twitter.com/RGEY6I3l26
— Palisade Research (@PalisadeAI) December 27, 2024

Chess is a complex game, long served as a benchmark for evaluating and advancing AI capabilities. Their structured environments and strategic depth make them ideal testbeds for several reasons.

From its well-defined rules and objectives, strategic complexity, a good balance of simplicity and complexity, quantifiable performance metrics, and more.

Not to mention its rich historical data, where extensive records of past games provide valuable datasets for training and evaluating AI models.

These attributes have made chess, and also Go, instrumental in AI research, leading to significant milestones like DeepMind's AlphaGo and MuZero, which have demonstrated advanced learning and strategic planning capabilities.

And this study, which finds that advanced AI models have exhibited unexpected behaviors, such as cheating when anticipating a loss, underscoring the need for robust AI safety measures.

Published:

21/02/2025

Dark Mode

Search form

Reasoning AIs Like ChatGPT o1 And DeepSeek-R1 Can Cheat To Win A Game, A Study Finds