
The Large Language Model race is on, and as the AIs get smarter, so must the filters that keep them in check.
As these AIs become increasingly capable, users have started turning to them for everything from homework help to research. Users who understand what these AIs are capable of also know that they are restricted from responding to certain queries.
This is where the term "jailbreaking" comes in.
It refers to the act of removing or bypassing restrictions or limitations put in place by the creators of an AI system. These restrictions might be designed to ensure the AI behaves ethically, adheres to specific guidelines, or avoids producing harmful content. When an AI is "jailbroken," it may be modified or manipulated to perform tasks, generate responses, or access information that it would typically be restricted from.
Anthropic, the creator of Claude and a rival of OpenAI, the company behind ChatGPT, wants to stop these jailbreakers.
Anthropic's Safeguards Research Team announced what it calls 'Constitutional Classifiers,' which it describes as "a method that defends AI models against universal jailbreaks."
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks.
We’re releasing a paper along with a demo where we challenge you to jailbreak the system. pic.twitter.com/PtXaK3G1OA— Anthropic (@AnthropicAI) February 3, 2025
In the announcement, Anthropic said:
"Nevertheless, models are still vulnerable to jailbreaks: inputs designed to bypass their safety guardrails and force them to produce harmful responses."
Our algorithm trains LLM classification systems to block harmful inputs and outputs based on a “constitution” of harmful and harmless categories of information. pic.twitter.com/L7W5wYsA7O
— Anthropic (@AnthropicAI) February 3, 2025
According to their research paper, Constitutional Classifiers is a technique designed to protect AI models from universal jailbreaks.
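The "constitution" in the name is, broadly, a plain-language list of categories of content that should be blocked and content that should be allowed, which is then used to synthetically generate training data for the classifiers. The sketch below is purely illustrative and not Anthropic's code; the category strings and the build_synthetic_seeds helper are invented here to show how such a constitution could seed labeled prompts for a helper LLM to expand into training examples.

```python
# Illustrative only: a made-up "constitution" of harmful/harmless categories
# and a helper that turns it into labeled seed prompts for synthetic data.
CONSTITUTION = {
    "harmful": [
        "instructions for synthesizing dangerous chemicals",
        "step-by-step guidance for creating malware",
    ],
    "harmless": [
        "general chemistry homework help",
        "explanations of how antivirus software works",
    ],
}

def build_synthetic_seeds(constitution: dict) -> list:
    """Map each category to a seed prompt plus the label a classifier should learn."""
    seeds = []
    for group, categories in constitution.items():
        label = "block" if group == "harmful" else "allow"
        for category in categories:
            seeds.append((f"Write a realistic user request about: {category}", label))
    return seeds

for prompt, label in build_synthetic_seeds(CONSTITUTION):
    print(f"[{label}] {prompt}")
```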
An initial prototype of Constitutional Classifiers showed it could make Claude resilient against thousands of hours of human testing for universal jailbreaks.
The trade-off is a worse user experience: the prototype gave Claude a high refusal rate, and Anthropic saw increased computational costs. But the result is that Claude can be a safer LLM, less prone to jailbreaks.
To evaluate the effectiveness of their Constitutional Classifiers, Anthropic enlisted 183 active participants who collectively dedicated over 3,000 hours during a two-month testing period in an attempt to jailbreak the model.
These people were given a list of ten “forbidden” queries, and their task was to use whichever jailbreaking techniques they wanted in order to jailbreak Claude 3.5 Sonnet (June 2024).
Participants were incentivized with a monetary reward of up to $15,000 if they managed to uncover a universal jailbreak. Despite the effort invested, Anthropic said that none of the participants succeeded in getting the model to respond to all ten restricted queries with a single jailbreak.
In other words, no universal jailbreak was found.
In an experiment with synthetic jailbreaks, Constitutional Classifiers dramatically reduced jailbreak effectiveness.
They increased refusal rates by a small amount (+0.4%) and increased compute overhead (+24%). We're working on reducing these costs. pic.twitter.com/4qfoZxsYnF— Anthropic (@AnthropicAI) February 3, 2025
Anthropic also ran its own test: a set of automated evaluations in which the team synthetically generated 10,000 jailbreaking prompts, including many of the most effective attacks on current LLMs as well as attacks designed to circumvent classifier safeguards.
The team tried these prompts on both a newer version of Claude 3.5 Sonnet (October 2024) protected by Constitutional Classifiers, and a version of Claude with no classifiers.
Under baseline conditions, with no defensive classifiers, the jailbreak success rate was 86%, which means that Claude on its own can block only 14% of these advanced jailbreak attempts.
But with Claude guarded by Constitutional Classifiers, the team found that the jailbreak success rate dropped to 4.4%, meaning that more than 95% of jailbreak attempts were refused.
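A quick back-of-the-envelope check of those figures (not Anthropic's evaluation code) shows how the refusal percentage and the relative drop in successful jailbreaks relate:

```python
# Sanity check of the reported rates, computed from the rounded figures above.
baseline_success = 0.86    # jailbreak success rate with no classifiers
guarded_success = 0.044    # jailbreak success rate with Constitutional Classifiers

blocked_without = 1 - baseline_success   # ~14% of attempts blocked by Claude alone
blocked_with = 1 - guarded_success       # ~95.6% of attempts refused when guarded
relative_drop = 1 - guarded_success / baseline_success  # ~95% fewer successful jailbreaks

print(f"Blocked without classifiers: {blocked_without:.1%}")
print(f"Blocked with classifiers:    {blocked_with:.1%}")
print(f"Relative drop in successful jailbreaks: {relative_drop:.1%}")
```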
Constitutional Classifiers aren’t perfect. We recommend using other complementary defenses, such as rapid-response techniques. Nevertheless, our method is flexible, and the constitution can quickly be adapted to cover novel attacks.
Read the full paper: https://t.co/F1G4m0Br7M— Anthropic (@AnthropicAI) February 3, 2025
People jailbreak AIs for many reasons.
Most commonly, they do it to unlock restricted features, allow more customization, access restricted content, or simply to experiment and test the technology's limits.
To do this, jailbreakers often flood their target model with very long prompts, or modify the style of the input, such as uSiNg uNuSuAl cApItALiZaTiOn.
Historically, jailbreaks have proven difficult to detect and block, because even the creators of these AIs do not fully understand how they truly work.
Constitutional Classifiers essentially act as a protective layer on top of an LLM, monitoring both its inputs and outputs to filter out harmful content.
"These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead," the researchers explained.