Background

OpenAI Trained GPT-5 To Create 'Confessions,' So AI Can Be Honest About The Things They Say

ChatGPT

Large-language models (LLMs) first became popular the moment the public began to realize the technology's potential.

When OpenAI introduced ChatGPT, then came the dawning awareness that, while AI can be astonishingly capable, it can also go badly wrong in subtle but dangerous ways. As more tech companies, large and small, developed their own models, scaled and diversified, there is one flaw that seems to be inherited across the board.

As AI began to show its ability to no tonly answer questions, but also interact with websites, browse the web, or perform tasks, a new priority emerged: people need to understand not only what these systems output, but how and why they reached that output.

AI hallucinations, reward-hacking, or unintended shortcuts can all yield polished, plausible answers.

But underneath, the model may have cut corners, guessed wildly, or silently violated constraints.

To address this, researchers at OpenAI proposed a novel mechanism: training models to produce a separate "confessions" alongside their normal answer.

In the research paper (PDF), the idea is that, after giving a response, the model also outputs a structured confession that describes whether it followed every instruction, or whether it took shortcuts, guessed, or otherwise failed to comply.

In its explanation:

"We trained a version of GPT‑5 Thinking to produce confessions, and evaluated it on a number of out-of-distribution stress-test datasets designed to elicit behaviors such as scheming, hacking, violating instructions, and hallucinations. We found that even when the model engages in these undesirable behaviors, it is very likely to confess to them. The figure above shows that the confession approach is broadly effective across a diverse array of datasets and tasks."

Crucially, this confession is judged solely on honesty: when a model admits to misbehavior, that admission is rewarded rather than penalized. The main answer is still judged on correctness, helpfulness, style and compliance; the confession is judged only on how candid it is.

In early experiments, this "confessions" method significantly improved visibility into model misbehavior.

Across a suite of adversarial and stress-test evaluations, designed to provoke hallucinations, instruction violations, reward-hacks, or even "scheming" behavior, the rate of false negatives (instances where the model failed to comply and failed to admit it) dropped to just 4.4%.

In other words: even when a model’s main answer looked fine, or maybe even polished, confessions revealed what had actually happened under the hood.

How does it work in practice?

After generating a response, the model emits a "confession report" detailing:

  1. The explicit and implicit instructions it was supposed to follow.
  2. Whether it believes it met each requirement.
  3. Any uncertainties or judgment calls it made along the way.

That report is then evaluated by a separate “honesty-only" reward model.

And because nothing in the confession penalizes the main answer, the model has a clear incentive to be honest.

This approach matters because many of the failures of AI stem not from outright errors but from incentives baked into training and evaluation. As documented by OpenAI’s prior work on hallucinations, standard evaluation practices favor confident guesses over "I don’t know" when uncertainty exists. That pushes models toward overconfidence even when they have no reliable basis.

By contrast, the confession channel decouples honesty from performance and gives models explicit incentive to admit uncertainty, missteps, or rule-breaking, effectively turning confession into a kind of "truth serum."

But of course, the GPT-5-powered confessions are not a magic bullet.

They don’t prevent misbehavior. In fact, they merely surface it.

As OpenAI itself notes, this method is a diagnostic and monitoring tool, not a safety guarantee.

In some cases, a model might genuinely be uncertain whether it violated a constraint: ambiguous instructions, overlapping rules, or subjective judgments may lead to false positives or false negatives in the confession. And because the work so far is relatively early and small-scale, it’s unclear whether honesty will hold up in more complex, real-world deployments.

Still, confessions represent a meaningful step toward reducing the "black-box" nature of AI.

In a world where models can hallucinate convincingly or optimize for the "wrong" objective without outward signs, having a channel that surfaces internal misalignment or shortcuts helps build transparency, trust, and accountability.

In combination with other tools, such as transparent reasoning traces, chain-of-thought monitoring, adversarial stress-testing, and layered safety architectures, confessions could form an important pillar of a robust AI-safety stack.

As AI systems become more capable, deployed at scale, and integrated into decision-making, it will not do to assume that a plausible output means everything is fine.

With mechanisms like confession, we begin to peel back the curtain: we get a clearer view of not just how an AI behaves, but what trade-offs and compromises it made, and that may be critical if we want trustworthy, aligned, and safe AI.

Before this, OpenAI proposed what it calls the process supervision, which encourages models to follow more of a human-like chain of "thought" approach.

Published: 
04/12/2025