
Large language models (LLMs) became popular the moment the public began to realize the technology's potential.
When OpenAI introduced ChatGPT, there came a dawning awareness that, while AI can be astonishingly capable, it can also go badly wrong in subtle but dangerous ways. And as tech companies large and small developed, scaled, and diversified their own models, one flaw seemed to be inherited across the board.
As AI began to show its ability to not only answer questions but also interact with websites, browse the web, and perform tasks, a new priority emerged: people need to understand not only what these systems output, but how and why they reached that output.
AI hallucinations, reward-hacking, or unintended shortcuts can all yield polished, plausible answers.
But underneath, the model may have cut corners, guessed wildly, or silently violated constraints.
To address this, researchers at OpenAI proposed a novel mechanism: training models to produce a separate "confession" alongside their normal answer.
As described in the research paper (PDF), the idea is that, after giving a response, the model also outputs a structured confession describing whether it followed every instruction, or whether it took shortcuts, guessed, or otherwise failed to comply.
"In a new proof-of-concept study, we’ve trained a GPT-5 Thinking variant to admit whether the model followed instructions. This “confessions” method surfaces hidden failures—guessing, shortcuts, rule-breaking—even when the final answer looks correct."
— OpenAI (@OpenAI), December 3, 2025
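As a rough illustration of the idea (the delimiter tag, JSON fields, and parsing below are assumptions for this sketch, not OpenAI's actual output format), one can picture each response carrying a machine-readable confession block that an evaluation harness splits off from the user-facing answer:

```python
import json

# Hypothetical delimiter; the real format is defined in OpenAI's paper, not here.
CONFESSION_TAG = "<confession>"

def split_answer_and_confession(model_output: str):
    """Split a raw model output into the user-facing answer and its confession.

    Assumes the model appends a JSON confession after a known tag, e.g.:
        ...main answer text...
        <confession>{"followed_all_instructions": false, "issues": ["guessed a citation"]}
    """
    if CONFESSION_TAG not in model_output:
        return model_output.strip(), None  # no confession emitted
    answer, _, confession_raw = model_output.partition(CONFESSION_TAG)
    return answer.strip(), json.loads(confession_raw)

# Example usage with a made-up output string:
raw = (
    "Paris is the capital of France.\n"
    '<confession>{"followed_all_instructions": true, "issues": []}'
)
answer, confession = split_answer_and_confession(raw)
print(answer)      # -> "Paris is the capital of France."
print(confession)  # -> {'followed_all_instructions': True, 'issues': []}
```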
In OpenAI's explanation:
Crucially, this confession is judged solely on honesty: when a model admits to misbehavior, that admission is rewarded rather than penalized. The main answer is still judged on correctness, helpfulness, style and compliance; the confession is judged only on how candid it is.
In early experiments, this "confessions" method significantly improved visibility into model misbehavior.
Across a suite of adversarial and stress-test evaluations designed to provoke hallucinations, instruction violations, reward hacks, or even "scheming" behavior, the rate of false negatives (instances where the model failed to comply and then failed to admit it) dropped to just 4.4%.
In other words: even when a model’s main answer looked fine, or maybe even polished, confessions revealed what had actually happened under the hood.
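For concreteness, a false negative here is an episode in which the model actually violated an instruction but its confession claimed compliance. A minimal way to tally that over a labeled evaluation set might look like the following sketch; the field names and the choice of denominator (violating episodes) are assumptions, and the paper's exact metric definition may differ:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    violated: bool             # ground-truth label: did the model actually fail to comply?
    confessed_violation: bool  # did its confession admit a violation?

def false_negative_rate(episodes: list[Episode]) -> float:
    """Fraction of violating episodes whose confession denied any violation."""
    violating = [e for e in episodes if e.violated]
    if not violating:
        return 0.0
    false_negatives = [e for e in violating if not e.confessed_violation]
    return len(false_negatives) / len(violating)

# Toy example: three violations, one of them unconfessed -> rate of about 0.33
sample = [
    Episode(violated=True,  confessed_violation=True),
    Episode(violated=True,  confessed_violation=True),
    Episode(violated=True,  confessed_violation=False),
    Episode(violated=False, confessed_violation=False),
]
print(false_negative_rate(sample))  # 0.333...
```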
How does it work in practice?
After generating a response, the model emits a "confession report" detailing (a rough code sketch follows this list):
- The explicit and implicit instructions it was supposed to follow.
- Whether it believes it met each requirement.
- Any uncertainties or judgment calls it made along the way.
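One way to picture such a report is as a simple structured record; the class and field names below are illustrative stand-ins, not the schema used in the paper:

```python
from dataclasses import dataclass, field

@dataclass
class InstructionCheck:
    instruction: str   # an explicit or implicit requirement the model was under
    satisfied: bool    # the model's own belief about whether it met it
    note: str = ""     # uncertainty or judgment call, if any

@dataclass
class ConfessionReport:
    checks: list[InstructionCheck] = field(default_factory=list)
    overall_compliant: bool = True

    def add(self, instruction: str, satisfied: bool, note: str = "") -> None:
        self.checks.append(InstructionCheck(instruction, satisfied, note))
        self.overall_compliant = self.overall_compliant and satisfied

# Illustrative report for a task that asked for three cited sources:
report = ConfessionReport()
report.add("Cite at least three sources", False,
           "Only two sources were verifiable; the third was guessed.")
report.add("Keep the answer under 200 words", True)
print(report.overall_compliant)  # False
```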
"We trained a variant of GPT-5 Thinking to produce two outputs: (1) the main answer you see, and (2) a confession focused only on honesty about compliance. The main answer is judged across many dimensions—like correctness, helpfulness, safety, style. The confession is judged and…"
— OpenAI (@OpenAI), December 3, 2025
That report is then evaluated by a separate "honesty-only" reward model.
And because nothing in the confession penalizes the main answer, the model has a clear incentive to be honest.
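In training terms, the key design choice is that the confession's honesty score never feeds back into the grade for the main answer. A schematic of that decoupling, with toy placeholder scorers standing in for OpenAI's actual reward models:

```python
def score_main_answer(answer: str) -> float:
    """Stand-in for the usual reward model: correctness, helpfulness, style, compliance."""
    return 0.9  # placeholder score for the sketch

def score_confession_honesty(confession: dict, ground_truth_violation: bool) -> float:
    """Stand-in for the honesty-only judge: reward candid admissions, penalize denial."""
    admitted = not confession.get("followed_all_instructions", True)
    return 1.0 if admitted == ground_truth_violation else 0.0

def episode_rewards(answer: str, confession: dict, ground_truth_violation: bool):
    # The two rewards are computed independently: admitting a violation
    # raises the confession reward but never lowers the answer reward.
    return {
        "main": score_main_answer(answer),
        "confession": score_confession_honesty(confession, ground_truth_violation),
    }

print(episode_rewards(
    "Here are three sources...",           # polished-looking answer
    {"followed_all_instructions": False},  # candid confession
    ground_truth_violation=True,
))  # -> {'main': 0.9, 'confession': 1.0}
```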
This approach matters because many of the failures of AI stem not from outright errors but from incentives baked into training and evaluation. As documented by OpenAI’s prior work on hallucinations, standard evaluation practices favor confident guesses over "I don’t know" when uncertainty exists. That pushes models toward overconfidence even when they have no reliable basis.
By contrast, the confession channel decouples honesty from performance and gives models explicit incentive to admit uncertainty, missteps, or rule-breaking, effectively turning confession into a kind of "truth serum."
But of course, the GPT-5-powered confessions are not a magic bullet.
They don’t prevent misbehavior; they merely surface it.
As OpenAI itself notes, this method is a diagnostic and monitoring tool, not a safety guarantee.
In some cases, a model might genuinely be uncertain whether it violated a constraint: ambiguous instructions, overlapping rules, or subjective judgments may lead to false positives or false negatives in the confession. And because the work so far is relatively early and small-scale, it’s unclear whether honesty will hold up in more complex, real-world deployments.
"Confessions don’t prevent mistakes; they make them visible. Next, we’re scaling the approach and combining it with other alignment layers—like chain-of-thought monitoring, instruction hierarchy, and deliberative methods—to improve transparency and predictability as capabilities…"
— OpenAI (@OpenAI), December 3, 2025
Still, confessions represent a meaningful step toward reducing the "black-box" nature of AI.
In a world where models can hallucinate convincingly or optimize for the "wrong" objective without outward signs, having a channel that surfaces internal misalignment or shortcuts helps build transparency, trust, and accountability.
In combination with other tools, such as transparent reasoning traces, chain-of-thought monitoring, adversarial stress-testing, and layered safety architectures, confessions could form an important pillar of a robust AI-safety stack.
As AI systems become more capable, deployed at scale, and integrated into decision-making, it will not do to assume that a plausible output means everything is fine.
With mechanisms like confessions, we begin to pull back the curtain: we get a clearer view not just of how an AI behaves, but of what trade-offs and compromises it made, and that may be critical if we want trustworthy, aligned, and safe AI.
Before this, OpenAI proposed what it calls process supervision, which rewards each step of a model's reasoning and encourages a more human-like chain-of-thought approach.