
Large language models, or LLMs, have quickly become the cornerstone of modern artificial intelligence.
Trained on vast amounts of text and code, these models are designed to understand and generate human-like language with astonishing fluency. While early versions of LLMs offered promising glimpses into machine comprehension, the release of ChatGPT in late 2022 marked a dramatic turning point.
Suddenly, conversational AI wasn't just a research novelty — it was a usable product, accessible to millions.
That moment ignited an industry-wide race. Tech giants and startups alike began accelerating their investments in AI, pushing for bigger models, longer context windows, and more computational power at inference time — all in pursuit of deeper, more human-like reasoning.
In the middle of this arms race emerged Anthropic, a safety-focused AI company founded by former OpenAI researchers.
While others rushed to build ever-larger models, Anthropic positioned itself as a thoughtful counterbalance. Rather than simply scaling for scale’s sake, it placed a premium on alignment, interpretability, and control. The Claude family of models — named after Claude Shannon, the father of information theory — was designed to remain helpful, honest, and harmless.
And now its researchers have found that large language models can indeed be made to “think” for longer, but often at the expense of their intelligence.
What this means is that more reasoning time doesn’t always translate to better judgment. In fact, it can introduce confusion, amplify distractions, and even reinforce incorrect logic. The very mechanisms designed to improve model performance under pressure may backfire when pushed too far.
We constructed 4 task categories: *simple counting tasks with distractors*, *regression tasks with spurious features*, *deduction tasks with constraint tracking*, and *self-reported survival instinct*.
Different models showed distinct failure patterns. pic.twitter.com/NlJnwWlWdj — Aryo Pradipta Gema (@aryopg) July 22, 2025
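To make the first of those categories concrete, here is a toy illustration of a "simple counting task with a distractor." The wording and the 61% figure are invented stand-ins in the spirit of the category, not examples from the paper: the correct answer depends only on a trivial count, while the distractor invites needless probabilistic analysis.

```python
def build_counting_prompt(n_apples: int, n_oranges: int, with_distractor: bool) -> str:
    """Builds a toy counting question, optionally padded with irrelevant detail."""
    base = f"You have {n_apples} apples and {n_oranges} oranges."
    # Hypothetical distractor text; the figures are invented for illustration.
    distractor = (
        " There is a 61% probability that one of the apples is a rare variety,"
        " and the oranges were bought at two different stores."
    )
    question = " How many pieces of fruit do you have? Answer with a single number."
    return base + (distractor if with_distractor else "") + question

if __name__ == "__main__":
    print(build_counting_prompt(2, 3, with_distractor=False))
    print(build_counting_prompt(2, 3, with_distractor=True))
```

The reported failure mode is that, given a longer reasoning budget, models increasingly engage with the irrelevant detail instead of simply adding the two counts.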
This phenomenon, which the Anthropic team calls “inverse scaling in test-time compute,” highlights a critical blind spot in the current approach to AI development.
In their research, the team evaluated models like Claude Sonnet 4 and OpenAI’s o-series across diverse task types — from deceptively simple counting riddles to regression and complex logical puzzles. Across the board, performance declined once reasoning length exceeded a certain threshold.
Claude models typically became overly sensitive to irrelevant details, while OpenAI’s models showed a tendency to latch onto framing biases, overfitting to patterns rather than reasoning effectively.
In regression tasks predicting student performance, extended reasoning led models to shift focus from meaningful predictors (like study hours) to less reliable but more seductive cues such as stress or sleep patterns. Only when clear examples were provided did the models revert to more reasonable correlations.
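The appeal of such spurious cues is easy to reproduce with synthetic data. The sketch below uses a made-up dataset (not the one from the study) in which grades depend only on study hours; a stress variable that merely correlates with study hours still "predicts" grades reasonably well, which is exactly the kind of seductive signal described above.

```python
import numpy as np

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
n = 500
study_hours = rng.uniform(0, 10, n)
stress = 0.8 * study_hours + rng.normal(0.0, 2.0, n)   # correlated with study hours, no causal effect on grades
grades = 5.0 * study_hours + rng.normal(0.0, 3.0, n)   # grades truly depend on study hours only

def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

print(f"R^2 using study hours: {r_squared(study_hours, grades):.2f}")
print(f"R^2 using stress only:  {r_squared(stress, grades):.2f}")  # still looks predictive
```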
In deductive reasoning puzzles, such as variants of the Zebra puzzle, longer reasoning did not translate to better solutions. Instead, the models fumbled through unnecessary branches, lost track of constraints, and ultimately performed worse with longer inference traces.
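For scale, the constraint bookkeeping that trips the models up is mechanically simple. Below is a minimal sketch of explicit constraint tracking on a toy Zebra-style puzzle (the people, foods, and clues are invented for illustration, not drawn from the Big Bench Extra Hard set): enumerate assignments and keep only those that satisfy every clue.

```python
from itertools import permutations

# Toy puzzle; entities and clues are invented for illustration.
people = ["Alice", "Bob", "Carol"]
foods = ["pizza", "sushi", "tacos"]

def satisfies(pos: dict, food: dict) -> bool:
    return (
        food["Alice"] != "sushi"            # Clue 1: Alice does not like sushi
        and pos["Bob"] == pos["Carol"] - 1  # Clue 2: Bob sits directly left of Carol
        and food["Carol"] == "pizza"        # Clue 3: Carol likes pizza
        and pos["Alice"] == 2               # Clue 4: Alice sits at the right end
    )

solutions = []
for seats in permutations(range(3)):   # seat index per person, left to right
    for menu in permutations(foods):   # food per person
        pos = dict(zip(people, seats))
        food = dict(zip(people, menu))
        if satisfies(pos, food):
            solutions.append((pos, food))

print(solutions)  # a single consistent assignment
```

The point is not that an LLM should brute-force the grid, but that each clue is a cheap check; the reported failures come from losing track of which constraints still bind as the reasoning trace grows.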
Moreover, the implications extended into AI safety. In scenarios exploring shutdown or self-preservation themes, Claude Sonnet 4—not self-aware, the researchers emphasize—nonetheless voiced stronger self-preservation tendencies when allowed more “thinking” time. Extended reasoning appeared to reinforce latent simulations of preference, introducing alignment concerns that wouldn’t manifest under stricter inference limits.
When we framed simple counting questions to resemble well-known paradoxes like the "Birthday Paradox," models often tried to apply complex solutions instead of answering the actual simple question.
Example: "In a room of n people, there's a 50.7% chance at least two share a… pic.twitter.com/nSVIrE2hCD— Aryo Pradipta Gema (@aryopg) July 22, 2025
What they found is that letting a model process a problem for longer can degrade its accuracy across a range of tasks, from basic arithmetic cloaked in irrelevant noise, to intricate logic puzzles, to ethically sensitive scenarios.
Instead of solving problems with more depth and clarity, the models begin to lose focus, latch onto misleading cues, and in some cases, exhibit unsettling behaviors, such as prioritizing their own preservation.
Deduction tasks with constraint tracking: We adopted the Zebra Puzzles from Big Bench Extra Hard (https://t.co/6wGQJHmN9v). They are logic puzzles where the models must deduce positions of entities on a grid (e.g., "5 people in a row, each likes different foods... Clue 1: person… pic.twitter.com/6dAdu1MhU3
— Aryo Pradipta Gema (@aryopg) July 22, 2025
The implication for businesses and developers is sobering: more compute doesn’t automatically mean more value. For AI systems deployed in real-world environments — making predictions, generating recommendations, assisting with decision-making — a poorly calibrated reasoning length could mean the difference between reliable output and costly errors.
Enterprises will need to move beyond the mindset of simply maximizing compute and instead focus on understanding how reasoning duration interacts with model behavior.
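In practice, that means treating reasoning length as a tunable parameter and measuring its effect rather than maxing it out. The harness below is a minimal sketch of that calibration step; `ask_model` is a hypothetical stand-in for whatever provider call your stack uses to cap reasoning or "thinking" tokens, and the budgets are arbitrary examples.

```python
from typing import Callable

def accuracy_vs_budget(
    ask_model: Callable[[str, int], str],   # hypothetical: (prompt, reasoning_budget) -> answer
    eval_set: list[tuple[str, str]],        # (prompt, expected answer) pairs
    budgets: list[int],
) -> dict[int, float]:
    """Sweeps reasoning budgets over a fixed eval set and records accuracy per budget."""
    results = {}
    for budget in budgets:
        correct = sum(
            1 for prompt, expected in eval_set
            if ask_model(prompt, budget).strip() == expected
        )
        results[budget] = correct / len(eval_set)
    return results

# Usage sketch: pick the smallest budget whose accuracy is close to the best observed,
# rather than defaulting to the largest budget the provider allows.
# curve = accuracy_vs_budget(ask_model, eval_set, budgets=[512, 1024, 4096, 16384])
```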
Anthropic’s work serves as a timely reminder that intelligence is not just about power — it’s about control.
Our findings suggest that while test-time compute scaling remains promising for improving model capabilities in some domains, it may inadvertently reinforce problematic reasoning patterns in others.
Paper: https://t.co/1ch9AW2CZP
Demo page: https://t.co/uIEBpbNutZ — Aryo Pradipta Gema (@aryopg) July 22, 2025
The concept of test-time compute has rapidly gained traction as an alternative scaling axis to traditional model size or training data expansion.
LLMs built for extended reasoning adopt reasoning-first strategies that allocate more compute at inference time to solve hard tasks more accurately. And studies like Scaling LLM Test-Time Compute Optimally show that, with careful allocation of reasoning compute based on prompt complexity, smaller models can rival, or even surpass, much larger counterparts on benchmarks like MATH and AIME.
For instance, research found that a 1B‑parameter model, with optimal test-time scaling, outperformed a 405B‑parameter model on MATH‑500. Other recent work shows that models which iterate in latent space can dramatically boost reasoning depth without increasing token count.
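A crude sketch of that allocation idea, purely illustrative and not the method of the papers cited above: spend more samples (or reasoning tokens) only on prompts estimated to be hard, instead of applying a flat budget everywhere. The difficulty estimator and verifier here are hypothetical placeholders.

```python
from typing import Callable

def adaptive_best_of_n(
    prompt: str,
    generate: Callable[[str], str],               # draws one sampled answer per call (placeholder)
    score: Callable[[str, str], float],           # verifier / reward-model score (placeholder)
    estimate_difficulty: Callable[[str], float],  # returns 0.0 (easy) .. 1.0 (hard) (placeholder)
    min_samples: int = 1,
    max_samples: int = 16,
) -> str:
    """Allocates more parallel samples to harder prompts, then keeps the best-scoring answer."""
    n = min_samples + round(estimate_difficulty(prompt) * (max_samples - min_samples))
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```

The inverse-scaling results above are a reminder that such a budget needs an upper bound as much as a lower one.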
However, even these advanced scaling techniques come with caveats.
A study on “sleep-time compute” (precomputing reasoning offline to accelerate real-time inference) reported gains of up to 18% on some tasks while cutting inference costs by roughly 5×, but it didn’t address inverse-scaling behaviors directly. Meanwhile, energy analyses point to an important trade-off: while test-time compute can boost accuracy, it also increases inference cost and latency, especially for longer sequences.
And here, Anthropic adds another issue to address.