
In the whirlwind of AI breakthroughs that define the modern era, Anthropic's Claude Mythos has emerged as one of the most intriguing enigmas.
Released just weeks ago as a preview model within the Claude family, Mythos quickly turned heads for its uncanny prowess in cybersecurity tasks: identifying vulnerabilities, crafting exploits, and reasoning through multi-step security challenges with a level of finesse that outpaces anything the company had previously built.
Yet Anthropic offered no deep technical paper, no architecture breakdown, leaving the research community hungry for answers.
What exactly powers this leap in capability?
Is it simply more parameters and data, or something fundamentally smarter?
Now, thanks to an audacious open-source effort, a compelling hypothesis has been rendered in clean, runnable PyTorch code: 'OpenMythos,' a first-principles reconstruction suggesting that Mythos may not follow the conventional transformer playbook at all.
2 /
I hypothesize that Mythos is a Recurrent-Depth Transformer (RDT), a class of looped transformer in which a fixed set of weights is applied iteratively across T loop steps within a single forward pass.
Crucially, reasoning occurs entirely in continuous latent space. There is… pic.twitter.com/SRjJAjW0qo
— Kye Gomez (swarms) (@KyeGomezB) April 19, 2026
At its heart, OpenMythos flips the script on how we scale intelligence in large language models.
Traditional transformers, like those powering GPTs, LLaMAs, or even earlier Claudes, rely on stacking layer upon layer, each with its own unique weights. Making them smarter means adding more parameters, more layers, and more compute during training.
OpenMythos, created by Kye Gomez and released on GitHub, proposes instead that Claude Mythos belongs to a rarer breed known as Recurrent-Depth Transformers, or looped transformers.
Here, a compact set of weights is reused iteratively: up to 16 times in a single forward pass within a carefully structured recurrent block.
The model doesn't grow wider or deeper by duplicating parameters; it thinks harder by looping the same computation, injecting the original input at every step and refining its hidden state progressively. The result is a lean 770-million-parameter model that, when trained on the same data as a standard transformer, delivers performance on par with a 1.3-billion-parameter counterpart.
That's roughly half the memory footprint for equivalent quality, a tantalizing efficiency gain that challenges the "bigger is always better" dogma.
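The core idea, stripped of every other refinement, fits in a few lines. The sketch below shows one weight-tied block applied repeatedly with the original input re-injected at every step; class and method names here are illustrative, not OpenMythos's actual API.

```python
import torch
import torch.nn as nn

class RecurrentDepthBlock(nn.Module):
    """Minimal sketch of a recurrent-depth loop: one shared set of
    weights applied n_loops times, re-injecting the prelude embedding
    at every step. Names are hypothetical, not the project's API."""

    def __init__(self, d_model: int, n_loops: int = 16):
        super().__init__()
        self.n_loops = n_loops
        # One block of weights, reused at every loop step.
        self.step = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.inject = nn.Linear(d_model, d_model, bias=False)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        h = torch.zeros_like(e)            # latent state
        for _ in range(self.n_loops):      # "think harder", not deeper
            # Re-inject the original input embedding at every step.
            h = h + self.step(h + self.inject(e))
        return h

block = RecurrentDepthBlock(d_model=64, n_loops=4)
out = block(torch.randn(2, 10, 64))
```

Note that `n_loops` never touches the parameter count: a 4-loop and a 16-loop instance store exactly the same weights, which is the source of the memory savings described above.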
5 /
On parameter efficiency: a looped model with k layers run L times achieves the quality of a kL-layer standard transformer with only k layers of parameters.
Empirically (Parcae, Prairie et al., 2026): at 770M parameters, an RDT matches a 1.3B standard model on the same… pic.twitter.com/cmeKiQiDiJ
— Kye Gomez (swarms) (@KyeGomezB) April 19, 2026
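The k-versus-kL claim in the tweet above is simple arithmetic. The figures below are illustrative placeholders, not the real model configurations:

```python
# Back-of-envelope check of the "k layers looped L times matches a
# kL-layer model" claim. All numbers here are made up for illustration.
d_model = 1536
per_layer = 12 * d_model ** 2        # rough attention + FFN params per layer
k, L = 8, 4                          # k weight-tied layers, looped L times

looped_params = k * per_layer        # parameters actually stored on disk
standard_params = k * L * per_layer  # the kL-layer model it should match

print(looped_params / standard_params)  # 0.25 -- i.e. 1/L of the storage
```

Whatever the exact per-layer count, the ratio is always 1/L, which is where the "half the memory for equivalent quality" figure comes from at the cited scale.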
The architecture itself is elegantly modular, built around three phases: a one-time Prelude that encodes the input, a looped Recurrent Block that does the heavy lifting, and a final Coda that produces the output.
Inside the recurrent loop, stability is the name of the game: looped models have historically struggled here, as hidden states can explode or drift into nonsense after too many iterations. OpenMythos tackles this with Linear Time-Invariant (LTI) injection constraints, borrowed from recent research, which keep the dynamics of the carry-forward matrices bounded.
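One simple way to see why bounding the carry-forward matrix helps is to constrain its spectral norm, so the linearized recurrence h ← Wh + Ue cannot blow up no matter how many loops run. This is a stand-in for the paper's actual LTI construction, which may differ:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

d = 64
# Constrain the carry matrix W to spectral norm ~1 so repeated
# application cannot amplify the latent state without bound.
W = spectral_norm(nn.Linear(d, d, bias=False))
U = nn.Linear(d, d, bias=False)   # input-injection matrix

e = torch.randn(1, d)             # prelude embedding, injected every step
h = torch.zeros(1, d)             # latent state
with torch.no_grad():
    for _ in range(100):          # far more loop steps than training uses
        h = W(h) + U(e)
```

An unconstrained random W would typically send h to overflow within a few dozen iterations; with the norm constraint, the state stays finite even at 100 steps.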
It also introduces Adaptive Computation Time, a learned mechanism that lets the model decide on the fly when to stop looping per token, preventing "overthinking" where extra iterations actually degrade results.
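A per-token halting mechanism in this spirit can be sketched as follows, loosely following Adaptive Computation Time (Graves, 2016); the class and its remainder bookkeeping are simplified illustrations, not OpenMythos's mechanism.

```python
import torch
import torch.nn as nn

class ACTLoop(nn.Module):
    """Sketch of per-token adaptive loop depth: a learned halting head
    decides, token by token, when further iterations stop helping."""

    def __init__(self, d_model: int, max_loops: int = 16, eps: float = 0.01):
        super().__init__()
        self.halt = nn.Linear(d_model, 1)   # per-token halting probability
        self.max_loops = max_loops
        self.eps = eps

    def forward(self, h: torch.Tensor, step_fn) -> torch.Tensor:
        cum = torch.zeros(h.shape[:-1])     # cumulative halting mass per token
        out = torch.zeros_like(h)
        alive = torch.ones_like(cum, dtype=torch.bool)
        for _ in range(self.max_loops):
            h = step_fn(h)
            p = torch.sigmoid(self.halt(h)).squeeze(-1)
            halting = alive & (cum + p >= 1 - self.eps)   # tokens stopping now
            w = torch.where(halting, 1 - cum, p) * alive  # ACT remainder trick
            out = out + w.unsqueeze(-1) * h               # weighted state mix
            cum = cum + w
            alive = alive & ~halting
            if not alive.any():                           # everyone halted
                break
        return out + (1 - cum).unsqueeze(-1) * h          # close leftover mass

act = ACTLoop(d_model=32, max_loops=8)
out = act(torch.randn(2, 5, 32), step_fn=nn.Linear(32, 32))
```

Each token's output is a convex mix of its intermediate states, so a token that halts early simply stops accumulating further iterations: exactly the "don't overthink" behavior described above.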
The feed-forward network inside the loop draws inspiration from DeepSeek's Mixture-of-Experts design: a vast pool of specialized experts, sparsely activated via a router that picks different subsets at each depth level, plus always-on shared experts for common patterns.
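A routed-plus-shared expert layer of this flavor can be sketched as below. For clarity every expert is computed densely and then masked, which a real implementation would avoid; names and sizes are illustrative, not DeepSeek's or OpenMythos's actual configuration.

```python
import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    """Sketch of an MoE FFN with top-k routed experts plus always-on
    shared experts. Dense compute with masking, for readability only."""

    def __init__(self, d_model: int, n_experts: int = 8,
                 n_shared: int = 2, top_k: int = 2):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, 2 * d_model),
                                 nn.GELU(),
                                 nn.Linear(2 * d_model, d_model))
        self.experts = nn.ModuleList(expert() for _ in range(n_experts))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x).softmax(dim=-1)           # (B, T, E)
        topv, topi = scores.topk(self.top_k, dim=-1)      # per-token choices
        out = sum(s(x) for s in self.shared)              # always-on experts
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (topi[..., k] == e).unsqueeze(-1)  # tokens routed to e
                out = out + mask * topv[..., k:k + 1] * expert(x)
        return out

moe = SharedExpertMoE(d_model=32, n_experts=4, n_shared=1, top_k=2)
y = moe(torch.randn(2, 5, 32))
```

Because the router sees the hidden state at every loop iteration, the same layer can activate different expert subsets at different depths, which is the behavior the article attributes to Mythos.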
Attention is handled via Multi-Latent Attention, compressing key-value caches dramatically for better inference scaling.
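The cache-compression idea behind this can be illustrated in isolation: derive keys and values from a small shared latent, and cache only the latent. This is a heavily simplified sketch of the concept, not the full Multi-Latent Attention layer.

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Sketch: keys/values reconstructed from a low-dimensional latent,
    so inference caches (B, T, d_latent) instead of (B, T, 2 * d_model)."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # cached part
        self.up_k = nn.Linear(d_latent, d_model, bias=False)
        self.up_v = nn.Linear(d_latent, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        c = self.down(x)            # (B, T, d_latent): the only cached tensor
        return self.up_k(c), self.up_v(c)

mla = LatentKVCompression(d_model=64, d_latent=16)
k_out, v_out = mla(torch.randn(2, 10, 64))
```

With d_latent = 16 against 2 × 64 for a plain K/V cache at this toy size, the cache shrinks by 8x, which is the "dramatic compression" the article refers to.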
Even clever depth-wise LoRA adapters sneak in subtle behavioral tweaks across loop iterations without bloating the parameter count.
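The depth-wise adapter idea can be sketched as one shared linear layer plus a tiny low-rank correction per loop step, so behavior varies across iterations at negligible parameter cost. Everything below is a hypothetical illustration of the pattern, not the repository's code.

```python
import torch
import torch.nn as nn

class DepthwiseLoRA(nn.Module):
    """Sketch: shared base weights plus a per-depth low-rank (LoRA)
    delta, letting each loop iteration behave slightly differently."""

    def __init__(self, d_model: int, n_loops: int = 16, rank: int = 2):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        # One rank-`rank` adapter pair per loop depth; B starts at zero
        # so every depth initially matches the shared base layer.
        self.A = nn.Parameter(torch.randn(n_loops, d_model, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_loops, rank, d_model))

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        delta = (x @ self.A[depth]) @ self.B[depth]  # low-rank update
        return self.base(x) + delta

lora = DepthwiseLoRA(d_model=32, n_loops=4, rank=2)
y0 = lora(torch.randn(2, 5, 32), depth=0)
```

At rank 2 the adapters for all four depths together cost a small fraction of the base layer's parameters, which is how per-depth behavior sneaks in "without bloating the parameter count."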
6 /
OpenMythos contributes:
1. A fully open, configurable PyTorch implementation of the RDT hypothesis with MoE FFN and Multi-Latent Attention
2. LTI-stable recurrent injection (Parcae) integrated as a first-class training primitive
3. Depth-wise LoRA adapters enabling… pic.twitter.com/sSX2FHPWgy
— Kye Gomez (swarms) (@KyeGomezB) April 19, 2026
What makes OpenMythos unique isn't just the clever engineering but the philosophy behind it.
OpenMythos isn't a leaked weights dump or a distilled copy of Mythos. Instead, it is a falsifiable hypothesis built from peer-reviewed papers on recurrent architectures, MoE routing, and inference-time scaling.
Gomez has open-sourced the full PyTorch implementation, complete with reproducible training baselines, so anyone can experiment, train, or extend it.

Early buzz highlights how the project reframes the entire scaling conversation: reasoning depth becomes a function of inference-time compute rather than pre-trained parameter volume. Train once with a fixed set of weights, then at runtime decide how deeply to think. Power laws emerge that mirror those of standard models but with far leaner storage, opening doors for deployment on everything from edge devices to massive inference farms.
Of course, this remains a reconstruction, not confirmation.
Anthropic has stayed characteristically tight-lipped about Mythos's internals, and OpenMythos stands as an invitation for the community to test the theory in the wild.
Whether it perfectly mirrors the real Mythos or simply pioneers a new family of efficient architectures, the project has already delivered something precious: a concrete, hackable playground for studying looped dynamics, expert routing, and emergent multi-step reasoning.
7 /
This is an open research effort. We welcome contributions on training stability, scaling experiments, loop depth analysis, and alternative attention mechanisms.
If you work on recurrent transformers, MoE, or inference-time scaling, we would value your involvement.
Repo →…
— Kye Gomez (swarms) (@KyeGomezB) April 19, 2026