How 'Harness-1' Tries To Reinvent Search Using An AI Agent That Never Forgets

Imagine trying to chase down a question that refuses to stay simple. It starts with one fact on a webpage, branches into a financial report, and loops back through a patent filing.

As one goes deeper into the rabbit hole, one of the most difficult parts is keeping track of what has been verified while deciding what to explore next. Most users of large language model (LLM) chatbots have felt the frustration when their AI assistant begins to repeat itself, forgets an earlier source, or simply gives up after a handful of steps.

That is the quiet limitation hiding behind the polished interfaces of today's search tools.

'Harness-1' wants to change that.

Built by Patrick Jiang on a twenty-billion-parameter model, the search agent offers a different path through that complexity.

Instead of forcing the model to hold every detail in its own working memory, the system places the heavy lifting of organization outside the model itself.

Introducing Harness-1, a 20B search agent trained with a state-externalizing harness.

> frontier-level long-horizon search, rivaling Opus-4.6 and outperforming GPT-5.4

> Context-1-level cost and latency

> externalizes candidates, evidence, verification, and search history

>… pic.twitter.com/DWihM4c6Ii
— Patrick Jiang (@patpcj) June 6, 2026

The core policy, the part that actually decides what to do next, stays focused on high-level judgments: which query feels right, which document deserves closer attention, which claim needs checking, and when the search has truly reached its end.

Everything else lives in a separate harness, a structured workspace that never forgets and never lets the context balloon out of control.

At the heart of this workspace sits a carefully managed pool of candidate documents.

Incoming results are automatically compressed, sentences are scored for relevance, duplicates are removed through content fingerprints, and the whole collection is kept tidy.

A curated set holds no more than thirty entries at any time, each tagged with an importance level that the agent can adjust on the fly.
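
The pool behavior described above can be sketched in a few lines. This is a minimal illustration, not the released Harness-1 code: the names `fingerprint` and `CuratedSet`, the SHA-256 hash over normalized text, and the eviction rule for a full set are all assumptions made for the example. Only the cap of thirty entries and the adjustable importance tags come from the description.

```python
import hashlib
from dataclasses import dataclass, field

MAX_CURATED = 30  # cap stated in the article


def fingerprint(text: str) -> str:
    """Hash of case- and whitespace-normalized text, used to spot duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class CuratedSet:
    # fingerprint -> (importance, text); hypothetical layout
    entries: dict = field(default_factory=dict)

    def add(self, text: str, importance: int) -> bool:
        """Add a document; return False if it is a duplicate or is rejected."""
        fp = fingerprint(text)
        if fp in self.entries:
            return False
        if len(self.entries) >= MAX_CURATED:
            # Assumed policy: evict the least important entry, but only
            # when the newcomer is strictly more important.
            weakest = min(self.entries, key=lambda k: self.entries[k][0])
            if self.entries[weakest][0] >= importance:
                return False
            del self.entries[weakest]
        self.entries[fp] = (importance, text)
        return True

    def set_importance(self, text: str, importance: int) -> None:
        """Promote or demote an entry that is already in the set."""
        fp = fingerprint(text)
        _, stored = self.entries[fp]
        self.entries[fp] = (importance, stored)
```

Because the fingerprint is computed on normalized text, two copies of a page that differ only in spacing or capitalization collapse into one entry. Near-duplicates with small wording changes would need a fuzzier scheme than this exact hash.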

[2/N] The usual search-agent setup is basically:

search → read → search → read → keep appending everything to the transcript.

At some point the model is not just “searching” anymore.

It is also being asked to be a memory system, a note taker, a verifier, and a librarian.
— Patrick Jiang (@patpcj) June 6, 2026

There is also a full-text memory store, an evidence graph that draws connections between names, dates, and cross-referenced facts, a log of every verification performed, and a compressed history of past searches.

When the model receives its next observation, it does not wade through a long rambling transcript.

It sees a clean, readable summary of the current state, complete with budget markers that show how much context remains available.
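
To make the idea of a state summary with budget markers concrete, here is a small sketch of how such an observation might be rendered from external state. The function name `render_state`, the bracketed labels, and the choice to show only the three most recent queries are hypothetical; the actual format used by Harness-1 is not given in this article.

```python
def render_state(curated, recent_queries, used_tokens, budget_tokens):
    """Build a compact text observation from external state (hypothetical format).

    curated: list of (doc_id, importance, title) tuples.
    """
    remaining = max(budget_tokens - used_tokens, 0)
    lines = [f"[budget] {remaining}/{budget_tokens} tokens remaining"]
    lines.append(f"[curated] {len(curated)} documents")
    # Most important first; sorted() is stable, so ties keep insertion order.
    for doc_id, importance, title in sorted(curated, key=lambda d: -d[1]):
        lines.append(f"  {doc_id} (importance={importance}): {title}")
    lines.append("[recent queries] " + "; ".join(recent_queries[-3:]))
    return "\n".join(lines)
```

The point of the sketch is that the observation size depends on the size of the curated state, not on how many turns have passed, which is what keeps the context from growing with every step.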

The agent communicates with this workspace through a small set of precise actions.

It can issue a search and have the results filtered and deduplicated before they even reach the candidate pool. It can promote or demote documents, trigger a verification step on a specific claim, or review a single piece of evidence without reloading the entire conversation.
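
One way to picture this action interface is a dispatcher that applies each action to the workspace and returns a short observation. The sketch below is an assumption-heavy illustration: the `Workspace` fields, the action names (`search`, `set_importance`, `verify`, `review`), and the dictionary format are invented for the example, and the substring check is only a toy stand-in for whatever verification the real system performs.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    candidates: dict = field(default_factory=dict)     # doc_id -> text
    importance: dict = field(default_factory=dict)     # doc_id -> level
    verifications: list = field(default_factory=list)  # (claim, doc_id, supported)
    history: list = field(default_factory=list)        # queries issued so far


def apply_action(ws, action, search_fn):
    """Apply one action to the workspace and return a short observation string."""
    kind = action["type"]
    if kind == "search":
        ws.history.append(action["query"])
        added = 0
        for doc_id, text in search_fn(action["query"]):
            if doc_id not in ws.candidates:  # skip documents already in the pool
                ws.candidates[doc_id] = text
                ws.importance[doc_id] = 0
                added += 1
        return f"added {added} new candidates"
    if kind == "set_importance":  # covers both promote and demote
        ws.importance[action["doc_id"]] = action["level"]
        return f"{action['doc_id']} importance set to {action['level']}"
    if kind == "verify":
        # Toy check: a case-insensitive substring match stands in for real verification.
        text = ws.candidates[action["doc_id"]]
        supported = action["claim"].lower() in text.lower()
        ws.verifications.append((action["claim"], action["doc_id"], supported))
        return "supported" if supported else "not supported"
    if kind == "review":
        return ws.candidates[action["doc_id"]]
    raise ValueError(f"unknown action type: {kind}")
```

Every action leaves a trace in the workspace (history, importance, verification log), so the model can be shown the consequences of earlier steps without replaying the conversation that produced them.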

[4/N] Harness-1 tries to separate these two jobs.

The model still makes the semantic decisions:
what to search, what to read, what to keep, what to verify, when to stop.

But the harness maintains the recoverable state around those decisions.
— Patrick Jiang (@patpcj) June 6, 2026

Because the state is external and editable, the model never has to reconstruct its own history from scratch.

That single change turns a brittle, linear process into something more like working inside an organized digital notebook that updates itself in real time.

Training the agent followed a measured two-stage approach.

A brief period of supervised fine-tuning on just 899 high-quality trajectories taught the basics: how to use the tools, how to assign importance tags, how to keep the rhythm of curation steady, and how to know when to stop. Reinforcement learning then took over on 3,453 queries drawn mostly from finance and open-web domains.

[6/N] I think this changes what RL is actually learning.

Instead of training the model to survive a giant append-only transcript, we train it to use a structured search interface:
search, curate, revisit, verify, and submit.

Much closer to how I’d want a search agent to work.
— Patrick Jiang (@patpcj) June 6, 2026

Rewards were shaped to value accurate final answers, thorough coverage, and efficient use of turns.
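
A shaped reward of this kind is often written as a weighted sum of its parts. The sketch below shows that general pattern only. The function name `shaped_reward`, the three weights, and the exact definitions of coverage and efficiency are assumptions for illustration; the article does not state the actual formula or weights used to train Harness-1.

```python
def shaped_reward(answer_correct, found, relevant, turns_used, max_turns,
                  w_answer=1.0, w_coverage=0.5, w_efficiency=0.1):
    """Combine answer accuracy, coverage, and turn efficiency (hypothetical weights).

    found / relevant: sets of document ids; max_turns must be positive.
    """
    # Coverage: share of the relevant documents that the agent actually found.
    coverage = len(found & relevant) / len(relevant) if relevant else 0.0
    # Efficiency: share of the turn budget left unused, clipped at zero.
    efficiency = 1.0 - min(turns_used, max_turns) / max_turns
    return (w_answer * float(answer_correct)
            + w_coverage * coverage
            + w_efficiency * efficiency)
```

Keeping the efficiency weight small relative to the other two, as assumed here, avoids rewarding an agent for stopping early with a thin evidence set.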

The modest data budget worked because the harness already supplied most of the scaffolding; the model only needed to learn the judgment calls.

When tested across eight demanding benchmarks spanning web search, finance, patents, and multi-hop question answering, Harness-1 reached an average curated recall of 0.730.

That result sits ahead of the best prior open-source search agents by more than eleven points and remains competitive with far larger frontier models that carry their own elaborate scaffolding.
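
The article does not define curated recall, so the following is a sketch under one plausible reading: for each query, the fraction of known-relevant documents that end up in the agent's final curated set, averaged within each benchmark and then across benchmarks. The function names and the averaging scheme are assumptions, not the paper's evaluation code.

```python
def curated_recall(curated_ids, gold_ids):
    """Fraction of gold (known-relevant) documents present in the curated set."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(curated_ids)) / len(gold)


def macro_average(per_benchmark):
    """Average per-benchmark means so that each benchmark counts equally.

    per_benchmark: dict mapping benchmark name -> list of per-query recall values.
    """
    means = [sum(v) / len(v) for v in per_benchmark.values() if v]
    return sum(means) / len(means) if means else 0.0
```

Under this reading, a score of 0.730 would mean that roughly 73 percent of the relevant documents were in the curated set on average, although the exact averaging used in the reported figure may differ.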

[8/N] The result that made me most excited is transfer.

Harness-1 improves over Context-1 by +7.9 recall points on source-family benchmarks.

But on held-out transfer benchmarks, the gain is +17.0 points.

That’s the part that made the idea feel real to me. pic.twitter.com/hWbl8oYEtF
— Patrick Jiang (@patpcj) June 6, 2026

The advantage over Context-1 widened to seventeen points on held-out tasks that the agent had never seen during training, suggesting the behaviors learned inside the harness transfer cleanly to new territory.

Removing individual components of the harness, like importance tags, the evidence graph, automatic seeding, or verification, produced noticeably shallower searches and weaker final results, confirming that the architecture itself drives much of the gain.

What makes Harness-1 feel different from everyday search engines is the shift from static lookup to dynamic, guided exploration.

Traditional engines excel at matching keywords against pre-indexed pages and surfacing the most popular matches, yet they stop short of reasoning across sources, curating evidence, or correcting course when early results mislead.

Many newer AI-powered search tools still rely on stuffing the full conversation history back into the model at every turn, which quickly leads to the same memory overload that plagues simpler agents.

By contrast, the external harness keeps the policy light and the workspace reliable, allowing the search to stay coherent over dozens of steps and recover gracefully from early missteps.

[10/N] My takeaway:

for search agents, “the model” is not the whole learning system.

The interface matters.
The memory layout matters.
The action space matters.
The harness matters.

If we want RL to teach better search behavior, we should probably stop making the model do all…
— Patrick Jiang (@patpcj) June 6, 2026

The larger insight here is that the right interface can reshape what an agent actually learns.

Rather than training the model to simulate perfect recall or flawless long-term planning from raw context alone, Harness-1 supplies an observable and editable memory layer that supports backtracking, refinement, and explicit tracking.

The result is more systematic evidence gathering, clearer multi-hop reasoning, and a search process that feels less like a black box and more like a methodical collaborator. With the model weights, code, and accompanying paper now released in full, Harness-1 stands as a practical demonstration that thoughtful separation of concerns can extend agent capabilities without simply scaling up parameter counts.

However, this kind of search agent does have some weaknesses.

Paper : https://t.co/liBVj5wdBy
Code : https://t.co/PkJL83u95e
Model : https://t.co/dteF6WrRcq
HF Paper: https://t.co/pyFygfGdYQ
— Patrick Jiang (@patpcj) June 6, 2026

For example, deploying the agent requires more setup than a simple model call or off-the-shelf search API, and integrating it into existing pipelines can introduce engineering overhead. The twenty-billion-parameter policy, while efficient, remains bounded by the knowledge and reasoning depth of its base size, so extremely specialized or rapidly evolving domains may expose gaps that larger models fill more readily.

Regardless, Harness-1 represents a practical step in rethinking how search agents are built.

By separating semantic decisions from routine state management, it shows that interface design and external scaffolding can unlock capabilities that raw model scale alone struggles to reach.

Published: 06/06/2026
