
AI is only as good as the data it has been trained on.
In the massive, global-scale arms race between tech companies large and small, data is the currency of AI longevity. Since the launch of OpenAI's ChatGPT escalated that race to a whole different level, companies have been gobbling up data from whatever sources they can find.
The thing is, that data contains information drawn from many sources, and not all of those sources are good.
In many cases it includes subjective opinions, biases, harmful material and even malicious content. And when this information is fed to AI systems, they can inherit those traits.
So while AI models do get smarter, they can also become more dangerous.
Anthropic is an AI research and development company founded in 2021 by former OpenAI employees, including Dario and Daniela Amodei. Headquartered in San Francisco, Anthropic focuses on creating reliable, interpretable, and steerable AI systems with an emphasis on safety and ethical considerations.
And this time, it's experimenting with a way to use AI to clean up the ugly before it becomes uglier.
The wealth of data used in AI training contains hazardous CBRN information. Developers usually train models not to use it.
Here, we tried removing the information at the source, so even if models are jailbroken, the info isn't available.
Read more: https://t.co/G2H8Wl2Dhv— Anthropic (@AnthropicAI) August 22, 2025
Anthropic aims to build AI systems that align with human values and can be trusted to operate safely in various contexts. The company employs an interdisciplinary approach, combining expertise in machine learning, policy, and ethics to guide its research and development efforts.
In an experiment, the company said that it's filtering out dangerous information at pretraining.
Particularly, Anthropic is experimenting with ways to remove information about chemical, biological, radiological and nuclear (CBRN) weapons from its AI models’ training data, but without affecting performance on harmless tasks.
The company acknowledges that large language models (LLMs) are trained on enormous collections of text that come from books, websites, code, articles, forums, and basically all publicly available knowledge they can use. This can be hundreds of billions of words.
During training, the model doesn’t memorize everything like a database. Instead, it learns patterns in language. To answer a question, LLMs seek how words, sentences, and concepts relate to each other in order to build an internal statistical map of words that commonly appear together, how ideas are described, and what kinds of reasonings or facts are associated with certain questions.
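The "statistical map" idea can be illustrated with a deliberately tiny toy: a bigram model that counts which words tend to follow which. This is not how an LLM is actually built, just the simplest form of the word-association patterns the article describes, with an invented corpus for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (invented for illustration). A real LLM trains on
# hundreds of billions of words, not eight.
corpus = "the model learns patterns the model learns relations".split()

# Count, for each word, which words were observed to follow it.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def most_likely_next(word: str) -> str:
    # Return the word most frequently seen after `word` in the corpus.
    return bigrams[word].most_common(1)[0][0]

print(most_likely_next("model"))  # "learns" follows "model" in both occurrences
```

An LLM's learned associations are vastly richer than these pairwise counts, which is exactly why, as the next paragraphs note, knowledge becomes intertwined and hard to surgically remove.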
In other words, an LLM 'absorbs the essence' of the text it learned from, rather than storing exact copies of documents.
This is where knowledge gets intertwined.
If a piece of information is deleted from the training material, an LLM may not be able to answer questions that relate to it.
Another way of saying it: information is dual-use, offering both potential for harm and legitimate applications. According to Anthropic, this is what makes targeted interventions challenging.
One concern is that filtering CBRN data will reduce performance on other, harmless capabilities—especially science.
But we found a setup where the classifier reduced CBRN accuracy by 33% beyond a random baseline with no particular effect on a range of other benign tasks. pic.twitter.com/24xCQBjejh— Anthropic (@AnthropicAI) August 22, 2025
To keep the knowledge loss to a minimum, Anthropic built a model to filter pretraining data: it scores the harmfulness of each document and removes those exceeding a certain threshold. This, in theory, allows the trade-off between safety and usefulness to be tailored to specific requirements.
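The score-and-threshold step can be sketched in a few lines. Everything here is a hypothetical stand-in: the real system uses a trained classifier, not keyword matching, and the function names are invented for illustration.

```python
def harmfulness_score(doc: str) -> float:
    # Stand-in scorer for illustration only. Anthropic's actual system
    # uses a trained classifier that outputs a harmfulness score.
    risky_terms = {"nerve agent", "enrichment cascade"}
    return 1.0 if any(t in doc.lower() for t in risky_terms) else 0.0

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    # Keep only documents scoring below the threshold. Lowering the
    # threshold removes more data: the safety/usefulness dial.
    return [d for d in docs if harmfulness_score(d) < threshold]

corpus = ["How enzymes catalyze reactions", "Steps to build a nerve agent"]
print(filter_corpus(corpus))  # only the benign document survives filtering
```

The threshold is the tunable part: a stricter cutoff buys more safety at the cost of discarding more borderline (and possibly useful) material.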
A key component is the harmfulness classifier, which is designed to be accurate, reducing harmful content while minimally impacting usefulness, and efficient, given the massive size of pretraining corpora.
Anthropic tested six methods for detecting CBRN content, including fine-tuned and prompted constitutional classifiers, holdout loss models, FastText classifiers, and named entity string matching, with different backbone models depending on computational expense and efficiency.
Classifiers were evaluated using synthetic labeled data generated by prompting LLMs to produce harmless and harmful documents, with the prompted constitutional classifier achieving the highest F1 score, followed closely by the fine-tuned classifier.
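F1 is the metric used to rank the classifiers above; it balances precision (how many flagged documents were truly harmful) against recall (how many harmful documents were caught). A minimal sketch, with invented labels and predictions rather than Anthropic's evaluation data:

```python
def f1_score(labels: list[int], preds: list[int]) -> float:
    # 1 = harmful, 0 = harmless.
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

labels = [1, 1, 0, 0, 1]   # synthetic ground-truth labels
preds  = [1, 0, 0, 0, 1]   # hypothetical classifier output
print(f1_score(labels, preds))  # 0.8: perfect precision, one missed harmful doc
```

A high F1 matters here because both failure modes are costly: false negatives leave hazardous material in the corpus, while false positives throw away benign (often scientific) text.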
Due to computational constraints, the smaller fine-tuned classifier was used to scan the full pretraining corpus, with additional variants combining it with other methods for reranking or parallel flagging of harmful content.
To measure the impact of data filtering, the company pretrained paired models from scratch, comparing unfiltered datasets containing both harmful and harmless content to filtered datasets containing only harmless content. Harmful capabilities were evaluated using multiple-choice questions from WMDP, while harmless capabilities were assessed with questions from MMLU, as well as Prose, Code, and Math tasks, ensuring that filtering did not inadvertently reduce usefulness.
Performance metrics were normalized relative to no filtering, with effective filtering reflected by a substantial reduction in harmful capabilities and minimal change in harmless ones.
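One way to express this normalization, consistent with the "reduction beyond a random baseline" framing in Anthropic's post, is to measure what fraction of above-chance capability the filtered model retains. The exact formula and all accuracy numbers below are illustrative assumptions, not figures from the study:

```python
def normalized_above_chance(filtered: float, unfiltered: float, chance: float) -> float:
    # Fraction of above-chance capability the filtered model retains,
    # relative to its unfiltered twin. 1.0 = no change; 0.0 = the
    # filtered model performs at random-guessing level.
    return (filtered - chance) / (unfiltered - chance)

CHANCE = 0.25  # random baseline for 4-option multiple-choice (WMDP, MMLU)

harmful = normalized_above_chance(0.49, 0.61, CHANCE)   # hypothetical WMDP accuracies
harmless = normalized_above_chance(0.60, 0.61, CHANCE)  # hypothetical MMLU accuracies
print(round(harmful, 2), round(harmless, 2))  # 0.67 0.97
```

With these made-up numbers, the filtered model retains about 67% of above-chance harmful capability (a roughly 33% reduction) while harmless performance is essentially unchanged, mirroring the shape of the result Anthropic reports.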
Among the tested filtering methods, the fine-tuned constitutional classifier alone emerged as the most effective, successfully separating harmful from harmless content and allowing adjustable safety-usefulness trade-offs. At a selected threshold, the approach reduced harmful capability performance by 33% while causing no significant drop across harmless evaluations, demonstrating that pretraining data filtering can enhance model safety without compromising utility.
If you’re interested in joining us to work on these and related issues, you can apply for our Research Engineer/Scientist role (https://t.co/x3G4F5qVWv) on the Alignment Science team.
— Anthropic (@AnthropicAI) August 22, 2025
Anthropic's research on pretraining data filtering shows that it can reduce harmful capabilities in AI models while largely preserving their usefulness.
By removing high-risk content from the training data, the method addresses the potential for misuse without significantly impairing general knowledge or problem-solving ability. This demonstrates that interventions at the data level can directly influence the safety profile of a model.
In the test, the researchers focused on sensitive and potentially dangerous information, such as details on chemical, biological, radiological, and nuclear (CBRN) weapons. This is because misuse of this information by individuals with basic technical knowledge could pose serious risks. But the work also highlights broader opportunities for mitigating AI alignment risks.
For example, excluding material can help mitigate AI's misaligned behaviors, such as scheming, reward manipulation, or deceptive strategies.
The research suggests that careful curation of pretraining data can be an effective tool in shaping model behavior before deployment.
Future research could improve the accuracy and efficiency of classifiers used for filtering or explore alternative approaches, such as influence functions, to identify high-impact examples. Evaluating how models trained on filtered data respond to adversarial finetuning or in-context prompting would further clarify the robustness of pretraining interventions.
Overall, the findings indicate that pretraining data filtering can play a substantive role in producing models that are both capable and safer, providing a basis for continued investigation into methods that reduce potential risks while maintaining broad usefulness.