Background

Google Open Sources Tool That Can Identify And Watermark AI-Generated Content


Generative AI is growing fast, and as adoption rises, there is one big consequence.

That consequence is that AI-generated content is becoming more widely available and is being shared across the web and beyond. At first glance, this may not seem like a significant problem. However, many people have been using the technology to disseminate misinformation, disinformation and malinformation, causing damage in the long run.

Google has been in the AI field for a long time. In fact, its Google DeepMind unit has been actively developing AI tools to enhance other industries.

Now, the British-American artificial intelligence research laboratory, which is a subsidiary of Google, is releasing what it calls the 'SynthID' tool.

According to the company, SynthID is a tool for watermarking and "identifying AI-generated content" so that it can be told apart from other content.

SynthID uses deep learning algorithms to automatically add a watermark to AI-made content.

It can also scan content to determine whether portions of it were created by AI.

The company claims these watermarks remain detectable, even after alterations like cropping, filtering, color changes, and compression.

The watermarks are also invisible to the human eye.

"SynthID’s watermarking technique is imperceptible to humans but detectable for identification," said DeepMind on a dedicated web page.

The main purpose of SynthID is to automate the detection and labeling of AI-generated content on a large scale, aiming to prevent misuse, such as deepfakes, misinformation, or financial fraud.

SynthID works at scale by integrating watermarking with speculative sampling, an efficiency technique commonly used in production systems.

For text, SynthID adjusts the probability of token selection during the AI generation process and can apply watermarks to content as short as three sentences.

SynthID takes advantage of the way large language models generate text: one token at a time.

These tokens can represent a single character, a word, or part of a phrase. To create a sequence of coherent text, the model predicts the next most likely token to generate. These predictions are based on the preceding words and the probability scores assigned to each potential token. SynthID adjusts these probability scores as the text is generated.

This adjustment is repeated throughout the generated text, so a single sentence might contain ten or more adjusted probability scores, and a page could contain hundreds. The final pattern of the model’s word choices, combined with the adjusted probability scores, is considered the watermark.
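To make the idea concrete, here is a minimal Python sketch of probability-adjusting text watermarking. It is a simplified keyed score-bias scheme written for illustration, not SynthID's actual algorithm. The vocabulary, the uniform stand-in "model", the key value, and every function name (g_score, adjust, generate, detect) are assumptions made for this example.

```python
import hashlib
import math
import random

# Tiny illustrative vocabulary (an assumption; real models use far larger ones).
VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "on", "under",
         "mat", "tree", "quickly", "slowly"]


def g_score(key, context, token):
    """Keyed pseudorandom score in [0, 1) for a token given its recent context."""
    data = f"{key}|{'|'.join(context)}|{token}".encode()
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def toy_model_probs(context):
    """Stand-in for an LLM: a uniform distribution over the vocabulary."""
    p = 1.0 / len(VOCAB)
    return {tok: p for tok in VOCAB}


def adjust(probs, key, context, strength):
    """Reweight probabilities toward tokens with a high keyed score."""
    weights = {t: p * math.exp(strength * g_score(key, context, t))
               for t, p in probs.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def generate(n_tokens, key=None, strength=4.0, window=2, seed=0):
    """Sample tokens one at a time; adjust the probabilities if a key is given."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_tokens):
        context = tuple(out[-window:])
        probs = toy_model_probs(context)
        if key is not None:
            probs = adjust(probs, key, context, strength)
        tokens, weights = zip(*probs.items())
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out


def detect(tokens, key, window=2):
    """Mean keyed score: near 0.5 for unwatermarked text, higher if watermarked."""
    scores = [g_score(key, tuple(tokens[max(0, i - window):i]), tok)
              for i, tok in enumerate(tokens)]
    return sum(scores) / len(scores)


KEY = 42  # hypothetical secret watermarking key
watermarked = generate(200, key=KEY)
plain = generate(200, key=None, seed=1)
print("watermarked score:", round(detect(watermarked, KEY), 3))
print("plain score:", round(detect(plain, KEY), 3))
```

In this toy setup, the detector never sees the model. It only needs the key and the text, and it flags text whose average keyed score sits well above what chance would produce. Longer texts give the detector more adjusted choices to average over, which fits the point that a page carries far more signal than a single sentence.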

For images and video, SynthID can embed watermarks directly into pixels and frames, with Google asserting that they can endure alterations like cropping or compression.

For audio, SynthID transforms sound waves into spectrograms, embedding watermarks that remain intact even through compression or speed changes.
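As a rough illustration of the first step only, the following Python sketch turns a sound wave into a magnitude spectrogram using a short-time Fourier transform. It does not embed a watermark, and it is not SynthID's implementation. The frame size, hop length, sample rate, and test tone are all assumptions chosen to keep the example small.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per frame."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        bins = []
        # Naive DFT over the non-negative frequencies (fine for tiny inputs).
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


SAMPLE_RATE = 8000  # assumed sample rate in Hz
# A pure 1000 Hz test tone, 512 samples long.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(tone)

# Each bin spans SAMPLE_RATE / frame_size = 125 Hz, so 1000 Hz lands in bin 8.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
print("frames:", len(spec), "peak bin per frame:", set(peak_bins))
```

A spectrogram like this represents audio as a time-by-frequency grid, which is the kind of representation a watermark can then be embedded into.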

According to Pushmeet Kohli, the vice president of research at Google DeepMind:

“Now, other [generative] AI developers will be able to use this technology to help them detect whether text outputs have come from their own [large language models], making it easier for more developers to build AI responsibly.”

Watermarking for AI-generated content has faced challenges in being production-ready due to strict quality, detectability, and computational efficiency requirements.

The need is nonetheless pressing, because large language models are actively being used to spread political misinformation, generate nonconsensual sexual content, and serve other malicious purposes.

SynthID was first announced in August 2023, and Google now believes it is ready for production.

SynthID has been incorporated into various Google products and released as open-source software via the Google Responsible Generative AI Toolkit.

Google has also collaborated with Hugging Face to make the technology accessible to developers.

SynthID is remarkable because it doesn’t compromise the quality, accuracy, creativity, or speed of generated content, which has long been an issue with watermarking systems.

Still, it is not perfect.

“SynthID isn’t a silver bullet for identifying AI generated content,” Google wrote in a blog post back in May.

“[But it] is an important building block for developing more reliable AI identification tools and can help millions of people make informed decisions about how they interact with AI-generated content.”

Published: 
25/10/2024