Copyright Laws Should Allow AI To Scrape Data From The Internet For Training Purposes, Google Proposes

12 August 2023

The internet is huge, and it's still growing.

Thanks to its many, many users, bots, social media, and more, the web, with all of its diversity and complexity, continues to grow with a force nothing can match. Because of this, the amount of information it holds is phenomenal.

No living thing can ever consume that much information.

But AI is not a living thing, and due to its quick and massive development, it's already running out of data to train on.

Besides synthetic data, the internet is an alluring source of training material.

And Google, as one of the pioneers of AI, needs to consume data from the internet to feed its ever-hungry AI.

While it has started scraping data from the internet, Google has run into controversy and copyright law.

Data on the web belongs to its owner: the uploader or creator who put it there first. Google, even though it has crawled the World Wide Web to populate its search engine, does not own the data, and thus cannot use it without the permission of its owners.

This can severely limit the amount of training material it can gather from the web.

Because of this, Google is urging Australian policymakers to revise copyright laws to allow generative AI systems, like Google Bard, to scrape the internet while providing.

But since copyright is a complex matter, and Google knows it cannot win everything, it proposes an opt-out option for publishers.

According to a Google spokesperson, who pointed to a blog post, the tech giant proposes a community-developed web standard, similar to the robots.txt system, that publishers can use to control access to their content.

Through robots.txt, publishers can opt out of having parts of their sites crawled by search engines.

While this can be a win-win solution, the opt-out method means that Google wants the web to allow generative AIs to scrape the web by default.

Publishers have to decide themselves whether AI systems can use their content.

Publishers have to manually opt out if they don't want AIs to scrape their content.
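To make the robots.txt comparison concrete, here is a minimal sketch of how an opt-out check works today, using Python's standard urllib.robotparser module. It shows only the existing robots.txt mechanism, not the new standard Google proposes, whose details are not described here. The crawler names ExampleAIBot and ExampleSearchBot, the example.com URLs, and the may_fetch helper are all hypothetical and chosen purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the publisher opts one (made-up) AI crawler out of
# /articles/ while leaving the site open to every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /articles/

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/articles/story.html"
    print(may_fetch("ExampleAIBot", story))      # opted out of /articles/
    print(may_fetch("ExampleSearchBot", story))  # falls under the * rule
    print(may_fetch("ExampleAIBot", story, ""))  # no rules published at all
```

The last call is the one that matters for the debate: with no rules published, the check comes back as allowed. That is what opt-out means in practice. A publisher who does nothing is treated as having agreed.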


Google knows that its AI systems require millions of data points to produce useful results, and that no source of data is as vast as the internet.

But scraping data from it means that copyright breaches are inevitable.

Some news companies have already started conversations with AI companies about payment for scraping.

The goal of the opt-out system is to make it possible for tech companies to evade making payments for data scraping.

Put another way, with the proposed method, Google and other companies that create generative AI do not want to pay for scraping data.

Google's proposed method can disrupt everything, and experts warned that the opt-out system could eventually turn copyright on its head, potentially harming smaller content creators.

While some experts do believe that Google's proposal might be an attempt to establish early norms that exempt companies from paying for content use, Google

It's worth noting that the call for a fair use exception for AI systems is nothing new for Google, but an opt-out option for publishers is something the company has never proposed before.
