OpenAI 'GPTBot' Is An Ambitious But Risky Way For An AI To Consume The Entire Internet

OpenAI WWW

Training data, the only thing that makes AI smart and smarter.

But just like anything else out there, training data to train AIs are limited. No matter how fast new datasets are compiled or created, AIs can consume them a lot faster, and sooner than later, they will run out of data to train from.

The solution to this problem, is by tapping to the vast World Wide Web.

As the place where practically every human knowledge that is documented into the digital sphere is accessible through internet connection, the web is extremely big for any single being, or thing, to consume.

This is why OpenAI, the creator of ChatGPT, is launching what it calls the 'GPTBot'.

With the bot that was introduced without fanfare or an official announcement, OpenAI want to start scanning website content to extract data to train its ever-hungry large language models (LLMs).

GPTBot marks a remarkable stride towards the evolution of AI, because the crawler is designed to fuel the growth of OpenAI's AI models, like GPT-4 and future iterations.

With the bot, OpenAI shows its unwavering commitment to refining AI capabilities by harnessing a diverse array of internet-derived knowledge.

The idea is to amass a broad spectrum of information, which is crucial to train and improve its AI.

The goal is that, by aggregating data from a myriad of sources, GPTBot shall help empower OpenAI's generative AI product, so it can generate responses that are not only accurate but also deeply contextual.

This should help create an AI that is more powerful and more capable.

This groundbreaking initiative positions GPTBot as a key player in shaping the future of artificial intelligence technology.

Shortly after the launch of GPTBot circulated publicly, OpenAI announced a $395,000 grant and partnership with New York University’s Arthur L. Carter Journalism Institute. Led by former Reuters editor-in-chief Stephen Adler, NYU’s Ethics and Journalism Initiative aims to aid students in developing responsible ways to leverage AI in the news business.

"We are excited about the potential of the new Ethics and Journalism Initiative and very pleased to support its goal of addressing a broad array of challenges journalists face when striving to practice their profession ethically and responsibly, especially those related to the implementation of AI," said Tom Rubin, OpenAI’s chief of intellectual property and content, in a release.

Read: Copyright Laws Should Allow AI To Scrape Data From The Internet For Training Purposes, Proposed Google

ChatGPT web scraping.

To make this happen, GPTBot operates by scouting the internet to seek web content that is publicly accessible.

While this certainly provides a way for OpenAI to consume the entire web by utilizing its available resources, the move can violate a lot of privacy.

This is because not every information that is on the internet, is there to see or publicize.

Some information the web has are already decades old, created or uploaded without the subject's consent, or embarrassing.

It may contain long-forgotten sensitive data that should no longer be revealed, lost information that is created for malicious purposes, and more

By training AI using these information, OpenAI's generative AI can spill unwanted information to the public.

OpenAI knows this.

"Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety. Below, we also share how to disallow GPTBot from accessing your site."

This is why when OpenAI added the GPTBot support page, it also introduced a way for website owners and webmasters to block the service from scraping content from their website.

According to OpenAI, all it needs is a small modification to their website to stop GPTBot from ever crawling the website.

"We periodically collect public data from the internet which may be used to improve the capabilities, accuracy, and safety of future models," said an OpenAI spokesperson.

"On our website, we provide instructions on how to disallow our collection bot from accessing a site. Web pages are filtered to remove sources that have paywalls, are known to gather personally identifiable information (PII), or have text that violates our policies."

GPTBot is recognizable by the following user agent token and the entire user-agent string:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

To block it, websites just need to add the following line to their robots.txt file:

User-agent: GPTBot
Disallow: /

And to grant GPTBot access to only certain parts of a website, the following can be used:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

However, due to how extensively an crawler can scrape the web, there is no guarantee that simply blocking GPTBot can completely stop content from being included to OpenAI's LLM training data.

While GPTBot does not intrude upon private or restricted material, and adheres to ethical guidelines by respecting, there is no way of ensuring a particular information is not gathered.

Another concerning thing is that, the modification must be done manually.

In other words, GPTBot is opt-in by default, and websites need to opt-out manually.

Responding to OpenAI's move, a number of outlets and news publications started to flag GPTBot to prevent the bot from grabbing their content.

Data scraping practices have been commencing for decades, with both good bots and bad bots scouting the web to extract information.

But this time, OpenAI is taking data scraping to a new level, because the practice is not only to obtain information, but to reuse the information to power a 'being' that hallucinates and can falsify information.

Published: 
13/08/2023