The Diamond-Water Paradox, And How LLM AI Requires Copyrighted Data To Function

The abundance of data on the internet has led to a shift in the economics of information, and Large-Language Model AIs have disrupted it.

Traditionally, the price of a good or service is determined by its scarcity and demand. The abundance of data on the internet, however, has made this principle difficult to apply. The cost of producing and distributing digital data is negligible, which means the supply of data is virtually infinite. Yet the demand for data remains high.

This means that the price of data has not decreased despite its abundance.

This phenomenon is known as the paradox of value or the diamond-water paradox.

The paradox states that the value of a good is not determined by its usefulness but by its scarcity. For example, water is essential for life, but it is abundant and therefore has a low price. On the other hand, diamonds are not essential for life, but they are scarce and therefore have a high price.

The value of data is determined not by its mere usefulness but by its ability to generate insights and drive decision-making. Because data remains valuable to businesses and individuals, demand stays high, and its price has not fallen despite its abundance.

And more importantly, data is required to create the AI models that power OpenAI's ChatGPT, Google's Imagen, Microsoft's Copilot, Stability AI's Stable Diffusion, and more.

And even more importantly, LLM AI models require so much data that the companies behind them need to reach beyond what is freely available to feed these models' gluttony for data.

According to OpenAI, LLM AIs would be impossible to create without using copyrighted material for training.

LLM AIs are trained on a vast trove of data taken from the internet, much of it covered by legal protection against someone's work being used without permission.

The media has called this "unlawful," but according to OpenAI, it is a necessity.

As explained by OpenAI:

"Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials."

"Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens."

OpenAI and Journalism

OpenAI argues that limiting training materials to only out-of-copyright books and drawings would produce inadequate AI systems.

In a blog post, OpenAI said that scraping of others' online works falls within the purview of "fair use."

"The principle that training AI models is permitted as a fair use is supported by a wide range of [people and organizations]," OpenAI explained.

After all, if an AI company's mission is to ensure that the development of Artificial General Intelligence benefits all of humanity, why should anyone object?

OpenAI said it believed that "legally, copyright law does not forbid training."

LLMs rely on massive datasets of text and code for training. This training data allows them to learn patterns and relationships, ultimately enabling them to perform tasks like generating text, translating languages, and writing different kinds of creative content. However, some of this training data may be copyrighted material, raising legal and ethical concerns.
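To make "learning patterns from text" concrete, here is a minimal sketch of how training text is commonly turned into (context, next-token) prediction pairs. This is an illustrative toy, not OpenAI's actual pipeline: real systems use subword tokenizers and far larger contexts, and the function name here is invented for the example.

```python
# Toy sketch: LLM training reduces text to (context, next_token) pairs
# that the model learns to predict. Illustrative only.

def make_training_pairs(text, context_size=3):
    """Split text into tokens and emit (context, next_token) pairs."""
    tokens = text.split()  # real systems use subword tokenizers, not split()
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]  # the preceding window of tokens
        pairs.append((context, tokens[i]))    # the token the model must predict
    return pairs

pairs = make_training_pairs("the value of data is not determined by scarcity alone")
for context, target in pairs[:3]:
    print(context, "->", target)
```

The point of the sketch is that the model never stores the source text as a retrievable document; it consumes it as statistical prediction targets, which is the basis of the "transformative use" argument discussed below.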

OpenAI and other companies argue that fair use principles may apply, since the model uses the material for a transformative purpose (learning patterns, not directly copying expression). Critics counter that fair use does not extend to commercial applications.

The paradox here relates to the idea of intrinsic versus extrinsic value.

Seen through the diamond-water paradox, copyright grants extrinsic value to creative works, but LLMs draw on them for their intrinsic information, not necessarily replicating their creative expression.

The debate lies in balancing the rights of copyright holders with the potential benefits of AI development.

Should creative expression be shielded, even if it hinders technological progress? Or should fair use be expanded to accommodate transformative AI applications?

The relationship between LLM training and the diamond-water paradox highlights the intricate interplay between intellectual property, technological innovation, and societal values.

Finding a fair and sustainable path forward requires ongoing dialogue and a nuanced understanding of both the extrinsic and intrinsic values at stake.