The Transparency Question: OpenAI Sora 'Used Publicly Available And Licensed Data'

Mira Murati
CTO of OpenAI

It's important for a strategic leader to exude confidence, but it's equally important to be authentic and acknowledge limitations. It's about striking that balance, and demonstrating the qualities that inspire respect.

Since OpenAI ignited the generative AI trend with the launch of ChatGPT, along with DALL·E, DALL·E 2, DALL·E 3, and later Sora, the company has managed to maintain a competitive edge, with pretty much every tech giant trailing behind it.

In an overly hyped industry, where people are still curious about the technology, an "I don't know" may not be the best answer.

Mira Murati, the CTO of OpenAI, sat down for an exclusive Wall Street Journal interview with tech columnist Joanna Stern, and somehow cast doubt and uncertainty on everything from the company's transparency to the commercial future of human data.


When asked about the data OpenAI used to train Sora, Murati responded:

"We used publicly available and licensed data."

But when asked whether OpenAI used videos from YouTube as training data, Murati scrunched up her face.

"I’m actually not sure about that."

And when asked about videos from Facebook and Instagram, she hesitated and rambled at first.

"You know, if they were publicly available, publicly available to use, there might be the data, but I'm not sure. I'm not confident about it."

After being asked whether OpenAI used data from Shutterstock, Murati concluded:

"I'm just not gonna go into the details of the data that was used, but it was publicly available or licensed data."

The executive seemingly tried to dodge the questions with a template-like answer, delivered in a manner almost as mechanical and automatic as OpenAI's own products.

Murati's interview may not be considered a PR masterpiece, but as an executive at the company, there was no chance Murati would be fully transparent and provide details.

Especially not while OpenAI is still riddled with copyright-related lawsuits.

Her reaction and her responses to the questions are expected, given that people are lambasting OpenAI all over the internet.

It was her genuine reaction to questions she knew she would have to answer sooner or later.

And during the interview, she played it safe.

But for the industry, this highlights the murky future of human data, the limits of people's privacy, and the question of what they actually own on the web.

Data collection has a long history, and has mostly been used for marketing and advertising.

At least, that's the theory, because data brokers have turned data into a commodity, and online platforms have turned privacy into a lucrative business.

And this time, generative AIs from OpenAI and others, like Google and Meta, practically 'demand' access to the entire World Wide Web, and their makers are willing to pay for data behind paywalls and data they cannot easily access through URLs.

As a result, generative AI models have indirectly "stolen" user data, and in turn threatened people's work and jobs, and there is little anyone can do about it.

As more people become educated about generative AI, the question is: will the public accept that the YouTube videos they post, the Instagram videos they share, and the Facebook posts they set to "public" are used to train commercial AI models that make tech companies tons of money? And is there any compensation?

The issue of training data is not simply a matter of copyright; it's also a matter of trust and transparency.

OpenAI and others don't seem to care much about "public" opinion as they push toward whatever they believe AGI is.

The devil is in the data, and companies like OpenAI, Google, and Meta are still taking their chances.