Largest Cache of Internet Data Released By Yahoo!: Aiming To Make Computers Smarter

Artificial intelligence (AI) is the next big thing. Yahoo! confirmed as much when CEO Marissa Mayer talked about empowering employees to be creative and innovative. While Yahoo! still lags behind Google, Facebook and others pursuing AI development, the company wants to participate, and it is doing so by releasing a cache of its internet data, the biggest ever released.

Through its ongoing program, Yahoo! Labs Webscope, the company is set to release 13.5TB worth of data gathered from an estimated 20 million Yahoo! users from February to May 2015 across several Yahoo! properties: Yahoo! homepage, Finance, Sports, News and Yahoo! Real Estate.

The massive data set comes from 110 billion events drawn from that sample of users.

Apart from the interactions, the data set also includes basic demographic information such as age, gender and other generalized geographic data, while items in the dataset include title, summary, and key phrases of the news article in question, plus local timestamps, and partial device information.

Users may not have to worry about privacy. According to Yahoo!, the company anonymizes users' data, so any data given out for research can't be traced back to the user.

Describing the company's security measures, Yahoo! Labs' Chief Research Scientist Ricardo Baeza-Yates said that the data was scrubbed to create a strong barrier against tracing individual identities.

With users' privacy intact, Yahoo! has less to worry about when the anonymized data is given to academic research communities that want to contribute to the advancement of AI and machine learning.

"Data is the lifeblood of research in machine learning. However, access to truly large-scale datasets is a privilege that has been traditionally reserved for machine learning researchers and data scientists working at large companies - and out of reach for most academic researchers," explained Suju Rajan, Director of Personalization Science at Yahoo! Labs.

Providing such a massive data set is a significant step in the company's effort to attract talented people to pave the future of AI.

Learning About Patterns and Behavior

AI hungers for data, and neural networks work at their best when they're fed more than sufficient data. While companies have tried to give their AI data about whatever they can, including cats, Yahoo! wants to help AI researchers by giving them all the data they need.

On its own, AI knows nothing beyond the code it is programmed with. What makes such systems interesting is that they can "learn" and be "taught". With the large-scale data Yahoo! is providing, researchers can train neural networks to determine the patterns and behavior of users.

When users browse the web, they need a device running a browser with a certain IP address. They access the web from somewhere, searching for information by landing on web pages. The pages that attract them have headlines with content below. Combine those interactions with the data users gave Yahoo! permission to know, and with the data Yahoo! was able to gather on its own, and the result is 13.5TB, a huge amount of data indeed.

Fed with this data, academic and industrial researchers should be able to make AI determine (learn) the patterns of headlines or design features that attracted a specific audience group. Determining those patterns can greatly contribute to how machines learn.

Furthermore, researchers can also incorporate real-world user behavior in their projects.

A huge quantity of data about human behavior is essential for machine learning. AI's role is to let a computer automatically spot complex patterns and figure out a solution. The data provided by the tech giant, which is about two-thirds the size of the Library of Congress, should be valuable.

The large cache is also considered valuable because it will help researchers build the kind of large-scale algorithms used in corporate settings, as opposed to algorithms designed for less data. Yahoo! has released caches before, but they were much smaller.

This isn't the first time Yahoo! has released its data. Since 2006, it has released more than 50 data sets encompassing advertising, social data and others, including 100 million Flickr photos in 2014. Yahoo!'s largest past contribution was 413GB, and the largest data set ever contributed stood at 1TB before being dwarfed by Yahoo!'s 13.5TB release.

Yahoo! is Struggling. Finding Ways To Get More People

Yahoo! thrived in the early commercialization of the web. As a web directory, now defunct, Yahoo! categorized websites for easier browsing. Yahoo! was at its peak before search engines, notably Google, came into play.

Since then, Yahoo!'s business has been steadily declining. Throughout this tough time, the company has kept looking for ways to get more people on board and to keep its millions of loyal visitors loyal. Its efforts include revamping and redesigning its services, reintroducing Yahoo! Messenger, and attempting to divest its share of Alibaba, among others.

Now Yahoo! is trying to attract researchers in order to benefit from AI, one of the fastest-growing and most competitive fields. However, Yahoo!'s decision comes rather late, and the company is not alone in the race.

A normal race is all about who gets to the finish line first, not who gives the best speech. But when the goal is a future where technology benefits everyone, the race is not about who is winning. It's about how much supporters appreciate the work, which is what lets the winner bring its name up to the podium.

Competitors such as Google with TensorFlow, as well as Amazon, Microsoft and IBM, are notable names. They took their steps earlier than Yahoo!, but their contributions, in terms of the quantity of data, are significantly smaller than Yahoo!'s. Yahoo! usually takes fewer risks than others when it comes to revealing trade secrets, but this time it took a bold step, believing that the reward could be bigger.

So if Yahoo! succeeds by enabling researchers to accelerate the pace of innovation, Yahoo!, too, will benefit by being able to take those learnings and apply them to its own products.
