The internet has grown into a massive network with billions of people connected daily. The numbers keep increasing as the network itself grows and as handheld devices and cheaper connections become more widely available.
Even at its relatively young age, the internet has become a world of its own, where any user's contribution can be viewed by anyone else, anywhere, at any time.
Despite the massive computing resources that search engine companies dedicate to crawling and indexing the countless documents on the internet, search engines still cannot do what nearly any human can: understand. That is because web indexing has been based on the words found on webpages, not on what those words mean.
Shashi Thakur, a technical lead for Google's search team, said that since the beginning, search engines have essentially matched strings of text. When people match strings, they get no sense of what those strings mean. Users should have a connection to real-world knowledge of things, their properties, and their connections to other things.
Making those connections is the reason for recent major changes at search giants Google and Microsoft. Google’s Knowledge Graph and Microsoft’s Satori both extract data from the unstructured information on webpages to create a structured, organized database of the “nouns” of the internet, such as things, people, and places, along with the relationships between them. The changes are not merely cosmetic; for Google, this was the company's biggest retooling of its search engine since the launch of "universal search" in 2007.
These efforts build on ideas proposed by a Yahoo! Research team in a 2009 paper called "A Web of Concepts," in which the researchers outlined an approach to extracting conceptual information from the wider web to create a more knowledge-driven approach to search. They defined three key elements of a true "web of concepts":
- Information extraction: pulling structured data (addresses, phone numbers, prices, stock numbers, etc.) out of web documents and associating it with an entity.
- Linking: mapping the relationships between entities (connecting each noun to other entities that share the same purpose, meaning, or history).
- Analysis: discovering and categorizing information about an entity from the content or from sentiment data.
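As a toy illustration of the first element, information extraction, a short script might pull structured fields such as phone numbers and prices out of raw page text and attach them to an entity record. The patterns and entity structure below are hypothetical examples, not how Yahoo!, Google, or Microsoft actually implement extraction:

```python
import re

# Hypothetical page text from which to extract structured fields.
page_text = """
Luigi's Pizzeria - downtown location.
Call (555) 123-4567. Large pizza $14.99.
"""

def extract_fields(text):
    """Pull phone numbers and prices out of unstructured text."""
    phones = re.findall(r"\(\d{3}\) \d{3}-\d{4}", text)
    prices = re.findall(r"\$\d+(?:\.\d{2})?", text)
    return {"phones": phones, "prices": prices}

# Associate the extracted data with a named entity.
entity = {"name": "Luigi's Pizzeria", "type": "restaurant"}
entity.update(extract_fields(page_text))
print(entity["phones"])  # ['(555) 123-4567']
print(entity["prices"])  # ['$14.99']
```

Real systems rely on far more robust language processing than regular expressions, but the output, structured data tied to an entity, has the same shape.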
Both Google and Microsoft have only begun to tap the power of that kind of knowledge, as their respective entity databases are still in their infancy. As of June 1, 2012, Satori had mapped over 400 million entities and Knowledge Graph had reached over half a billion.
Graphing the World Wide Web
Entity extraction itself is not exactly new to search engines; Microsoft, for instance, acquired language processing-based entity extraction technology when it bought FAST Search and Transfer in 2008. What sets Google and Microsoft apart now is the scope they plan for their entity databases, the relationships and actions they expose through search, and the underlying data infrastructure that handles this massive number of objects within the fraction of a second required to render a search result.
Knowledge Graph and Satori are both already massive databases. Rather than being based on relational or object database models, they are graph databases, built on the same graph theory approach that Facebook’s Open Graph uses to map relationships between its users and their various activities. Graph databases are based on entities ("nodes") and the mapped relationships ("links") between them. The web itself is a graph database of sorts, with its pages as nodes and the hyperlinks connecting them as the relationships.
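The node-and-link model can be sketched in a few lines of code. This is a deliberately minimal in-memory graph with invented page names, not a reflection of how Knowledge Graph or Satori are actually built:

```python
# Minimal in-memory graph: nodes hold properties, edges are labeled links.
class Graph:
    def __init__(self):
        self.nodes = {}   # node id -> property dict
        self.edges = []   # (source id, relationship label, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def neighbors(self, node_id, label=None):
        """All nodes this node links to, optionally filtered by label."""
        return [dst for src, lbl, dst in self.edges
                if src == node_id and (label is None or lbl == label)]

# The web itself fits this model: pages are nodes, hyperlinks are edges.
web = Graph()
web.add_node("home.html", title="Home")
web.add_node("about.html", title="About")
web.add_edge("home.html", "links_to", "about.html")
print(web.neighbors("home.html"))  # ['about.html']
```

The appeal of this model for search is that traversing relationships (follow "links_to", "directed_by", "located_in", and so on) is a cheap, local operation rather than a costly relational join.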
Google’s Knowledge Graph is derived from Freebase, a proprietary graph database that Google acquired in July 2010 when it bought Metaweb Technologies, Inc. Shashi Thakur, technical lead on Knowledge Graph, says that additional development has been done to make the database meet Google’s capacity requirements. Based on its architecture, Knowledge Graph may also rely on batch processes powered by Pregel, the high-performance graph processing engine that Google developed to handle many of its web indexing tasks.
Microsoft’s Satori is a graph-based repository that grew out of Trinity, Microsoft Research’s graph database and computing platform. Satori uses the Resource Description Framework (RDF) and the SPARQL query language, and it was designed to handle billions of RDF entities.
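RDF represents everything as subject-predicate-object triples, and a SPARQL query is essentially a pattern matched over those triples. A toy triple store with wildcard matching, which is not real SPARQL and not Satori's actual interface, conveys the idea. The entity names are invented:

```python
# Toy RDF-style triple store. None acts as a wildcard, like a SPARQL variable.
triples = [
    ("ent:seattle", "type", "city"),
    ("ent:seattle", "located_in", "ent:washington"),
    ("ent:spaceneedle", "located_in", "ent:seattle"),
]

def match(store, s=None, p=None, o=None):
    """Return all triples matching the (subject, predicate, object) pattern."""
    return [(ts, tp, to) for ts, tp, to in store
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Roughly analogous to: SELECT ?s WHERE { ?s located_in ent:seattle }
print(match(triples, p="located_in", o="ent:seattle"))
# [('ent:spaceneedle', 'located_in', 'ent:seattle')]
```

The engineering challenge the article describes is doing exactly this kind of pattern matching, but over billions of triples, in tens of milliseconds.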
The entities in both Knowledge Graph and Satori are essentially data objects, each with a unique identifier, a collection of properties based on the attributes of the real-world topic they represent, and links representing the topic’s relationships to other entities. They also include actions that users searching for that topic might want to take.
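That description maps naturally onto a simple data object. The fields below mirror the four ingredients named above (identifier, properties, links, actions); the concrete values are invented for illustration and do not reflect either company's internal schema:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: str                                    # unique identifier
    entity_type: str                                  # e.g. "film", "person", "place"
    properties: dict = field(default_factory=dict)    # real-world attributes
    links: list = field(default_factory=list)         # (relationship, entity id) pairs
    actions: list = field(default_factory=list)       # things a searcher might do

movie = Entity(
    entity_id="ent:42",
    entity_type="film",
    properties={"title": "Example Film", "year": 1999},
    links=[("directed_by", "ent:7")],
    actions=["watch trailer", "buy tickets"],
)
print(movie.actions)  # ['watch trailer', 'buy tickets']
```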
An entity has its own specific set of properties. These properties can be inherited from one type of entity by another, and parts of an entity's information can be linked to other entities.
Google’s Knowledge Graph is based on the same principles, with some significant changes to make it scale to Google’s needs. Shashi Thakur said that when Google purchased Metaweb, Freebase’s database already held 12 million entities; Knowledge Graph now tracks 500 million entities with more than 3.5 billion relationships between them. To keep the entities from becoming bloated with underused data that would hinder scaling up the Knowledge Graph, Google’s team threw out Freebase's user-defined schemas and relied on Google's search query stream instead.
Microsoft, on the other hand, used its own search stream to help with modeling, focusing its query processing on mining searches for actions rather than for the specific types of properties people searched for. Researchers from Microsoft Research’s Natural Language Processing group described the process of creating “Active Objects,” a set of dynamic actions that can be assigned as properties to certain types of entities.
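The Active Objects idea, attaching a set of likely actions to each type of entity, can be sketched as a lookup table mined from the query stream. The type names and actions here are hypothetical examples, not Microsoft's actual data or API:

```python
# Hypothetical mapping from entity type to actions mined from the query stream.
ACTIONS_BY_TYPE = {
    "movie": ["watch trailer", "buy tickets", "read reviews"],
    "restaurant": ["book a table", "view menu", "get directions"],
    "song": ["listen", "view lyrics", "buy"],
}

def actions_for(entity_type):
    """Return the dynamic actions attached to a type of entity."""
    return ACTIONS_BY_TYPE.get(entity_type, [])

print(actions_for("restaurant"))  # ['book a table', 'view menu', 'get directions']
```

In a live system the table itself would be learned: if millions of queries pair restaurant names with "menu" or "reservations," those become the actions surfaced for every restaurant entity.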
Preparing for the Internet
Both Google and Microsoft gained plenty of experience extracting content from webpages in their early years. Populating the Knowledge Graph and Satori entity databases starts in much the same way as web indexing, with web crawlers gathering text from countless webpages.
When Satori’s crawler discovers new objects within webpages, entities are created for them before being added to the database. All the information on an entity is rarely found in one place. When crawlers find another page containing an entity Satori has already identified, they tag the page with that entity’s signature. New characteristics Satori finds during later processing get aggregated into the original entity.
As Satori processes the webpages identified as being related to an entity, it goes through the content and pulls out additional characteristics, eventually building a model based on all the available data scattered across the web.
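Aggregating characteristics found on later pages into the original entity amounts to a merge step. A naive version, which keeps the first value seen for any conflicting property (a production system would need far more careful conflict resolution), might look like this; the entity and its fields are invented:

```python
def merge_characteristics(entity, new_props):
    """Fold newly discovered properties into an existing entity record.

    Properties already present are kept (naive first-wins policy);
    genuinely new properties are added.
    """
    for key, value in new_props.items():
        entity.setdefault(key, value)
    return entity

# Entity first created from one page, then enriched from another.
entity = {"id": "ent:99", "name": "Example Cafe"}
merge_characteristics(entity, {"phone": "(555) 000-1111", "name": "EXAMPLE CAFE"})
print(entity["name"])   # 'Example Cafe'  (original value kept)
print(entity["phone"])  # '(555) 000-1111'
```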
Google’s extraction process also starts with its crawl of webpages. Shashi Thakur said that as Google’s crawlers crawl through documents, "there are text interpretations that happen" to determine the topic of a page. As pages are indexed by Google, they are also processed by natural language-based semantic tools.
But neither Google nor Microsoft sends its entity extractors out into the wilderness of the web unprepared. Both had data already on hand to seed their entity engines with well-structured starter entities. In Google’s case, the Knowledge Graph team was able to begin with the 12 million entities that Freebase's user community had already contributed, as well as open sources of data such as Wikipedia. Microsoft was more focused on providing objects that would best play into its Active Objects model.
Pushing the Boundaries
Despite their similar technology, Google's and Microsoft's new results are colored mostly by how each chose to seed its data. Because they have been built largely around the most popular sorts of searches, both entity databases still contain plenty of holes and weaknesses.
Filling those holes might not be a big problem. But the larger the semantic databases behind Knowledge Graph and Satori grow, the more they will potentially drag on search performance. Google already prefetches masses of Knowledge Graph results into cache to avoid a performance hit. And while Satori's technology base is designed to grow to 800 million entities while still handling queries in 50 milliseconds, the Bing Satori store is already halfway to that number after just a few weeks of crawling the web.
Google and Microsoft are already exploiting their entity databases beyond their main search pages. But entity-based search is still in its earliest stages.
Knowledge Graph and Satori currently operate on US English content, with additional languages being added to the search engines' entity extraction language processing. The number of entities and relationships they have to manage is bound to explode, in both number and complexity.
To truly "understand" the web, Knowledge Graph and Satori are going to have to get a lot smarter. And they will push the bounds of semantic processing and computing forward as ever-bigger graphs of knowledge are loaded into their memories.