DARPA's Search Engine for the Deep Web: Shining a Light into the Darkness

The deep web isn't a place where most people search for things on the internet. This part of the network, which isn't accessible through normal browsing, is larger than the usual surface web. Popular because of its high anonymity and hidden from most search engines' crawlers, it is widely used for many illegal activities.

Since the government has long tried to track many of those activities with limited success, in February 2015 the U.S. Defense Advanced Research Projects Agency (DARPA) publicly presented for the first time a new set of search tools called Memex, which is meant to improve and ease searches in those dark corners of the web.

The agency also provided a preview of the search engine and explained its use in the fight against cybercrime.

The project, first launched in 2014, was initially made to allow searches of non-indexed content, an operation that in the majority of cases is still carried out manually by intelligence agencies. Memex seeks to develop the next generation of search technologies and revolutionize the discovery, organization and presentation of search results.

"The goal is for users to be able to extend the reach of current search capabilities and quickly and thoroughly organize subsets of information based on individual interests. Memex also aims to produce search results that are more immediately useful to specific domains and tasks, and to improve the ability of military, government and commercial enterprises to find and organize mission-critical publicly available information on the Internet," said the project's official page.

Most information on the deep web consists of unstructured data gathered from multiple sources. Because of this, that information can't be effectively crawled by search engines. And because most people use search engines to find what they want on the internet, the massive amount of information on the deep web goes pretty much unseen.

"The internet is much, much bigger than people think," said DARPA program manager Chris White. "By some estimates Google, Microsoft Bing, and Yahoo! only give us access to around 5 percent of the content on the web."

The most popular subset of the deep web is the Tor network, an anonymizing network that is accessible only with specific software.

To overcome the problem posed by these anonymous hidden networks, Memex was designed to extend the reach of current search capabilities and to quickly and thoroughly organize subsets of information based on individual interests. Memex looks behind standard search results for patterns, links and similar behaviors.

"The main issue we're trying to address is the one-size-fits-all approach to the internet where [search results are] based on consumer advertising and ranking," said White.

At the demo, DARPA ran a search and Memex returned the results. The researchers explained that the data shown is not considered useful or valuable to ordinary web users, but is beneficial to law enforcement and intelligence agencies.

This is because Memex searches the deep web to create data maps of links and patterns by combining memory and index. This information can in turn be further analyzed to identify associations between deep web websites and criminal groups.
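
DARPA has not described here how Memex builds these maps, so the following is only a minimal sketch of the general idea, not Memex's actual method. It assumes a crawler has already produced a list of (source, target) link pairs, builds a simple link map from them, and groups sites that are connected to each other. The site names and the functions `build_link_map` and `find_clusters` are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def build_link_map(links):
    """Build an undirected adjacency map from (source, target) link pairs."""
    graph = defaultdict(set)
    for src, dst in links:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph


def find_clusters(graph):
    """Group sites into connected components with an iterative depth-first search."""
    seen = set()
    clusters = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            # Only follow links to sites not yet assigned to a cluster.
            stack.extend(graph[node] - seen)
        clusters.append(sorted(component))
    return clusters


# Hypothetical crawl output: made-up site names, not real addresses.
links = [
    ("site-a.example", "site-b.example"),
    ("site-b.example", "site-c.example"),
    ("site-d.example", "site-e.example"),
]

clusters = find_clusters(build_link_map(links))
for cluster in clusters:
    print(cluster)
```

In this toy data, the first three sites end up in one group and the last two in another. An analyst could then look at each group as a set of possibly associated sites, which is the kind of pattern the paragraph above describes.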

"We're envisioning a new paradigm for search that would tailor indexed content, search results and interface tools to individual users and specific subject areas, and not the other way around," said White. "By inventing better methods for interacting with and sharing information, we want to improve search for everybody and individualize access to information. Ease of use for non-programmers is essential."

"We're trying to move toward an automated mechanism of finding [deep web websites] and making the public content on them accessible," continued White.

To create Memex, DARPA partnered with 17 teams of researchers from both the academic world and private industry.

The Memex program gets its name from a hypothetical device described in 'As We May Think' - a 1945 article for The Atlantic Monthly written by Vannevar Bush, Director of the U.S. Office of Scientific Research and Development (OSRD) during World War II.

In the article, Memex was described as an analog computer that would supplement human memory by storing and automatically cross-referencing all of the user's books, records and other information.

A similar initiative targeting the deep web is also being developed in the UK.

Deep Web Search Engines

Since the deep web is mostly uncovered and unidentified by normal means, most notable search engines that work on the surface 'clear' web don't venture there. There are three basic characteristics that distinguish a search engine for the deep web from those for the surface web.

The first is the links between onion sites (websites with the .onion suffix). Hidden websites on the Tor network that use this domain suffix don't work well with surface web search engines, since backlink-based algorithms don't perform well on them. Second, surface web search engines favor fast-loading websites. Since websites on Tor are slower, they take longer to crawl, which search engines don't like. Third, onion websites frequently change their addresses, making them difficult for search engines to index, let alone rank.

As a place that once made headlines as the distribution source of the 2014 celebrity hack pictures, the deep web isn't a friendly place. The network is more popular with hackers and cybercriminals, and is used for illegal transactions, child pornography, and more. People lurking there don't fancy new visitors who don't know what they're doing there in the first place.

There are a lot of deep web search engines, but they aren't comparable to those on the surface web. They may suffer temporary outages, be attacked by hackers, or have security issues. Deep web search engines also face another problem: most content there requires user registration.

For most people, the hardest part of navigating the deep web is simply knowing where to start. Since there is a whole other world out there, people just need the right search engine for the job. DARPA's Memex initiative is trying to provide a solution for that.