Google Allegedly, And Mistakenly, Leaked Documents On How Its Search Engine Works

28 May 2024

Google Search is the largest search engine on the web.

It's the very product that resides deep in the core of Google's massive empire. It's a money-making product, generating billions upon billions of dollars per year.

It's the thing that makes Google, Google.

And this time, a trove of leaked documents has given the world not only a glimpse, but an unprecedentedly thorough look inside Google Search.

Among the information leaked, the documents revealed some of the most important elements Google uses to rank content.

Among the things that can be found:

2,596 modules are represented in the API documentation with 14,014 attributes.
The documents do not specify how any of the ranking features are weighted; they only show that the features exist.
Twiddlers are re-ranking functions that "can adjust the information retrieval score of a document or change the ranking of a document."
Content can be demoted for a variety of reasons, such as a link that doesn’t match the target site, SERP signals that indicate user dissatisfaction, product reviews, location, exact match domains, and pornography.
Google apparently keeps a copy of every version of every page it has ever indexed. This means Google can “remember” every change ever made to a page.
Links matter, and so do successful clicks.
Longer documents can get truncated, while shorter content gets a score based on originality.
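To make the idea of a re-ranking function concrete, here is a minimal sketch in Python. It is not Google's implementation: the leaked documentation does not describe how Twiddlers are written or weighted, so the Result class, both example functions, the two boolean signals, and the multipliers (1.2 and 0.5) are all hypothetical and exist only to illustrate "adjust the score, then re-sort."

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Result:
    url: str
    score: float                 # initial retrieval score (hypothetical scale)
    is_fresh: bool = False       # hypothetical signal
    link_mismatch: bool = False  # hypothetical signal


# In this sketch, a "twiddler" is any function that takes a result
# and returns a copy with an adjusted score.
Twiddler = Callable[[Result], Result]


def freshness_boost(result: Result) -> Result:
    # Hypothetical boost: the 1.2 multiplier is invented for illustration.
    if result.is_fresh:
        return replace(result, score=result.score * 1.2)
    return result


def link_mismatch_demotion(result: Result) -> Result:
    # Hypothetical demotion: the 0.5 multiplier is invented for illustration.
    if result.link_mismatch:
        return replace(result, score=result.score * 0.5)
    return result


def rerank(results: List[Result], twiddlers: List[Twiddler]) -> List[Result]:
    adjusted = []
    for result in results:
        # Apply every twiddler in order to the same result.
        for twiddle in twiddlers:
            result = twiddle(result)
        adjusted.append(result)
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda r: r.score, reverse=True)


initial = [
    Result("https://example.com/a", 1.0, link_mismatch=True),
    Result("https://example.com/b", 0.8, is_fresh=True),
]
ranked = rerank(initial, [freshness_boost, link_mismatch_demotion])
print([r.url for r in ranked])
```

In this toy run, page b starts with the lower score but ends up ranked above page a, because a is demoted and b is boosted after the initial scoring.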

According to Google’s internal documents:

Freshness matters. Google looks at dates in the byline (bylineDate), URL (syntacticDate) and on-page content (semanticDate).
Google vectorizes pages and sites, then compares the page embeddings (siteRadius) to the site embeddings (siteFocusScore).
Google stores domain registration information (RegistrationInfo).
Google has a feature called titlematchScore that is believed to measure how well a page title matches a query.
Google measures the average weighted font size of terms in documents (avgTermWeight) and anchor text.
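The comparison between page embeddings and site embeddings can be pictured with a small Python sketch. The leaked documents name the attributes but do not say how they are computed, so everything here is an assumption: the centroid, cosine_distance, and page_deviation helpers are hypothetical, the two-dimensional vectors are toy data, and averaging page vectors into a site vector is simply one common way to measure how far a page strays from a site's main topic.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    # Average the page vectors component by component to get one
    # vector standing in for the whole site (an assumption of this sketch).
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    # 0.0 means the vectors point the same way; larger means less similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def page_deviation(page: Vector, site_pages: Sequence[Vector]) -> float:
    # How far one page sits from the site's average topic vector.
    return cosine_distance(page, centroid(site_pages))


# Toy embeddings: most of the site is about one topic (first dimension).
site_pages = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
on_topic = [0.9, 0.1]
off_topic = [0.1, 0.9]

print(page_deviation(on_topic, site_pages))
print(page_deviation(off_topic, site_pages))
```

The on-topic page lands very close to the site's average vector, while the off-topic page lands far from it, which is the kind of gap a topical-focus signal would be expected to pick up.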


There is also information regarding:

Human evaluators, also known as Quality Raters. The documents show guidelines that help them assess the quality of search results.
Expertise, Authoritativeness, and Trustworthiness (E-A-T). Google considers these factors when evaluating content. High E-A-T pages tend to rank better.
The evaluation of page quality based on factors like main content, supplementary content, and the overall user experience. Pages with thin or low-quality content may be demoted.
User satisfaction metrics, like Click-Through Rate (CTR), Dwell Time, and Bounce Rate.
Core updates, which Google periodically releases to its core algorithm and which impact rankings.
BERT (Bidirectional Encoder Representations from Transformers), a natural language processing model. BERT helps Google understand context and nuances in search queries.
Local search, where Google considers factors like proximity, relevance, and prominence.

Other information suggests that brand matters to Google more than anything else. Google uses siteAuthority for this.

Google also uses a module called ChromeInTotal, which indicates that it uses data from its Chrome browser for ranking.

It's also revealed that Google whitelists certain domains, like those related to elections and COVID-19, and that smaller sites are flagged with smallPersonalSite, which could make it easy to boost or demote them.

Other information also includes the way Google ranks content and comments on YouTube, among many other things.

Screenshot of the document's page on GitHub.

This “Google API Content Warehouse” contains internal API documentation that explains to employees how the various components that generate Search results work.

The 2,500 pages of documents appear to come from Google’s internal Content API Warehouse, and were released on March 13 on GitHub by an automated bot called yoshi-code-bot.

These documents were shared between Erfan Azimi, CEO and director of SEO for digital marketing agency EA Eagle Digital, and Rand Fishkin, SparkToro co-founder.

It's worth noting that Azimi is not employed by Google.

According to Fishkin, who worked in SEO for more than a decade, the 2,500-page document was shared with him in the hope that reporting on the leak would counter the "lies" that Google employees had shared about how the search algorithm works.

Before long, developers were thoroughly inspecting the leak.

Given how big Google's influence is, and how it's the most consequential system on the internet, dictating which sites live and die and what content on the web looks like, many marketers and SEO specialists are also trying to figure out the inner workings of the search engine's ranking method.

They're trying to learn the factors Google Search takes into consideration when ranking and displaying web results.

However, there is some dispute as to whether these documents were "leaked" or "discovered."

Things like these can reveal some of Google Search's inner workings.

Some suggest that the internal documents were accidentally included in a code review and pushed live from Google’s internal code base, where they were then discovered.

It's worth noting that there is no hard evidence that this "leaked" data is actually from Google Search.

Although ex-Googlers have said that the documents are legit and that they're familiar with the formatting, they could only confirm that the data resembles internal Google information; they could not guarantee that it originated from Google Search.

About a day later, Google confirmed that the leak is the real thing.

The company shed light on the situation, suggesting that the documents are indeed Google's.

"We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information," Google spokesperson Davis Thompson said.

"We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation."
