Web Pages With URLs That Are Too Similar Can Be Considered Duplicates, Says Google

Google is the largest search engine on the web.

That fact alone means most websites need to appeal to the search engine if they want to be seen by users of the web.

Among the many criteria and SEO methods for ranking higher on Google's search engine results pages (SERPs) is avoiding duplicate content, something Google explains in the documentation on its developers website.

But often enough, it's not only the content that is duplicated, but also the URLs.

And according to Google's John Mueller, the search engine also detects potential duplicate URL patterns.

In other words, Google may also see web pages as duplicates, if their URLs are too similar.

Google yellow card

This can happen because Google uses a predictive method to detect duplicate content based on URL patterns, which can lead to web pages being incorrectly flagged as duplicates even when their content isn't.

According to Mueller, when Google crawls pages that share a URL pattern and finds that they contain the same content, it may determine that all other pages with that URL pattern contain the same content as well.
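
To make the idea concrete, here is a rough Python sketch of how that kind of pattern-based prediction could work in principle. It is not Google's actual system; the sample size, the idea of collapsing a "city" query parameter, and the exact-match comparison are all assumptions made purely for illustration.

```python
import hashlib
import urllib.request
from collections import defaultdict
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def fetch(url: str) -> bytes:
    """Download a page body (a real crawler would do far more than this)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def pattern_of(url: str, ignored_param: str = "city") -> str:
    """Collapse one query parameter so URLs that differ only by it share a pattern."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    params.pop(ignored_param, None)
    return urlunparse(parts._replace(query=urlencode(params, doseq=True)))


def predict_duplicates(urls, sample_size=3):
    """Sample a few URLs per pattern; if the sampled pages are byte-identical,
    assume the remaining URLs under that pattern are duplicates too."""
    groups = defaultdict(list)
    for url in urls:
        groups[pattern_of(url)].append(url)

    assumed_duplicates = []
    for members in groups.values():
        sample = members[:sample_size]
        hashes = {hashlib.sha256(fetch(u)).hexdigest() for u in sample}
        if len(hashes) == 1 and len(members) > sample_size:
            # Every sampled page matched, so the rest are skipped on the
            # assumption that they would match as well.
            assumed_duplicates.extend(members[sample_size:])
    return assumed_duplicates
```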

Crawling may not be a difficult task for Google, which has been indexing the web since 1998.

But as the web grows larger, countless new pages are created every single day. Every one of those pages has to be crawled and indexed by Google, if Google is allowed to, that is. And because many active websites also update their old pages, Google has to spend huge resources re-crawling and re-indexing them.

This process is a burden to its servers.

To prevent unnecessary crawling and indexing, Google instead tries to predict when a website's pages may contain similar or duplicate content based on their URLs.

And that without even having to visit them, as Mueller explained:

“What tends to happen on our side is we have multiple levels of trying to understand when there is duplicate content on a site. And one is when we look at the page’s content directly and we kind of see, well, this page has this content, this page has different content, we should treat them as separate pages."

"The other thing is kind of a broader predictive approach that we have where we look at the URL structure of a website where we see, well, in the past, when we’ve looked at URLs that look like this, we’ve seen they have the same content as URLs like this. And then we’ll essentially learn that pattern and say, URLs that look like this are the same as URLs that look like this.”

Mueller added:

“Even without looking at the individual URLs we can sometimes say, well, we’ll save ourselves some crawling and indexing and just focus on these assumed or very likely duplication cases. And I have seen that happen with things like cities."

"I have seen that happen with things like, I don’t know, automobiles is another one where we saw that happen, where essentially our systems recognize that what you specify as a city name is something that is not so relevant for the actual URLs. And usually we learn that kind of pattern when a site provides a lot of the same content with alternate names.”

This predictive method helps Google save resources. But for webmasters and web owners, it can affect many things:

“So with an event site, I don’t know if this is the case for your website, with an event site it could happen that you take one city, and you take a city that is maybe one kilometer away, and the events pages that you show there are exactly the same because the same events are relevant for both of those places."

"And you take a city maybe five kilometers away and you show exactly the same events again. And from our side, that could easily end up in a situation where we say, well, we checked 10 event URLs, and this parameter that looks like a city name is actually irrelevant because we checked 10 of them and it showed the same content."

"And that’s something where our systems can then say, well, maybe the city name overall is irrelevant and we can just ignore it.”

In other words, web pages with similar URLs may be considered duplicates even when their content is not.

As a result, those pages could be left out of Google's crawling activity and never get indexed.

To fix this issue, Mueller suggests that webmasters and web owners limit real cases of duplicate content as much as possible.
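
One practical way to follow that advice is to audit how similar your own URL variants really are before Google has to guess. The short Python sketch below compares the HTML of two pages and reports a similarity score; the example.com URLs and the 0.9 threshold are placeholders for illustration, not anything Google prescribes.

```python
import difflib
import urllib.request


def body(url: str) -> str:
    """Fetch a page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def similarity(url_a: str, url_b: str) -> float:
    """Return a 0..1 similarity ratio between two pages' HTML."""
    return difflib.SequenceMatcher(None, body(url_a), body(url_b)).ratio()


if __name__ == "__main__":
    # Hypothetical pairs of URL variants that differ only by a city parameter.
    pairs = [
        ("https://example.com/events?city=berlin",
         "https://example.com/events?city=potsdam"),
    ]
    for a, b in pairs:
        score = similarity(a, b)
        flag = "near-duplicate" if score > 0.9 else "distinct"
        print(f"{score:.2f}  {flag}  {a} vs {b}")
```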

On its documentation page, Google said that to avoid duplicates, websites should:

  • Use "canonicalization": Websites with multiple pages with largely identical content, can use the canonical tag to apply different weight to the web pages.
  • Use 301 redirects: Also called permanent redirects, these keep Google's crawlers from crawling duplicate pages on a website.
  • Be consistent: Try to keep internal linking consistent.
  • Use top-level domains: Country-specific top-level domains help Google serve the most appropriate version of a document.
  • Syndicate carefully: When a website's content is syndicated on other websites, make sure each copy includes a link back to the original article.
  • Minimize boilerplate repetition: Instead of repeating lengthy boilerplate on every page, include a very brief summary and link to a page with more details.
  • Avoid publishing stubs: Don't publish pages that have no real content.
  • Understand the CMS being used: Webmasters need to know how content is displayed on their website.
  • Minimize similar content: If there are many similar pages, consider expanding each page or consolidating the pages into one.
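
For the first two items, here is a minimal sketch of what canonicalization and a 301 redirect can look like in practice, using Flask purely as an example framework. The routes and the example.com domain are placeholders, and this is an illustration rather than Google's recommended implementation.

```python
from flask import Flask, redirect

app = Flask(__name__)


@app.route("/events")
@app.route("/events/<city>")
def events(city=None):
    # Nearby cities often list the same events; every variant points crawlers
    # at one canonical URL so the duplicates consolidate there.
    return (
        "<html><head>"
        '<link rel="canonical" href="https://example.com/events">'
        "</head><body>Event listings</body></html>"
    )


@app.route("/old-events")
def old_events():
    # A 301 (permanent) redirect sends a retired URL to its replacement,
    # so crawlers stop treating the old address as a separate page.
    return redirect("https://example.com/events", code=301)
```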

Because Google may stop crawling and indexing what it considers duplicates, a site's SEO can suffer.

But it should be noted that Google gives no penalty or negative ranking signal to websites or web pages with duplicates. So no red cards.