On the web, there are countless crawlers that relentlessly scout for content.
When visiting a website, they are expected to obey some rules, and those rules are stated in the site's robots.txt file. This text file, which is usually located at the root of a website, gives web robots instructions about the site.
This convention is called the Robots Exclusion Protocol.
It works by having a crawler fetch the robots.txt file before crawling any of the site's URLs, and then follow the rules placed inside that file.
For example, if a crawler wants to crawl https://www.example.com/, it first needs to read the directives inside https://www.example.com/robots.txt. If the rules state that it shouldn't crawl https://www.example.com/example.html, it won't crawl that page.
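The check described above can be sketched with Python's standard urllib.robotparser module. This is a minimal illustration, not how any particular search engine implements it: the rules in ROBOTS_TXT and the crawler name "ExampleBot" are made up for the example, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, as they might appear in https://www.example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /example.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is an assumed crawler name; it matches the "*" group above.
page_ok = parser.can_fetch("ExampleBot", "https://www.example.com/example.html")
home_ok = parser.can_fetch("ExampleBot", "https://www.example.com/")
print(page_ok, home_ok)  # False True
```

A real crawler would call set_url() and read() on the parser to fetch the live file before requesting any other URL on the site.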
Google has said that it is retiring its support for unsupported and unpublished rules in the Robots Exclusion Protocol, meaning that the search engine will no longer honor the noindex directive listed within robots.txt files.
The news followed Google's previous announcement, which stated that the company is open-sourcing its production robots.txt parser.
Google's proposed standard provides an extensible architecture for rules that are not part of the standard, meaning that a crawler that wants to support its own directives, like "unicorns: allowed", can do so.
While analyzing how robots.txt directives are used, Google noted that unsupported rules such as crawl-delay, nofollow, and noindex were never documented by Google. As a result, their usage in relation to Googlebot is very low.
"Digging further, we saw their usage was contradicted by other rules in all but 0.001% of all robots.txt files on the internet," said Google, further explaining that "These mistakes hurt websites' presence in Google's search results in ways we don’t think webmasters intended."
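A webmaster who wants to know whether a site is affected could scan its robots.txt for the three rules named above. The following is a rough sketch only: the function name find_unsupported_rules and the SAMPLE file are hypothetical, and the line splitting is deliberately simpler than a production parser.

```python
# Rules the article names as never documented by Google.
UNSUPPORTED_RULES = {"noindex", "nofollow", "crawl-delay"}


def find_unsupported_rules(robots_txt: str) -> list:
    """Return a (line_number, rule, value) tuple for each unsupported rule."""
    findings = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank or malformed line
        rule, value = line.split(":", 1)
        rule = rule.strip().lower()  # rule names are matched case-insensitively
        if rule in UNSUPPORTED_RULES:
            findings.append((number, rule, value.strip()))
    return findings


# Hypothetical robots.txt content used only for this example.
SAMPLE = """\
User-agent: *
Disallow: /private/
Noindex: /private/
Crawl-delay: 10
"""

for number, rule, value in find_unsupported_rules(SAMPLE):
    print(f"line {number}: {rule} = {value}")
# line 3: noindex = /private/
# line 4: crawl-delay = 10
```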
According to Google's Webmaster blog post, the options webmasters can use include:
- Noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, this directive is the most effective way to remove URLs from the index when crawling is allowed.
- 404 and 410 HTTP status codes: Both status codes mean that the page doesn't exist, which will drop the URLs from Google's index once they're crawled and processed.
- Password protection: Unless markup is used to indicate subscription or paywalled content, hiding a page behind a login should remove it from Google's index.
- Disallow in robots.txt: Blocking pages from being crawled usually means their content won’t be indexed. While the search engine may also index a URL based on links from other pages, without seeing the content itself, Google aims to make such pages less visible in the future.
- Search Console Remove URL tool: The tool helps remove a URL temporarily from Google's search results.
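As an illustration of the first option, noindex can be sent either as an X-Robots-Tag HTTP response header or as a robots meta tag in the HTML. The sketch below shows both forms; the helper names response_headers and render_page are hypothetical and not tied to any web framework.

```python
from html import escape


def response_headers(noindex: bool = False) -> dict:
    """Return HTTP response headers, optionally carrying X-Robots-Tag."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if noindex:
        headers["X-Robots-Tag"] = "noindex"
    return headers


def render_page(title: str, body: str, noindex: bool = False) -> str:
    """Return a minimal HTML page, optionally carrying a robots meta tag."""
    robots = '<meta name="robots" content="noindex">' if noindex else ""
    return (
        "<!DOCTYPE html>"
        f"<html><head><title>{escape(title)}</title>{robots}</head>"
        f"<body>{body}</body></html>"
    )


# Either form is enough on its own; both are shown here for comparison.
print(response_headers(noindex=True)["X-Robots-Tag"])
print(render_page("Old page", "<p>Retired content</p>", noindex=True))
```

Either form only works when crawling is allowed: if the same URL is disallowed in robots.txt, the crawler never fetches the page and so never sees the noindex instruction.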
Google has been looking to standardize the protocol for years.
With that effort now under way, webmasters relying on the noindex directive in robots.txt should make the suggested changes before September 1, 2019. Those using the nofollow or crawl-delay rules should likewise switch to supported methods for those directives.