Google Indexing Websites

How is this crawling done? Where does Google index a site? How does it index so well that when you connect to the web and perform a Google search, you are immediately presented with a list of related results out of the billions of sites on the web? Which brings us to another question: how does Google locate web pages that exactly match your input query?

All of the above questions can be answered if you imagine “searching the web” as looking something up in an enormous book with an extraordinarily well-engineered index that tells you precisely where everything is located. When you perform a Google search, Google checks its index to determine the most relevant results and returns them to you as a SERP (Search Engine Results Page).

How Google Crawls

Google crawls web pages using Googlebot, Google’s web crawling robot, also known as the “search spider“. In the process, Googlebot discovers new or updated pages that need to be added to the Google index. It also uses a complex algorithm to determine which sites to crawl, how often to crawl them, and how many pages to fetch from each website.

The crawling process starts with a list of URLs recorded during previous crawls, combined with sitemap data provided by webmasters. As Googlebot visits each of these sites, it looks for links on every page (links pointing either to internal pages or to external websites) and adds them to its list for further crawling. That is how new sites, changes to existing sites, and dead links are noted and later used to update the Google index.
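The crawl loop described above can be sketched as a breadth-first traversal over a link graph. The sketch below is a simulation, not Googlebot’s actual implementation: the URLs and page bodies are invented, and an in-memory dictionary stands in for real HTTP fetches so the example stays self-contained.

```python
import re

# A toy "web": URL -> HTML body. These pages and URLs are made up;
# the dict replaces real network fetches for this illustration.
PAGES = {
    "https://example.com/": '<a href="https://example.com/about">About</a>',
    "https://example.com/about": '<a href="https://example.com/">Home</a>'
                                 '<a href="https://other.example/">Partner</a>',
    "https://other.example/": '<a href="https://other.example/gone">Old</a>',
}

def crawl(seed_urls):
    """Breadth-first crawl: seeds come from previous crawls and sitemaps;
    every newly discovered link is queued, and each page is visited once."""
    frontier = list(seed_urls)   # URLs still to visit
    seen = set(frontier)         # avoid re-crawling the same URL
    index_updates = []           # findings handed to the indexer
    while frontier:
        url = frontier.pop(0)
        body = PAGES.get(url)
        if body is None:
            index_updates.append((url, "dead link"))  # note dead links too
            continue
        index_updates.append((url, "crawled"))
        # Extract href targets and queue any we have not seen yet.
        for link in re.findall(r'href="([^"]+)"', body):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index_updates

print(crawl(["https://example.com/"]))
```

Running the sketch visits all three pages and records the missing page as a dead link, mirroring how a crawl both discovers new URLs and flags broken ones.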

Googlebot finds sites only by following links from page to page. Googlebot works really well with text and pictures, but features such as Flash, JavaScript, session IDs, cookies, and frames can create trouble for Googlebot when it crawls your site.

How Google Indexes

Googlebot reads each page it crawls to build a huge index of the words it sees and their positions on each page. It also processes other key elements such as content tags, title tags, header tags, and image ALT attributes.
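An index of “words and their positions on each page” is commonly modeled as an inverted index: a map from each word to a list of (page, position) postings. The snippet below is a minimal sketch of that idea with invented page contents, not Google’s actual data structure.

```python
from collections import defaultdict

# Hypothetical page contents used only for illustration.
DOCS = {
    "page1": "google crawls the web and indexes the web",
    "page2": "the index maps words to pages",
}

def build_index(docs):
    """Map each word to every (page, position) where it occurs."""
    index = defaultdict(list)
    for page, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word].append((page, pos))
    return index

index = build_index(DOCS)
print(index["web"])  # [('page1', 3), ('page1', 7)]
```

Looking up a query word then becomes a single dictionary access rather than a scan of every page, which is what makes searching billions of documents feasible.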

Googlebot can process many content types, but not all of them; rich media files and dynamic pages, for example, may cause problems. During indexing, your web pages are also saved in Google’s cache. When someone enters a query on Google, Google searches the index for the most relevant pages available and returns the results to the user. This relevancy is based on over 200 factors, including the PageRank of the page. All these factors are combined into an “algorithm” that Google keeps secret and changes from time to time.
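To make the idea of “combining many factors into one relevancy score” concrete, here is a deliberately oversimplified ranking sketch. Google’s real algorithm blends 200+ secret signals; this toy version combines just two invented ones, a term-frequency score and a made-up per-page “pagerank” value, with arbitrary weights.

```python
# Hypothetical link-based scores per page (invented for illustration).
PAGERANK = {"page1": 0.8, "page2": 0.4}

DOCS = {
    "page1": "google crawls the web and indexes the web",
    "page2": "the web index maps words to pages",
}

def search(query):
    """Score each page containing the query word and return a ranked list."""
    results = []
    for page, text in DOCS.items():
        words = text.split()
        tf = words.count(query) / len(words)     # on-page relevance signal
        score = 0.7 * tf + 0.3 * PAGERANK[page]  # arbitrary blend of signals
        if tf > 0:
            results.append((score, page))
    # Highest combined score first, like a SERP.
    return [page for score, page in sorted(results, reverse=True)]

print(search("web"))  # ['page1', 'page2']
```

Here “web” appears twice on page1 and once on page2, and page1 also has the higher pagerank value, so page1 ranks first; changing either signal or the weights would reorder the results, which is why algorithm updates reshuffle rankings.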

SEOs from all around the world test and experiment with different strategies to make their web pages highly relevant to their targeted keywords so that they earn higher rankings.