Search engines use Web Spiders and bots to reach your website and extract data. The extracted data then goes through a series of transformations before your website is actually displayed in the SERP (search engine results page). A robots.txt file tells a search engine's Web Spiders which parts of your site they may or may not crawl.
What Happens when Web Spiders Visit Your Site?
Before crawling your site, a Web Spider will ask for your website’s “/robots.txt” file and look for a “User-agent:” line that refers to it specifically. If it finds none, it falls back to the rules listed under the wildcard “User-agent: *”. To tell a robot where it cannot go, the rules for a user-agent are set up as “Disallow:” statements. Here is a preview of how it would look:
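A minimal sketch of such a rule, assuming it is addressed to every robot through the wildcard user-agent (the /test directory is only an illustration):

```text
# Applies to every robot
User-agent: *
# Keep robots out of the /test directory and everything inside it
Disallow: /test/
```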
This command tells a Web Spider to completely ignore the /test directory and everything inside it, so nothing in that directory will be crawled.
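To block the whole site instead, the path after “Disallow:” can be reduced to a single slash. A sketch, again assuming the rule is meant for every robot:

```text
User-agent: *
# A lone slash matches every URL on the site
Disallow: /
```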
This command tells a Web Spider to ignore the whole site. Webmasters or site admins usually reserve it for specific cases, for example when a site holds duplicated or irrelevant content that should not be displayed in the SERP.
When you leave the field after “Disallow:” blank, the command tells a Web Spider that it may crawl the entire site with no limitations.
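As a sketch, a file that places no restrictions on any robot could be as short as this (the wildcard user-agent is an assumption; a named robot would work the same way):

```text
User-agent: *
# Nothing is listed after Disallow, so nothing is blocked
Disallow:
```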
Examples of a Robots.txt File
These examples show how to write robots.txt commands for two common cases.
Assume we do not want any Web Spider from any search engine to crawl the /uploads folder. Then the command would look like:
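A sketch of that rule, using the wildcard “User-agent: *” to address all Web Spiders:

```text
# Applies to the Web Spiders of every search engine
User-agent: *
# Do not crawl anything under /uploads
Disallow: /uploads/
```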
This other example tells only Google's web search bot (called Googlebot) not to crawl the /uploads folder:
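A sketch of that rule, assuming the crawler identifies itself with the user-agent token “Googlebot”:

```text
# Applies only to Google's web search crawler
User-agent: Googlebot
# Do not crawl anything under /uploads
Disallow: /uploads/
```

Robots that do not match this user-agent line are not affected by the rule.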
- You can easily write the commands in a plain text editor such as Notepad.
- To check if robots.txt can be accessed by Web crawlers, open yourdomain.com/robots.txt in your browser.
- To check if the rules in robots.txt are correctly written, you can use an online checker like Google Webmaster Tools (Crawl -> Blocked URLs) and follow the instructions on the page.