What is the Robots.txt File?
The robots.txt file is a simple way to tell robots such as Google, Bing and Yahoo which parts of your site you want them to crawl. It is not the best tool for every job, but it helps them understand how to treat your site.
Robots.txt: The Basics
This file exists specifically to tell search engines and other robots how to interact with your site. It is not strictly required, but having at least a minimal version is recommended. Each command goes on its own line, and a blank line marks the end of a section. The file must sit in the root folder of your site, otherwise the robots will most likely not find it and it will have no effect. When it is in the right place you can view it by visiting your website address followed by /robots.txt, for example examplewebsite.com/robots.txt
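Because the file always lives at the root, its address can be worked out from any page on the site. Here is a minimal sketch in Python. The function name robots_txt_url and the sample page address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the page path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address used only for this example.
print(robots_txt_url("https://examplewebsite.com/blog/2015/post.html?page=2"))
```

Whatever page you start from, the result points at the same single file at the top of the site.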
In the file you are specifically telling the robots three things:
- Where your sitemap is located.
- Which pages they may crawl.
- Which pages they should not crawl.
By default the WordPress robots.txt file tells the search engines to crawl your site but to skip the admin area and the folder that holds the WordPress library and core files.
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
This is a good start, as you do not want these files to show up in the search engines (there is a better way to block them, more on that later). There are a number of other folders you will want to add as well, including the plugins folder, the themes folder and cgi-bin.
User-agent: *
Allow: /
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
The above code is our recommended starting point, although you will probably want to customize it to fit your strategy. Next we will cover how the file works and how to add additional pages.
How to Use the Robots.txt File
Once you understand how the file works you can make whatever changes you see fit. In a nutshell there are four different commands you can add:
User-agent:
Allow:
Disallow:
Sitemap:
The “User-agent:” code defines which bot the rules that follow are for. It is usually set to *, which covers all robots. You can also target individual bots by their user agent. Here are a few examples:
User-agent: Googlebot
User-agent: Bingbot
User-agent: msnbot
User-agent: Yandex
User-agent: Slurp
In order of appearance, the robots addressed above are Google, Bing, MSN, Yandex, and Yahoo (identified as “Slurp”). Most people create a robots.txt file that talks to all of the robots at once, but if you want a robot to follow different rules, or you just want to be extra sure, you can call it out by name. Make sure every user agent line is followed by the instructions meant for it. A list of names with no rules beneath it, like the example above, does nothing.
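To see how named groups behave, you can test a file with Python's built-in urllib.robotparser module. The sketch below is only an illustration: the file contents and the /drafts/ folder are made up, and real crawlers may differ from this parser in small ways.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for Googlebot, one for every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether the named robot may fetch the given path."""
    return parser.can_fetch(agent, "https://examplewebsite.com" + path)


print(allowed("Googlebot", "/drafts/post.html"))  # False: blocked by its own group
print(allowed("Googlebot", "/wp-admin/"))         # True: the * group is not applied
print(allowed("Bingbot", "/wp-admin/"))           # False: falls back to the * group
```

Notice that a robot with its own group follows only that group, so any rule from the * section that should also apply to it has to be repeated under its name.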
The “Allow:” code defines which pages the search engines are allowed to crawl. You typically won’t need it, because they crawl every page on your site by default. It is useful when you want to exclude a folder except for one file:
User-agent: Googlebot
Disallow: /files/
Allow: /files/album.php
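When an Allow and a Disallow rule both match the same URL, Google documents that the longest matching rule wins, with Allow winning a tie. Other robots may resolve the conflict differently. The sketch below shows that longest-match idea under those assumptions. The function is_allowed is a made-up helper, it handles plain path prefixes only (no wildcards), and it is not any search engine's actual code.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply (directive, prefix) pairs from one User-agent group to a path."""
    best_len = -1
    allowed = True  # A path that matches no rule may be crawled.
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # An empty or non-matching rule has no say.
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/files/"), ("Allow", "/files/album.php")]
print(is_allowed("/files/album.php", rules))  # True: the longer Allow rule wins
print(is_allowed("/files/other.php", rules))  # False: only the Disallow matches
```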
The “Disallow:” code defines the pages you want excluded, however the search engines can decide to honor or ignore it. The best way to keep certain pages from being indexed is to add a noindex meta tag to each individual page. A robot can only see that tag if it is allowed to crawl the page, so do not also disallow the page in robots.txt. 999 times out of 1000 this will work, but occasionally the robots will ignore both. In this scenario the only solution is to hide the content behind a login. Remember the robots can do whatever they want, so your job is to hide sensitive information.
In addition, a Disallow rule is not a reliable way to hide pages or files, because the robots.txt file is publicly accessible and anyone can read the paths listed in it. Keep sensitive information behind a login instead.
Lastly, the “Sitemap:” code tells the bots where your sitemap is located. It is always a good idea to have this in your robots.txt file because it improves the chances that the search engines will find and index your entire site. You can also include your video, image, and mobile sitemaps to improve the index rate of those items.
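As a rough illustration of what the Sitemap lines look like and how a robot reads them, the sketch below feeds a made-up file to Python's built-in urllib.robotparser module (its site_maps() method needs Python 3.8 or newer). The sitemap addresses are placeholders, not real files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with two sitemap lines after the crawl rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://examplewebsite.com/sitemap.xml
Sitemap: https://examplewebsite.com/video-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file lists no sitemaps.
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print(url)
```

Each sitemap gets its own line with a full address, and the lines apply to every robot regardless of which User-agent section they sit near.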
Tips and Tricks
The file overall is pretty straightforward, however there are a few tricks to make it easier to use.
To block all URLs that end in .pdf you would add this code:
User-agent: *
Disallow: /*.pdf$
To block all URLs that contain a question mark you would use the following code:
User-agent: *
Disallow: /*?
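The * and $ symbols in these patterns are wildcards that the major search engines understand: * stands for any run of characters and a trailing $ pins the match to the end of the URL. To check which URLs a pattern would catch before publishing it, you could translate it into a regular expression. This is a minimal sketch with made-up helper names (pattern_to_regex, matches) and made-up URLs, not the code any robot actually runs.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # re.match starts at the first character, just as a rule starts at the path's "/".
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/docs/guide.pdf"))             # True
print(matches("/*.pdf$", "/docs/guide.pdf?download=1"))  # False: does not end in .pdf
print(matches("/*?", "/shop?color=red"))                 # True
print(matches("/*?", "/shop"))                           # False: no question mark
```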
To block a certain folder use the following code:
User-agent: *
Disallow: /2015/
To block a certain folder but still allow one file in that folder, use the following code:
User-agent: *
Disallow: /2015/
Allow: /2015/important-file.php
To block your entire site:
User-agent: *
Disallow: /
To block your entire site from a certain search engine, just use that engine’s user agent:
User-agent: baiduspider
Disallow: /