A robots.txt file is a tool you can use to control how search engines see your site. Essentially, it tells search engines how to behave when crawling your content, and it can be extremely valuable for SEO and overall site management.
Some of the things I’ll talk about in this article include:
- What is a robots.txt file?
- Do I need a robots.txt file?
- How to create a robots.txt file
- Some examples of what to include in a robots.txt file
- Using robots.txt is not a guarantee
- robots.txt and WordPress
- robots.txt File Generators
A Brief History of Web Crawlers and Grapes
Humans have short, selective memories. For example, we take Google for granted. Many of us act as if an intelligent directory of (almost) everything on the web has always been available.
But the early days of the web were dark and confusing times, brothers and sisters. There was no intelligent way to find anything.
Oh, we had search engines. WebCrawler was the first one most people heard of, and it was quickly joined by Lycos. They indexed everything they could find on the web, and they worked. In fact, they worked a little too well.
When you are looking for something specific but you have to search through everything in the world, the search results can be…less than useful. If you ever used WebCrawler, Lycos, or any of the other pre-Google search engines (hello AltaVista!), you remember the pages and pages of results that had nothing to do with what you were looking for.
Indexing Everything was Problematic
The problem with indexing everything is it can—and often did—result in useless search results. Searching for “The Grapes of Wrath” was likely to return dozens of pages of results related to grapes (the fruit) and the Star Trek Wrath of Khan movie, but nothing about John Steinbeck.
To make matters worse, spammers identified the lack of sophistication in search engines very early on and took advantage of it. They loaded pages with words and phrases that had nothing to do with the shoddy products or Ponzi schemes they were trying to foist onto unsuspecting web surfers.
The technical hurdles involved in making search results “smarter” were still years from being overcome. So instead, we got things like Yahoo!, which wasn’t a search engine at all, but rather a curated list of websites. Yahoo! didn’t find websites; website owners told Yahoo! where to find them.
If that sounds terribly unscientific and not very inclusive, it’s because it was. But it was the best answer to the chaos and disorder of search engine results that anyone could come up with. Yahoo! became the de facto starting point for most people using the web just because there wasn’t anything better.
The Rise of the Machines
The “robots” we’re talking about are actually computer programs, not frightening man-machines. The programs that index the web are known by many other names as well, including spiders, bots and crawlers. All of the names refer to the same technology.
A couple of Stanford Ph.D. students named Larry and Sergey would eventually figure out how to make search results more relevant. However, there were dozens of other search engines scouring the web in the meantime. Their robots crawled continuously, indexing what they found. But robots aren’t intelligent life forms; they are machines, so they created some problems.
Primarily, they indexed a lot of things that website owners did not want to be indexed. This included private, sensitive, or proprietary information, administrative pages, and other things that don’t necessarily belong in a public directory.
Also, as the number of robots increased, their sometimes negative impact on web server resources increased. Servers in those days weren’t as robust and powerful as they are now. A flurry of spiders and bots all furiously loading pages of a site could slow down the site response time.
The people of the web needed a way to control the robots, and they found their weapon in the humble yet powerful robots.txt file.
What Is a robots.txt File?
The robots.txt file is a plain-text file that contains instructions web crawlers and robots are supposed to follow.
I say “supposed to” because there is nothing requiring a crawler or bot to follow the instructions in the robots.txt file. The major players follow most (but not all) of the rules, but some bots out there will completely ignore the directives in your robots.txt file.
The robots.txt file lives in the root directory of your website (e.g., http://ggexample.com/robots.txt).
If you use subdomains, like blog.ggexample.com or forum.ggexample.com, each subdomain needs its own robots.txt file in its root, because crawlers request the file separately for each hostname.
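Since the file always sits at the root of a hostname, you can work out where a crawler will look for it from any page URL. Here’s a minimal Python sketch of that idea. The helper name robots_url is my own, made up for illustration, and is not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url.

    Hypothetical helper: keeps the scheme and hostname, and swaps the
    path for /robots.txt.
    """
    parts = urlsplit(page_url)
    # The file lives at the root of the host, whatever the page path is.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://ggexample.com/whitepapers/july.pdf"))
print(robots_url("http://blog.ggexample.com/2020/01/post/"))
```

Note that the page on blog.ggexample.com maps to its own robots.txt, separate from the one on ggexample.com.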
Crawlers do a simple text match between the paths in your robots.txt file and the URLs on your site. If a directive in your robots.txt file matches the start of a URL’s path, a well-behaved crawler will follow the rule you’ve set.
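To see that kind of matching in action, here’s a small sketch using urllib.robotparser from Python’s standard library. The crawler name ExampleBot and the test paths are made up for the illustration, and real search engine crawlers may match in more sophisticated ways than this parser does.

```python
from urllib.robotparser import RobotFileParser

# A tiny rule set: every crawler is asked to stay out of /private/.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical paths to check against the rule above.
paths = ["/private/", "/private/report.html", "/Private/", "/public/private/"]

results = {}
for path in paths:
    # can_fetch() returns True when the crawler may request the URL.
    results[path] = parser.can_fetch("ExampleBot", "http://ggexample.com" + path)
    print(path, "allowed" if results[path] else "blocked")
```

Only the first two paths are blocked. /Private/ gets through because the match is case-sensitive, and /public/private/ gets through because the rule has to match from the start of the path.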
Do I Need a robots.txt File?
When there is no robots.txt file present, the search engine crawlers assume they can crawl and index any page that they find on your site. If that’s what you want them to do, you don’t need to create a robots.txt file.
But if there are pages or directories that you don’t want to be indexed, you need to create a robots.txt file. Those include the kinds of pages we talked about earlier: private, sensitive, proprietary, and administrative pages. They may also include “thank you” pages or pages that contain duplicate content, such as printer-friendly versions or A/B testing pages.
How to Create a robots.txt File
A robots.txt file is created the same way any text file is created. Open up your favorite text editor and save a document as robots.txt. You can then upload the file to the root directory of your site using FTP or a cPanel file manager.
Things to note:
- The filename must be robots.txt – all lowercase. If any part of the name is capitalized, crawlers will not read it.
- The entries in your robots.txt file are also case-sensitive. For instance, /Directory/ is not the same as /directory/.
- Use a text editor to create or edit the file. Word processors may add characters or formatting that prevent the file from being read by crawlers.
- Depending on how your site was created, a robots.txt file may already be in the root directory. Check before creating and uploading a new robots.txt so you don’t inadvertently overwrite any existing directives.
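If you’d rather script this step, the notes above can be sketched in Python. This is only an illustration under a few assumptions: write_robots_txt is a name I made up, and the temporary directory stands in for your real site root. The "x" file mode is what keeps an existing robots.txt from being overwritten.

```python
import tempfile
from pathlib import Path


def write_robots_txt(site_root: Path, rules: str) -> bool:
    """Create robots.txt in site_root; return False if one already exists.

    Hypothetical helper for illustration only.
    """
    target = site_root / "robots.txt"  # the filename must be all lowercase
    try:
        # Mode "x" fails instead of overwriting an existing file.
        # Plain UTF-8 text with Unix newlines, no word-processor formatting.
        with target.open("x", encoding="utf-8", newline="\n") as fh:
            fh.write(rules)
    except FileExistsError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # stand-in for your real site root
    print(write_robots_txt(root, "User-agent: *\nDisallow: /private/\n"))
    print(write_robots_txt(root, "User-agent: *\nDisallow: /\n"))
```

The second call returns False and leaves the first file untouched, which is the behavior you want if a robots.txt with existing directives is already in place.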
Some Examples of What to Include
A robots.txt file has a number of variables and wildcards, so there are many possible combinations. We’ll go over some common and useful entries and show you how to add them.
Before we do that, let’s start with an overview of the available directives: “User-agent,” “Disallow,” “Allow,” “Crawl-delay,” and “Sitemap.” Most of your robots.txt entries will use “User-agent” and “Disallow.”
The User-agent directive names the specific web crawler we want to give instructions to. That will usually be Googlebot, Bingbot, Slurp (Yahoo), DuckDuckBot, Baiduspider (a Chinese search engine), or YandexBot (a Russian search engine). There’s a long list of user agents that you can include.
Disallow is probably the most common directive. It is the main command we’ll use to tell a user-agent not to crawl a URL.
Allow is another common element of the robots.txt file. It tells a crawler that it’s okay to access pages or subfolders even though the parent page or subfolder is disallowed. Googlebot and Bingbot both honor it, but not every crawler does.
The Crawl-delay directive dictates how many seconds a crawler should wait between pages. Many crawlers ignore this directive, most notably Googlebot. Google has historically handled Googlebot’s crawl rate through the Google Search Console instead.
Perhaps one of the more important parts of the robots.txt file is “Sitemap.” This specifies the location of the XML sitemaps for your site, which helps search engines discover and index your content.
If you want to be found in sites like Google, Bing or Yahoo, it’s virtually a requirement to have a sitemap.
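Here’s a rough sketch of how Crawl-delay and Sitemap entries look and how a program might read them, using urllib.robotparser from Python’s standard library (site_maps() needs Python 3.8 or later). The ten-second delay and the sitemap URL are values I made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the delay value and sitemap URL are illustrative only.
rules = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /private/

Sitemap: http://ggexample.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Seconds Bingbot is asked to wait between requests.
bing_delay = parser.crawl_delay("Bingbot")
# No Crawl-delay applies to Googlebot in this file, so this is None.
google_delay = parser.crawl_delay("Googlebot")
# Sitemap lines are independent of the User-agent groups.
sitemaps = parser.site_maps()

print(bing_delay, google_delay, sitemaps)
```

Whether a crawler actually honors the delay is up to the crawler. The parser only reports what the file asks for.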
So a robots.txt file starts with:
User-agent: *
The asterisk (*) is a wildcard meaning “all.” Whatever comes next will apply to all crawlers.
User-agent: *
Disallow: /private/
Now we added “Disallow” for the /private/ directory. So robots.txt is telling every crawler not to crawl /private/ on the domain.
If we wanted to disallow only a specific crawler, we would use the name of the crawler in the User-agent line:
User-agent: Bingbot
Disallow: /private/
That tells Bing not to crawl anything in the /private/ directory.
A slash in the Disallow line would tell Bing (or any User-agent you list) that it isn’t allowed to crawl anything on the domain:
User-agent: Bingbot
Disallow: /
You can also tell the crawlers not to crawl a specific file.
User-agent: *
Disallow: /private.html
Another special character is $, which marks the end of a URL, while an asterisk inside a path matches any run of characters. So in the following example, any URL that ends with .pdf would be blocked.
User-agent: *
Disallow: /*.pdf$
That would keep all crawlers from crawling any PDF, for example https://ggexample.com/whitepapers/july.pdf.
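If it helps to think of these patterns as regular expressions, here’s a minimal Python sketch of how * and $ could be interpreted. The function names are my own, and this is one plausible reading of the wildcard rules, not the exact logic any particular search engine uses.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Hypothetical helper: '*' matches any run of characters and a
    trailing '$' anchors the pattern to the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    # re.match starts at the beginning of the path, like a robots.txt rule.
    return pattern_to_regex(pattern).match(path) is not None


print(is_blocked("/*.pdf$", "/whitepapers/july.pdf"))
print(is_blocked("/*.pdf$", "/whitepapers/july.pdf?download=1"))
print(is_blocked("/*.pdf", "/whitepapers/july.pdf?download=1"))
```

The first and third checks come back True, and the second comes back False. That second case shows what $ is for: with it, a URL that carries a query string after .pdf no longer counts as ending in .pdf.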
Multiple Directives in the robots.txt File
So far we’ve done simple two-line robots.txt files, but you can have as many entries in the file as you’d like.
For example, if we wanted to allow Google to crawl everything but not let Bing, Baidu, or Yandex, we would use:
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow: /

User-agent: Baiduspider
Disallow: /

User-agent: YandexBot
Disallow: /
Note that we used a new User-agent line for each crawler. Each User-agent line lists a single crawler.
But – a single User-agent can have multiple Disallow directives:
User-agent: Baiduspider
Disallow: /planes/
Disallow: /trains/
Disallow: /automobiles/
Each Disallow URL must be on its own line.
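A quick way to sanity-check a multi-group file is to run it through urllib.robotparser from Python’s standard library. The sketch below combines pieces of the two examples above into one hypothetical file, and the URLs it checks are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file combining the examples: one group per crawler.
rules = """\
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow: /

User-agent: Baiduspider
Disallow: /planes/
Disallow: /trains/
Disallow: /automobiles/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://ggexample.com"
checks = {
    # An empty Disallow means nothing is off limits for Googlebot.
    "google": parser.can_fetch("Googlebot", site + "/trains/schedule.html"),
    # A bare slash blocks the whole site for Bingbot.
    "bing": parser.can_fetch("Bingbot", site + "/trains/schedule.html"),
    # Baiduspider is blocked from the three listed directories only.
    "baidu_trains": parser.can_fetch("Baiduspider", site + "/trains/schedule.html"),
    "baidu_boats": parser.can_fetch("Baiduspider", site + "/boats/"),
}

for name, allowed in checks.items():
    print(name, "allowed" if allowed else "blocked")
```

Googlebot is allowed everywhere, Bingbot is blocked everywhere, and Baiduspider is blocked from /trains/ but still free to crawl /boats/.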
You can test your robots.txt file in Google Search Console (formerly Google Webmaster Tools).
Using robots.txt Is Not a Guarantee
Adding a Disallow directive to robots.txt is not a guarantee that the file or URL will not be indexed by the search engine. While the “good” search engine crawlers will respect your robots.txt settings, some will not.
Just because they don’t crawl something on your domain doesn’t mean it won’t be indexed.
That’s because crawlers follow links. So if you disallow /whitepapers/july.pdf, the crawlers won’t crawl it. But if someone else links to /whitepapers/july.pdf from their website, the search engine can still learn that the URL exists and index it without ever crawling the file.
robots.txt and WordPress
WordPress creates a “virtual” robots.txt file by default. It’s a simple directive that blocks crawlers from attempting to crawl your admin panel.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
The file /wp-admin/admin-ajax.php is allowed because some WordPress themes use AJAX to add content to pages or posts.
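You might wonder how a crawler decides between a Disallow and an Allow that both match the same URL. Here’s a minimal Python sketch (3.9 or later) that assumes the most specific rule wins, meaning the longest matching path, with Allow winning a tie. That is how Google describes its own handling, but other crawlers may resolve conflicts differently, and the function name is my own.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide whether path may be crawled under one User-agent group.

    Hypothetical helper. rules holds (directive, path_prefix) pairs.
    Assumption: the longest matching prefix wins, and Allow wins ties.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on equal length, prefer Allow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The default WordPress rules shown above.
wordpress_rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(wordpress_rules, "/wp-admin/admin-ajax.php"))
print(is_allowed(wordpress_rules, "/wp-admin/options.php"))
print(is_allowed(wordpress_rules, "/blog/hello-world/"))
```

Under that assumption, admin-ajax.php stays crawlable while the rest of /wp-admin/ is blocked, and pages outside /wp-admin/ aren’t affected at all.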
If you want to customize the WordPress robots.txt file, create a robots.txt as outlined above and upload it to your website root.
Note that your uploaded robots.txt will stop the default WordPress virtual robots.txt from being generated. A site can only have one robots.txt file. So if you need that AJAX Allow directive for your theme, you should add the lines above to your robots.txt.
Some WordPress SEO plugins will generate a robots.txt file for you.
robots.txt File Generators
I’m going to list some robots.txt file generators here, but really, most of them just do disallows. Now that you know how to do that yourself, their usefulness is questionable. But if you dig playing with code generators – and hey, who doesn’t? – here you go.
Visio Spark (This one also has a validator near the bottom of the page.)
Robots.txt Is Useful
Although not all search engine crawlers respect the robots.txt file, it’s still incredibly useful for SEO and maintaining the website. From keeping crawlers out of specific directories and pages to pointing them at your sitemap, a lot can be done from this simple file.