How-to 14 min read September 24, 2019

What is robots.txt and how to set it up correctly

The robots.txt utility file contains indexing rules for your website addressed to specific (or all) search engine bots. Using this file, you can try to close the website from Google indexing. Will that work? Keep your answer in mind; below you can check whether our ideas coincide.
#1

What is robots.txt

robots.txt is a simple text file, its name written in lowercase, located in the root directory of the website:
robots.txt file in the root directory of the website
If placed correctly, it opens at site.ru/robots.txt

* You can use Serpstat's Site Audit to find all website pages that are closed in robots.txt.
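For reference, a minimal robots.txt that keeps the whole website open for all robots might look like this (site.ru is a placeholder domain, and an empty Disallow closes nothing):

User-agent: * # The rules apply to all robots
Disallow: # Nothing is closed from indexing
Sitemap: https://site.ru/sitemap.xml # Absolute link to the sitemap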

Why it's important to manage indexing

If you have a website with bare HTML + CSS, that is, you create each page as an HTML file by hand and don't use scripts or databases (a 100-page website is 100 HTML files on your hosting), then just skip this article. There is no need to manage indexing on such websites.

But if you don't have a simple business card website with a couple of pages (although even such websites have long been built on CMSs like WordPress, MODX, and others) and you work with any CMS (which means programming languages, scripts, a database, etc.), then you will come across such issues as:

  • page duplicates;
  • garbage pages;
  • poor quality pages and much more.

The main problem is that the search engine index ends up with content that shouldn't be there: pages that bring no benefit to people and simply clutter the search results.

There is also such a thing as a crawl budget: the number of pages a search robot can crawl on your website within a certain period. It is determined for each website individually. If a bunch of garbage pages is left open, useful pages can take longer to get indexed because the crawl budget is wasted on the garbage.

What should be closed in robots.txt

Most often, this kind of garbage gets indexed:
1. Search pages. If you are not going to moderate and develop them, close them from indexing.
2. Shopping cart.
3. Thank-you and checkout pages.
4. Sometimes it makes sense to close pagination pages.
5. Product comparisons.
6. Sorting pages.
7. Filters, if it's impossible to optimize and moderate them.
8. Tags, if you can't optimize and moderate them.
9. Registration and authorization pages.
10. Personal account.
11. Wish lists.
12. User profiles.
13. Feeds.
14. Various landing pages created only for promotion and sales.
15. System files and directories.
16. Language versions, if they are not optimized.
17. Printable versions.
18. Blank pages, etc.
You need to close everything that is not useful to the user, not finished, not improved, or duplicated. Study your website and see what URLs it actually generates; a sample set of rules for such pages is sketched below.
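For illustration only, a hedged sketch of how some of these pages might be closed (the paths are hypothetical and depend entirely on how your CMS builds URLs):

User-agent: *
Disallow: /search # On-site search results
Disallow: /cart # Shopping cart
Disallow: /checkout # Checkout and thank-you pages
Disallow: /compare # Product comparisons
Disallow: /*?sort= # Sorting parameters
Disallow: /print/ # Printable versions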

Even if you cannot close 100% of such pages at once, the rest can be closed once indexing reveals them. You cannot predict every problem in advance, and not all of them appear for technical reasons; the human factor also has to be taken into account.

The impact of the robots.txt file on Google

Google is smart enough and decides itself what and how to index. However, if you close the pages in robots.txt right away (before the website gets into the index), there is a lower chance that they will end up in Google. But as soon as links or traffic point to the closed pages, the search engine may decide they should be indexed.

Therefore, it is safer to close pages from indexing through the robots meta tag:
<html>
 <head>
  <meta name="robots" content="noindex,nofollow">
  <meta name="description" content="This page…">
  <title>…</title>
 </head>
 <body>
  …
 </body>
</html>

Online generators

As in the case of sitemap.xml, many people think of online generators. They exist and are free of charge. However, nobody really needs them.

All such a generator does is substitute the words "Disallow" and "User-agent" for you. As a result, you save zero time and get zero benefit.
Searching for the meaning of online robots.txt generators
#2

Structure and proper setup of robots.txt

The file consists of blocks, each addressed to a particular robot (a sketch of this structure follows below):

Robot indication 1

  • Directives to be executed by this robot
  • Additional options

Robot indication 2

  • Directives to be executed by this robot
  • Additional options

Etc.
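A hedged sketch of what this structure looks like in an actual file (the paths here are illustrative only):

User-agent: * # Block 1: rules for all robots
Disallow: /admin

User-agent: Googlebot # Block 2: rules for Google's main robot
Disallow: /search

Sitemap: https://site.ru/sitemap.xml # An additional option read regardless of the blocks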

The order of the directives in the file doesn't matter, because the search engine interprets your rules depending on the length of the URL prefix (from short to long).

It looks like this:

  • /catalog - short;
  • /catalog/tag/product - long.

I also want to note that letter case matters: /Catalog, /CataloG, and /catalog are three different paths.
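Returning to prefix length: as a hedged illustration (the paths are hypothetical), with the rules below /catalog and /catalog/tag stay closed, while the longer, more specific Allow keeps /catalog/tag/product open:

User-agent: *
Disallow: /catalog # Short prefix: closes everything under /catalog
Allow: /catalog/tag/product # The longer prefix wins for matching URLs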

Let's consider each directive.

User-agent directive

The robot for which the rules described below are relevant is indicated here. The most common entry is:

  • User-agent: * (for all robots);

Google's robots:

  • APIs-Google - the user agent that Google APIs use to send push notifications;
  • Mediapartners-Google - the AdSense analyzer robot;
  • AdsBot-Google-Mobile - checks the quality of advertising on web pages designed for Android and iOS devices;
  • AdsBot-Google - checks the quality of advertising on web pages designed for computers;
  • Googlebot-Image - a robot indexing images;
  • Googlebot-News - the Google News robot;
  • Googlebot-Video - the Google Video robot;
  • Googlebot - the main tool for crawling content on the Internet;
  • Googlebot (smartphone) - a robot indexing websites for mobile devices.
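A hedged sketch of how a specific token from this list is addressed (the paths are hypothetical). Note that a robot follows only the most specific group that matches it, so here Googlebot-Image obeys its own block rather than the general one:

User-agent: * # All other robots
Disallow: /search

User-agent: Googlebot-Image # Google's image robot follows only this block
Disallow: /images/private/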

Disallow Directive

It prohibits the specified URLs from indexing. This directive is used in almost every robots.txt, because you more often need to close garbage pages than open separate parts of the website.

Use case:

The website has a search that generates URLs like:

  • /search?q=poiskoviy-zapros
  • /search?q=poiskoviy-zapros-2
  • /search?q=poiskoviy-zapros-3

We see that these URLs share the /search base. We check the website structure to make sure nothing important starts with the same prefix, and then close the entire search from indexing:
Disallow: /search

Host Directive

Previously, it was a pointer to the main mirror of the website. As a rule, the Host directive is indicated at the very end of the robots.txt file:
Disallow: /cgi-bin
Host: site.ru
It is a useless line for Google.

Nowadays it's enough to set up a 301 redirect from the non-main mirror to the main one.

Sitemap Directive

This directive points to the sitemap. Ideally, sitemap.xml should be placed at the root of the website, but if your path is different, this directive will help search bots find it.
Important! Specify the absolute path.
It is indicated as follows:
Sitemap: https://site.ru/site_structure/my_sitemaps1.xml

Clean-param Directive

If your website has dynamic URL parameters that don't affect the content of the page (session identifiers, user IDs, referrers, etc.), you can describe them with this directive so the robot treats such URLs as one page (see the sketch below).
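Keep in mind that Clean-param is understood by Yandex but ignored by Google, where such duplicates are usually handled with rel=canonical instead. A minimal sketch, assuming a session parameter named sid in catalog URLs:

Clean-param: sid /catalog/ # /catalog/ URLs are treated as identical regardless of sid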

Crawl-Delay Directive

If search robots put a heavy load on your server, you can ask them to visit less often. Honestly, I have never used this directive; apparently, it is meant for servers standing on a balcony in someone's apartment and the like.

It might seem that you can make the bot visit the website 10 times per second by specifying a value of 0.1, but in fact, you can't.
This is how it is implemented:

  • Crawl-delay: 2.5 - sets a timeout of 2.5 seconds between requests (see the note below).
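Note that Googlebot ignores Crawl-delay altogether, so in practice the directive only matters for other bots. A minimal sketch placing it inside a group:

User-agent: * # Applies to robots that honor Crawl-delay
Crawl-delay: 2.5 # No more than one request every 2.5 seconds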

Addition

The # symbol is for commenting. Everything after the given character (in the same line) is ignored.

The * character stands for any sequence of characters (including an empty one).

Use case:

You have products, and each product has reviews:

  • Site.ru/product-1/reviews/
  • Site.ru/product-2/reviews/
  • Site.ru/product-3/reviews/

The products differ, but the reviews have the same alias. We cannot close them with Disallow: /reviews, because these URLs don't start with /reviews but with /product-1, /product-2, etc. Therefore, we need to skip the product names:
Disallow: /*/reviews 
The $ symbol means the end of the line. Let's return to the example above to explain how it works. We still need to close the review listings, but leave the individual review pages from George and his friends open:

  • Site.ru/product-1/reviews/George
  • Site.ru/product-1/reviews/Huan
  • Site.ru/product-1/reviews/Pedro

If we use our previous option, Disallow: /*/reviews, George's review will be closed too, along with all of his friends' reviews. But George left a good review!

Solution:
Disallow: /*/reviews/$ 
We put / at the end of the pattern and add $ to indicate that the URL must end with that slash.

Yes, we could reopen George's review using Allow and repeat it two more times for the other two URLs (a sketch of that approach is shown below), but this is not rational: if you need to open 1,000 reviews tomorrow, you won't write 1,000 lines, right?
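For illustration only, the non-scalable Allow approach mentioned above would look roughly like this:

Disallow: /*/reviews # Close all review URLs
Allow: /*/reviews/George # Then reopen each reviewer one by one
Allow: /*/reviews/Huan
Allow: /*/reviews/Pedro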
#3

Five popular robots.txt screw-ups

In fact, there is nothing complicated about setting up robots.txt. It's important to use a validator, know the directives, and mind the letter case. Still, some errors should be avoided:
#1

Empty Disallow

It is convenient to copy Disallow when you write it a bunch of times, but then you forget to delete it and the line remains:
Disallow: 
Disallow without value = permission to index the website.
#2

Wrong file name

Writing Robots.txt, i.e., breaking the case of a single letter. Everything should be in lowercase.

It is always spelled robots.txt
#3

Folder listing

Listing several directories in one Disallow directive, separated by commas or spaces. It doesn't work like that.
Each rule starts on a new line. Either use the * and $ characters to solve the problem, or close each folder separately on its own line, i.e.:

Disallow: /category-1
Disallow: /category-2

#4

File listing

Listing every single file that needs to be closed.
It's enough to close the folder, and all files in it will also be closed for indexing (see the sketch below).
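For example (the folder and file names here are hypothetical), one rule covers everything inside the directory:

Disallow: /docs/ # Also closes /docs/price.pdf, /docs/manual.pdf, and so on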
#5

Ignoring checks

Some people close pages and write complex rules, but never run their robots.txt through a validator and sometimes don't even check particular URLs against their rules.


A striking example of this was at the beginning of the post. There are many cases when optimizers closed entire sections of the website without double-checking the data.
Always use a validator!
#4

Example of Robots.txt

As an example, I will take my blog's robots.txt and comment on each line (the opening User-agent line is assumed here for completeness). The file has not changed since the blog was created, so some entries are outdated:
User-agent: * # Assumed: the rules below apply to all robots
Disallow: /wp-content/uploads/ # Close the uploads folder
Allow: /wp-content/uploads/*/*/ # Open nested picture folders such as /uploads/2019/09/
Disallow: /wp-login.php # Close the file. You don’t need to do this
Disallow: /wp-register.php
Disallow: /xmlrpc.php
Disallow: /template.html
Disallow: /cgi-bin # Close the folder
Disallow: /wp-admin # Close all service folders of the CMS
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /wp-trackback
Disallow: /wp-feed
Disallow: /wp-comments
Disallow: */trackback # Close URLs containing /trackback
Disallow: */feed # Close URLs containing /feed
Disallow: */comments # Close URLs containing /comments
Disallow: /archive # Close archives
Disallow: /?feed= # Close feeds
Disallow: /?s= # Close website search URLs
Allow: /wp-content/themes/RomanusNew/js* # Open only the theme's js folder
Allow: /wp-content/themes/RomanusNew/style.css # Open the style.css file
Allow: /wp-content/themes/RomanusNew/css* # Open only the css folder
Allow: /wp-content/themes/RomanusNew/fonts* # Open only the fonts folder
Host: romanus.ru # Indication of the main mirror; no longer relevant
Sitemap: http://romanus.ru/sitemap.xml # Absolute link to the sitemap
#5

Robots.txt file for popular CMS

I must say right away that you should not blindly copy robots.txt files found on the web; this applies to any files and any information in general.

If your website uses non-standard solutions or additional plugins that change URLs, a copied file may cause problems with indexing or close too much.

Therefore, take a typical robots.txt for your CMS only as a basis and adapt it to your website.

Conclusion

To summarize, the algorithm for working with the robots.txt file is:
1. Create it and place it in the root of the website.
2. Use a typical robots.txt for your CMS as a basis.
3. Add the typical garbage pages described in this article.
4. Crawl your website with any crawler (such as Screaming Frog SEO Spider or Netpeak Spider) to see the overall picture of generated URLs and what you have closed. There may be more garbage pages.
5. Allow the website to be indexed.
6. Monitor Google Search Console for garbage pages and quickly close them from indexing (and not only with robots.txt).
