How-to 18 min read December 21, 2021

How to define robots.txt?

The robots.txt utility file contains crawling rules for specific (or all) search engine bots. Using this file, you can try to close the website from Google indexing. Will it work? Keep your answer in mind; below you can check whether our ideas coincide.
#1

What is robots.txt

Robots.txt is a simple text file with the name written in lowercase, which is located in the root directory of the website:
Robots.txt file in the directory site documents
If you placed it correctly, it will open at site.com/robots.txt.
You can use Serpstat Site Audit to find all website pages that are closed in robots.txt.

How to create a robots.txt file

If you want to create a robots.txt file, you must have access to your domain's root directory. Your hosting provider can tell you whether or not you have the necessary permissions.

The creation and location of the file are the most crucial aspects of it. Write a robots.txt file with any text editor and save it to:

  • The root of your domain: www.yourdomain.com/robots.txt.

  • Your subdomains: page.yourdomain.com/robots.txt.

  • Non-standard ports: www.yourdomain.com:881/robots.txt.

What should be in robots.txt

Let's take a look at what a robots.txt file should consist of:
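For example, here is a minimal robots.txt of the kind discussed below (the /wp-admin/ path is WordPress's service directory):

```
User-agent: *
Disallow: /wp-admin/
```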

User-agent specifies which search engines the directives that follow are intended for.
The * symbol denotes that the instructions apply to all search engines.
Disallow is a directive that tells the user agents what content they must not crawl.
/wp-admin/ is the path that the user agents must not visit.
In a nutshell, such a robots.txt file instructs all search engines to avoid the /wp-admin/ directory.

Why it's important to manage indexing

If you have a website built with bare HTML + CSS, that is, you manually create each page as HTML and don't use scripts or databases (a 100-page website is 100 HTML files on your hosting), then just skip this article. There is no need to manage indexing on such websites.

But if you don't have a simple business card website with a couple of pages (although even such websites have long been built on CMS like WordPress, MODX and others) and you work with a CMS (which means programming languages, scripts, a database, etc.), then you will come across such "trappings" as:

  • page duplicates;
  • garbage pages;
  • poor quality pages and much more.

The main problem is that the search engine index gets things that shouldn't be there: pages that don't bring any benefit to people and simply clutter up the search results.

There is also such a thing as a crawl budget: the number of pages a robot can scan per visit, determined for each site individually. With a bunch of unclosed garbage pages, useful pages can take longer to get indexed because there isn't enough crawl budget left for them.

What should be closed in robots.txt

1. Search pages. If you are not going to moderate and develop them, close them from indexing.
2. Shopping cart.
3. Thank-you and checkout pages.
4. Sometimes it makes sense to close pagination pages.
5. Product comparisons.
6. Sorting.
7. Filters, if it's impossible to optimize and moderate them.
8. Tags, if you can't optimize and moderate them.
9. Registration and authorization pages.
10. Personal account.
11. Wish lists.
12. User profiles.
13. Feeds.
14. Various landing pages created only for promotion and sales.
15. System files and directories.
16. Language versions, if they are not optimized.
17. Printable versions.
18. Blank pages, etc.
You need to close everything that is not useful to the user, not finished, not improved, or is a duplicate.
Even if you cannot close 100% of the problem pages at once, the rest can be closed later, once they show up in the index. You cannot predict all the drawbacks in advance, and they don't always arise from technical issues; you also need to account for the human factor.

The impact of the robots.txt file on Google

Google is smart enough to decide itself what and how to index. However, if you close the pages in robots.txt right away (before the website appears in the index), there is a lower chance that they will get into Google. But as soon as links or traffic lead to the closed pages, the search engine may decide they should be indexed.

Therefore, it is safer to close pages from indexing through the robots meta tag:
<html>
 <head>
  <meta name="robots" content="noindex,nofollow">
  <meta name="description" content="This page….">
   <title>…</title>
 </head>
<body>

Online generators

As with sitemap.xml, many people think of generators. They are available and free of charge. However, nobody really needs them.

All a generator does is type the words "Disallow" and "User-agent" for you. As a result, you save no time and get no benefit.
#2

Structure and proper setup of robots.txt

Robot indication 1

  • Directives for execution by this robot
  • Additional options

Robot indication 2

  • Directives for execution by this robot
  • Additional options

Etc.

The order of the directives in the file doesn't matter, because the search engine interprets your rules depending on the length of the URL prefix (from short to long).

It looks like this:

  • /catalog - short;
  • /catalog/tag/product - long.
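Under this logic, a longer (more specific) prefix overrides a shorter one, regardless of the order of the lines. A hypothetical pair of rules illustrating this, which closes the catalog while keeping tag pages open:

```
User-agent: *
Disallow: /catalog
Allow: /catalog/tag
```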

I also want to note that letter case is important: /Catalog, /CataloG and /catalog are three different paths.

Let's consider the directives.

User-agent directive

This directive specifies the robot for which the rules described below are relevant. The most common entry is:

  • User-agent: * (for all robots);

Google:

  • APIs-Google - the user agent that Google APIs use to send push notifications;
  • Mediapartners-Google - AdSense analyzer robot;
  • AdsBot-Google-Mobile - checks the quality of advertising on web pages designed for Android and iOS devices;
  • AdsBot-Google - checks the quality of advertising on web pages designed for computers;
  • Googlebot-Image - a robot indexing images;
  • Googlebot-News - Google News robot;
  • Googlebot-Video - Google Video robot;
  • Googlebot - the main tool for crawling content on the Internet, including websites for mobile devices.

Disallow Directive

Prohibits crawling of the specified URLs. It is used in almost every robots.txt, because closing garbage pages is needed far more often than opening separate parts of the website.

There is a website search that generates a URL:

  • /search?q=search-query
  • /search?q=search-query-2
  • /search?q=search-query-3

We see that all of them share a common /search base. We check the website structure to make sure nothing important has the same prefix, and then close the entire search from crawling:
Disallow: /search
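You can sanity-check such a rule with Python's standard urllib.robotparser module (note: this parser handles plain prefix rules like this one, but does not support the * and $ wildcards described later; site.com here is just a placeholder domain):

```python
from urllib import robotparser

# Feed the rules directly, as if fetched from site.com/robots.txt
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /search",
])

# The whole internal search is closed for all robots
print(rp.can_fetch("*", "https://site.com/search?q=search-query"))  # False
# Other sections remain open
print(rp.can_fetch("*", "https://site.com/catalog"))  # True
```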

Host Directive

Previously, it was a pointer to the main mirror of the website. As a rule, the Host directive is indicated at the very end of the robots.txt file:
Disallow: /cgi-bin
Host: site.com
It is a useless string for Google.

Now it's enough to have a 301 redirect from the non-main mirror to the main one.

Sitemap Directive

This is an indication of the path to the website map. Ideally, sitemap.xml should be placed at the root of the website. But if your path is different, this directive will help search bots find it.
Important! Specify the absolute URL.
It is indicated as follows:
Sitemap: https://site.com/site_structure/my_sitemaps1.xml

Clean-param Directive

If your website has dynamic parameters that don't affect the content of the page (session identifiers, users, referrers, etc.), you can describe them with this directive.
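For illustration (Clean-param is a Yandex-specific directive; the sid parameter name and /catalog/ path here are hypothetical), a rule stripping a session identifier from catalog URLs could look like:

```
User-agent: *
Clean-param: sid /catalog/
```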

Crawl-Delay Directive

If search robots load your server too heavily, you can ask them to visit less often. Honestly, I have never used this directive. Apparently, it is meant for hosting that runs on a balcony in an apartment, or something like that.

It seems that you can make the bot visit the website 10 times per second by specifying a value of 0.1, but in fact, you can't.
This is how it is implemented:

  • Crawl-delay: 2.5 - Set a timeout of 2.5 seconds

Addition

The # symbol is for commenting. Everything after the given character (in the same line) is ignored.

The * character is any sequence of characters.

Use case:

You have products, and each product has reviews on it:

  • site.com/product-1/reviews/
  • site.com/product-2/reviews/
  • site.com/product-3/reviews/

The products differ, but the reviews have the same alias. We cannot close the reviews using Disallow: /reviews, because the URL path doesn't start with /reviews but with /product-1, /product-2, etc. Therefore, we need to skip the product names:
Disallow: /*/reviews 
The $ symbol means the end of the line. Let's return to the example above to explain how it works. We still need to close the reviews, but leave open the review pages of George and his friends:

  • site.com/product-1/reviews/George
  • site.com/product-1/reviews/Huan
  • site.com/product-1/reviews/Pedro

If we use our option with Disallow: /*/reviews, George's review will be closed, as well as all his friends'. But George left a good review!

Solution:
Disallow: /*/reviews/$ 
We put / at the end of the URL and add $ to indicate that the rule matches only URLs ending with that slash.

Yes, we could just get back to George's review using Allow and repeat two more times for two other URLs, but this is not rational, because if you need to open 1,000 reviews tomorrow, you won't write 1,000 lines, right?
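For comparison, that non-scalable Allow-based alternative would look like this, reopening each review page by hand:

```
User-agent: *
Disallow: /*/reviews/
Allow: /product-1/reviews/George
Allow: /product-1/reviews/Huan
Allow: /product-1/reviews/Pedro
```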
#3

Five popular screw-ups in robots.txt

In fact, there is nothing complicated about setting up robots.txt. It's important to use a validator, know the directives, and watch the letter case. However, some common errors should be avoided:
#1

Empty Disallow

It is convenient to copy Disallow when you write it many times, but then you forget to delete the last copy and an empty line remains:
Disallow: 
Disallow without value = permission to index the website.
#2

Name error

Writing Robots.txt, i.e., breaking the case of a single letter. Everything should be in lowercase: the file is always spelled robots.txt.
#3

Folder listing

Listing various directories in one Disallow directive, separated by commas or spaces. It doesn't work like that.
Each rule starts on a new line. Either use the * and $ characters to solve the problem, or close each folder separately on a new line:

Disallow: /category-1
Disallow: /category-2

#4

File listing

Listing every file that needs to be closed, one by one.
It's enough to close the folder, and all files in it will also be closed for crawling.
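A hypothetical illustration (the /docs/ folder and file names are invented):

```
# Redundant: listing files one by one
Disallow: /docs/file-1.pdf
Disallow: /docs/file-2.pdf

# Enough: close the whole folder
Disallow: /docs/
```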
#5

Ignoring checks

Some people close pages and use complex rules, but at the same time ignore validators for checking their robots.txt, and sometimes cannot even verify certain options in it.
A striking example of this was at the beginning of the post: there are many cases when optimizers closed entire sections of a website without double-checking the data.
Always use a validator!
#4

Example of robots.txt

I will take my blog and its robots.txt as an example and comment on each line. This file has not changed since the creation of the blog, so some points are outdated:
User-agent: * # Rules for all robots
Disallow: /wp-content/uploads/ # Close the folder
Allow: /wp-content/uploads/*/*/ # Open folders of pictures of the type /uploads/close/open/
Disallow: /wp-login.php # Close the file. You don't need to do this
Disallow: /wp-register.php
Disallow: /xmlrpc.php
Disallow: /template.html
Disallow: /cgi-bin # Close folder
Disallow: /wp-admin # Close all service folders in CMS
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /wp-trackback
Disallow: /wp-feed
Disallow: /wp-comments
Disallow: */trackback # Close URLs containing /trackback
Disallow: */feed # Close URLs containing /feed
Disallow: */comments # Close URLs containing /comments
Disallow: /archive # Close archives
Disallow: /?feed= # Close feeds
Disallow: /?s= # Close the URL website search
Allow: /wp-content/themes/RomanusNew/js* # Open only js file
Allow: /wp-content/themes/RomanusNew/style.css # Open style.css file
Allow: /wp-content/themes/RomanusNew/css* # Open only css folder
Allow: /wp-content/themes/RomanusNew/fonts* # Open only fonts folder
Host: romanus.com # Indication of the main mirror is no longer relevant
Sitemap: http://romanus.com/sitemap.xml # Absolute link to the site map
#5

Robots.txt file for popular CMS

I must say right away that you should not blindly use robots.txt files found on the web. This applies to any files and any information.

If your site uses non-standard solutions or additional plugins that change URLs, blindly copied rules may cause problems with indexing or close too much.

Still, as a starting point, you can review and adapt the typical robots.txt for the following CMS:

  • WordPress
  • Joomla
  • Joomla 3
  • DLE
  • Drupal
  • MODx EVO
  • MODx REVO
  • Opencart
  • Webasyst
  • Bitrix

FAQ. Common questions about robots.txt

Does Google respect robots.txt?

Google respects standard robots.txt rules such as Disallow and Allow. However, according to the company, Googlebot no longer obeys unofficial robots.txt directives relating to indexing: the noindex rule in robots.txt was previously honored by Google, but since it was never an official directive, it is no longer supported.

How to upload a robots.txt file?

The robots.txt file saved on your computer must be uploaded to the site and made available to search robots. There is no specific tool for this, since the upload method depends on your site and server. Contact your hosting provider or look for their documentation. After uploading the robots.txt file, check whether robots have access to it and whether Google can process it.

Conclusion

To summarize, the algorithm for working with the robots.txt file is:
1. Create it and place it in the root of the site.
2. Use a typical robots.txt for your CMS as a basis.
3. Add to it the typical pages described in the article.
4. Crawl your website with any crawler (such as Screaming Frog SEO Spider or Netpeak Spider) to see the overall picture of the URLs and what you have closed. There may be more garbage pages.
5. Allow the website to be indexed.
6. Monitor Google Search Console for garbage pages and quickly close them from indexing (and not only with robots.txt).
Join us on Facebook and Twitter to follow our service updates and new blog posts :)
