
How To Specify Robots.txt Directives For Google Robots

Robots.txt is a plain text file with the ".txt" extension that contains recommendations for the robots of various search engines. The file is located in the root folder of your web resource.

Why a website should have robots.txt

Robots.txt commands are directives that allow or disallow scanning particular sections of the web resource. With this file, you can allow or limit scanning of your web resource or its particular pages by search engine robots. Here's an example of how directives work on a website:
Why do you need robots.txt file
The picture shows that scanning of certain folders, and sometimes individual files, is disallowed for search engine robots. The directives in the file are advisory and can be ignored by a search robot, although robots normally take them into account. Google's documentation also warns webmasters that alternative methods are sometimes required to prevent indexing.
The limitations of robots.txt
What pages should be closed?

Normally, technical pages are closed from indexing. Cart pages, personal data, and customer profiles should also be kept out of the index, and the robots.txt file is one way to restrict crawlers' access to them.
How is the file created, and what directives are used?

The document is created with WordPad or Notepad++ and must have the ".txt" extension. Add the necessary directives, save the document, and upload it to the root of your website. Now let's talk more about the contents of the file.

There are two types of commands:

  • allow scanning (Allow);
  • close scanning access (Disallow).

The following things are additionally specified:

  • the crawl-delay;
  • the host;
  • map of the website pages (sitemap.xml).
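Put together, a minimal robots.txt using these elements might look like this (the domain and paths are placeholders):

```
User-agent: *
Allow: /
Disallow: /admin/
Crawl-delay: 5
Host: https://example.com
Sitemap: https://example.com/sitemap.xml
```

Note that Google ignores the Crawl-delay and Host directives; they are honored only by some other search engines.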

You can use Serpstat Site Audit tool to find all website pages that are closed in robots.txt.

Characters in robots.txt

The slash "/" stands for the entire website: "Disallow: /" blocks the whole site.

The symbol "*" means any character sequence. Thus, you can specify that scanning is allowed up to a certain folder or file:
Disallow: */trackback
The symbol "$" means the end of the line.
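For instance, the following hypothetical rules use both wildcards: the first blocks any URL containing /trackback, while the second blocks only URLs that end in ".pdf":

```
User-agent: *
Disallow: */trackback
Disallow: /*.pdf$
```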

You can address a particular search engine bot via the User-agent line plus the name of the bot to which the rule applies, for example:
User-agent: Googlebot
"User-agent: *" means addressing all bots, Google's and others'. When addressing a bot, you need to know its specifics, as each crawler is designed for certain tasks. The specifics of the most used Google crawlers are described below.

Checking robots.txt in Google

The names used for the Google crawlers:

  • Googlebot is the main crawler indexing pages of a website;
  • Googlebot-Image scans pictures and images;
  • Googlebot-Video scans video content;
  • AdsBot-Google analyzes the quality of advertising published on desktop pages;
  • AdsBot-Google-Mobile analyzes the quality of advertising published on mobile website pages;
  • Googlebot-News assesses pages before they go to Google News;
  • AdsBot-Google-Mobile-Apps assesses the quality of advertising in Android applications, similarly to AdsBot.

Having learned the names of the search robots and the management commands, let's compose an example document. Let's address Google's search bot and completely disallow scanning of the website. The command will look like this:

User-agent: Googlebot
Disallow: /

Now, as another example, let's allow all bots to index the website:

User-agent: *
Allow: /

Let's add the link to the sitemap and the host of the website. As a result, we'll get the following robots.txt for an HTTPS website:

User-agent: *
Allow: /
Host: https://example.com
Sitemap: https://example.com/sitemap.xml

Thus, we reported that our site could be scanned without any restrictions, and we also indicated the host and sitemap. If you need to limit scanning, use the Disallow command. For example, block access to the technical components of the website:

User-agent: *
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /feed/
Disallow: /cgi-bin
Disallow: /wp-admin
Host: https://example.com
Sitemap: https://example.com/sitemap.xml

If your website uses the HTTP protocol instead of HTTPS, don't forget to change the contents of the lines.

Here's an example of the real file for a web resource:
Example of robots.txt file
With this file, we told all search engines not to crawl the specified folders. Remember that the document is case sensitive: folders with the same character set are not the same if capitalized differently, for example "example", "Example", and "EXAMPLE". A common mistake among beginners is to capitalize the file name, writing "Robots.txt" (which is wrong) instead of "robots.txt".

Checking the accuracy of robots.txt

The document must be located only in the root folder. Placing it in "Admin", "Content", or similar subfolders is wrong: the system will not take the file into account, and all the work will be in vain. Make sure the document is uploaded correctly by going to the main page of the site and adding "/robots.txt" to the website address. Then press Enter and see whether the page loads. The link will look like this: yoursiteaddress.com/robots.txt.
Checking the robots.txt file
The 404 error page returned in response means that you saved the file incorrectly. There are built-in tools from Google that you can use to verify the correct operation of the directives themselves. For instance, the Search Console can verify file accuracy.
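You can also sanity-check directives locally before uploading. Here is a minimal sketch using Python's standard urllib.robotparser module; the rules and URLs below are made-up placeholders. Note that urllib.robotparser follows the original robots.txt specification and does not understand the "*" and "$" wildcards described above, so use it only for plain path prefixes:

```python
from urllib import robotparser

# Hypothetical rules mirroring the example above (paths are placeholders)
rules = """\
User-agent: *
Disallow: /wp-login.php
Disallow: /wp-admin
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse() accepts an iterable of lines

print(rp.can_fetch("*", "https://example.com/wp-admin"))   # False: blocked
print(rp.can_fetch("*", "https://example.com/blog/post"))  # True: allowed
```

To check a live file instead, call rp.set_url("https://example.com/robots.txt") followed by rp.read().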

Go to the panel and select the robots.txt Tester tool in the left-hand menu:
robots.txt tester in Google Search Console
In the window that opens, you can paste the copied text from the file and start scanning. Documents that are not yet uploaded to the root folder of the website are inspected this way.
How to test robots.txt file in Google Search Console
Check the correctness of the existing "robots.txt" document by specifying the path to it as shown in the screenshot:
robots.txt correctness check in Google Search Console

Conclusion

Robots.txt is needed to limit scanning of pages that should not appear in the index, such as purely technical pages. To create such a document, you can use WordPad or Notepad++.

Write down which search robots you are addressing and give them commands as described above.

Next, verify the file's accuracy through Google's built-in tools. If no errors occur, save the file to the root folder and once again check its availability by opening the link yoursiteaddress.com/robots.txt. If the link works, everything is done correctly.

Remember that the directives are advisory, and you need to use other methods to completely ban page indexing.

This article is a part of Serpstat's Checklist tool
Checklist is a ready-to-do list that helps to keep reporting the work progress on a specific project. The tool contains templates with an extensive list of project development parameters where you can also add your own items and plans.
Try Checklist now
