SEO 10 min read April 18, 2017

How to Carry Out Keyword Clustering via Serpstat


Dmitriy Mazuryan
SEO Specialist at Netpeak Agency
Keyword clustering is an essential part of creating a semantic core of the site. Doing such work manually via Microsoft Excel or Google Sheets takes a lot of time. In this article, I'll share my personal clustering algorithm that will help speed up the clustering process.


What is keyword clustering?

Keyword clustering is a practice SEO specialists use to segment target search terms into groups (clusters) relevant to each page of the website. The keywords should be grouped based on the properties of objects these keywords describe and the context of their use.

Unfortunately, there are no open databases that contain this information. Even the Knowledge Graph API cannot cope with this task. Keyword clustering is therefore carried out on the basis of SERPs, by comparing the search results for different keywords.


Downsides of using common keyword clustering algorithms

There are 3 main algorithms of keyword clustering:

  • Soft
  • Moderate
  • Hard

The hard one is used the most, so we'll focus on it. Here is how it works:
1. A minimum number of matching URLs required to combine keywords into a group is set;
2. Keywords are sorted by frequency in descending order;
3. The keywords are compared starting with the most frequent one;
4. If the number of URLs two keywords share in their search results is greater than or equal to the minimum, the phrases are grouped together.
Here is a visual representation of the algorithm's work:
For more information about keyword clustering and standard algorithms visit Wikipedia.
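The four steps above can be sketched as follows. This is only one possible reading of the description: each candidate is compared with the most frequent keyword of the group, and the keywords, URLs, and frequencies are made up for illustration.

```python
def cluster_by_shared_urls(serps, volumes, min_shared=3):
    """Greedy grouping by the number of URLs two keywords share in their SERPs.

    serps:   {keyword: list of URLs in its search results}
    volumes: {keyword: search frequency}
    """
    # Step 2: sort keywords by frequency, most frequent first.
    ordered = sorted(serps, key=lambda kw: volumes[kw], reverse=True)
    assigned = set()
    clusters = []
    # Step 3: start comparing from the most frequent keyword.
    for seed in ordered:
        if seed in assigned:
            continue
        group = [seed]
        assigned.add(seed)
        for other in ordered:
            if other in assigned:
                continue
            shared = set(serps[seed]) & set(serps[other])
            # Step 4: group the phrases if they share enough URLs.
            if len(shared) >= min_shared:
                group.append(other)
                assigned.add(other)
        clusters.append(group)
    return clusters


# Hypothetical data: URLs are shortened to single letters.
serps = {
    "buy laptop": ["a", "b", "c", "d", "e"],
    "laptop price": ["a", "b", "c", "x", "y"],
    "laptop repair": ["r", "s", "t", "u", "v"],
}
volumes = {"buy laptop": 1000, "laptop price": 500, "laptop repair": 300}

clusters = cluster_by_shared_urls(serps, volumes, min_shared=3)
print(clusters)
```

Real tools differ in whether a candidate must match only the group's leading keyword or every keyword already in the group, so treat the comparison rule here as an assumption.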

This algorithm has a significant disadvantage: clusters are formed by the minimum number of matches. To prove it, here is an example where the algorithm goes wrong. Let's take 3 keywords with a connection strength of 3, and here is what we get:
As you can see, keyword #1 and keyword #2 end up in the same cluster. Keyword #3 is then either grouped with keyword #1, although the two have no mutual URLs, or it forms a new cluster without keyword #2. Either way, the clustering won't be precise.
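The situation can be recreated with a few lines of code. The SERPs below are hypothetical and chosen so that keyword #2 shares three URLs with each neighbour while keyword #1 and keyword #3 share none.

```python
from itertools import combinations

# Hypothetical SERPs: u1..u12 stand in for real URLs.
serps = {
    "keyword 1": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "keyword 2": ["u4", "u5", "u6", "u7", "u8", "u9"],
    "keyword 3": ["u7", "u8", "u9", "u10", "u11", "u12"],
}


def shared_count(a, b):
    """Number of URLs that appear in the SERPs of both keywords."""
    return len(set(serps[a]) & set(serps[b]))


# Pairwise overlap for every pair of keywords.
overlaps = {pair: shared_count(*pair) for pair in combinations(serps, 2)}
for (a, b), count in overlaps.items():
    print(f"{a} / {b}: {count} shared URLs")
```

With a minimum of 3 matches, both "keyword 1 / keyword 2" and "keyword 2 / keyword 3" qualify, yet "keyword 1 / keyword 3" has no overlap at all, so any single cluster built from these pairs mixes unrelated keywords.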

That's why I use my own clustering algorithm, based on the keywords' connection strength, which depends on the specifics of the search results.


How does my algorithm work?

1. Every URL has its own weight depending on its position in the SERP. The weights are identical to those Serpstat uses when calculating CTR based on positions.
2. The connection strength of two keywords is the sum of the weights of their mutual URLs, and the weight of a mutual URL is the sum of that URL's weights within the cluster.
3. Each cluster has two parts: the main and the additional one. The main part is formed from the keywords with the maximum connection strength, provided it is more than 2.5. The additional part is formed from the keywords whose connection strength is not the maximum one but is also more than 2.5.
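A minimal sketch of the connection strength calculation, under two loudly stated assumptions: the position weights below are placeholders (the actual Serpstat CTR values are not given here), and the weight of a mutual URL is taken to be the sum of its weights in the two SERPs being compared.

```python
# Placeholder weights for positions 1..10; NOT the real Serpstat CTR values.
POSITION_WEIGHTS = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.25, 0.2, 0.15, 0.1]


def url_weights(serp):
    """Map every URL in a SERP to the weight of its best position."""
    weights = {}
    for position, url in enumerate(serp[:len(POSITION_WEIGHTS)]):
        # Keep the first (highest) position if a URL appears twice.
        weights.setdefault(url, POSITION_WEIGHTS[position])
    return weights


def connection_strength(serp_a, serp_b):
    """Sum of the weights of mutual URLs.

    Assumption: a mutual URL contributes its weight in both SERPs.
    """
    weights_a = url_weights(serp_a)
    weights_b = url_weights(serp_b)
    mutual = weights_a.keys() & weights_b.keys()
    return sum(weights_a[url] + weights_b[url] for url in mutual)


# Hypothetical SERPs: two mutual URLs (u1 and u2) in swapped positions.
serp_a = ["u1", "u2", "u3"]
serp_b = ["u2", "u1", "u9"]
strength = connection_strength(serp_a, serp_b)
print(round(strength, 2))
```

Two keywords that share only low-ranked URLs get a small strength under this scheme, while a pair that shares its top results quickly passes the 2.5 threshold.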
This algorithm makes keyword clustering more accurate and, at the same time, shows the connection strength between each pair of keywords in the cluster. As a result, we get a connection strength matrix from which the keyword clusters are formed. Here is an example of what such a matrix looks like:
Based on this matrix we get two clusters where keyword #1 and keyword #3 form the basis:
Keyword #1 and keyword #2 form the basis of the main part of cluster #1 because they have the highest connection strength with each other. The additional part of this cluster includes keyword #4, because the connection strength between keyword #1 and keyword #4 isn't the maximum one for keyword #4 but is more than 2.5.

Cluster #2 has only the main part: keyword #4 has its maximum connection strength there, while keyword #5 is most strongly connected to keyword #4, which already forms the basis of cluster #2.
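The way clusters are read off a connection strength matrix can be sketched like this. It is an interpretation, not the exact procedure: the matrix values are invented, the basis of a cluster is assumed to be its first keyword in the input order, and keywords are linked into a main part through their strongest connection above 2.5.

```python
from collections import defaultdict


def build_clusters(keywords, strength, threshold=2.5):
    """keywords: ordered list; strength: {(kw_a, kw_b): connection strength}."""
    # Keep only connections stronger than the threshold.
    strong = defaultdict(dict)
    for (a, b), value in strength.items():
        if value > threshold:
            strong[a][b] = value
            strong[b][a] = value

    # Strongest partner of every keyword that has a strong connection.
    best = {kw: max(links, key=links.get) for kw, links in strong.items()}

    # Main parts: keywords joined through their strongest connections.
    parent = {kw: kw for kw in best}

    def find(kw):
        while parent[kw] != kw:
            kw = parent[kw]
        return kw

    for kw, partner in best.items():
        parent[find(kw)] = find(partner)

    groups = defaultdict(list)
    for kw in keywords:
        if kw in best:
            groups[find(kw)].append(kw)

    clusters = []
    for members in groups.values():
        basis = members[0]  # assumption: first keyword in the input order
        # Additional part: strong, but not strongest, connections to the basis.
        additional = [kw for kw in keywords
                      if kw in strong[basis] and kw not in members]
        clusters.append({"basis": basis, "main": members, "additional": additional})
    return clusters


keywords = ["kw1", "kw2", "kw3", "kw4", "kw5"]
# Invented matrix values for illustration only.
strength = {
    ("kw1", "kw2"): 4.0,
    ("kw1", "kw4"): 2.8,
    ("kw3", "kw4"): 3.5,
    ("kw3", "kw5"): 3.0,
    ("kw1", "kw3"): 1.0,
}
clusters = build_clusters(keywords, strength)
for cluster in clusters:
    print(cluster)
```

In this made-up matrix, kw4 lands in the main part of the second cluster and in the additional part of the first one, because its link to kw1 is above 2.5 but is not its strongest link. Keywords with no connection above the threshold are left out of every cluster.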

I'll try to explain it by showing the weight of every URL in brackets.
In this case, the connection strengths matrix is the following:
Keyword #2 and keyword #3 form the cluster's basis but keyword #3 still enters the cluster's additional part with keyword #1.

Using connection strength during clustering takes into account not only the number of mutual URLs but also the specifics of search engines. This produces higher-quality keyword clusters, which is useful when designing a site's structure, writing an article, or working on a PPC campaign.

This algorithm can be improved to make clustering even more accurate:

1. Decreasing the weight of the main (home) pages

The weight of home pages is usually much higher than that of other pages because of their structure and the number of links pointing to them. To see for yourself, take the top 1,000 sites with the highest Serpstat visibility and compare the number of keywords that the home page and the other pages rank for.

2. Decreasing the connection strength in case there are several pages of the same website in top 5.

If niche leaders manage to get several different pages of their site into the top, the connection strength between these keywords is not that high.
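Both adjustments could look roughly like this. The two reduction factors are arbitrary placeholders, and the function names are hypothetical; only the idea of lowering the weight and the strength comes from the text above.

```python
from collections import Counter
from urllib.parse import urlsplit

HOME_PAGE_FACTOR = 0.5        # placeholder: how much a home page's weight is reduced
REPEATED_DOMAIN_FACTOR = 0.8  # placeholder: reduction when a site repeats in the top 5


def is_home_page(url):
    """True if the URL has no path beyond '/' and no query string."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def adjusted_weight(url, weight):
    """Improvement 1: lower the position weight of home pages."""
    return weight * HOME_PAGE_FACTOR if is_home_page(url) else weight


def has_repeated_domain(serp, top_n=5):
    """True if any domain occupies more than one of the first top_n results."""
    domains = Counter(urlsplit(url).netloc for url in serp[:top_n])
    return any(count > 1 for count in domains.values())


def adjusted_strength(strength, serp_a, serp_b):
    """Improvement 2: lower the strength if a site repeats in either top 5."""
    if has_repeated_domain(serp_a) or has_repeated_domain(serp_b):
        return strength * REPEATED_DOMAIN_FACTOR
    return strength


# Hypothetical SERPs on the reserved example.* domains.
serp_a = ["https://example.com/", "https://example.com/blog/post", "https://example.org/a"]
serp_b = ["https://example.org/a", "https://example.net/b"]
print(adjusted_weight("https://example.com/", 1.0))
print(adjusted_strength(4.0, serp_a, serp_b))
```

Suitable factor values would have to be tuned on real data; nothing here suggests specific numbers.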


Script based on Serpstat's database

Serpstat's database contains tens of millions of Google top results. I created a small script for keyword clustering based on this algorithm and the Serpstat API.

You've already seen this script in my last article "Expired domains' Search: how to find drops and identify potential drops". I just added the clustering feature.

  • Input is a phrase, a domain, or a page for which the script will get phrases from the Serpstat database;
  • Input Type: here you select the type of input the script will run with. It determines which Serpstat API function is used;
  • Search region is the search engine database for which the analysis will be carried out. For example, for Google US you need to set g_us. The entire list of available search engines can be found here;
  • Search limits: the maximum number of phrases from the organic results that will take part in the analysis;
  • Pagination Size: the parameter required for pagination when working with the Serpstat API, because the keywords, url_keywords, and domain_keywords functions return a maximum of 1000 phrases per request. If your search limit is less than 1000, it's better to use the same page size as the search limit;
  • Max volume is the maximum frequency of the phrases that will take part in the analysis. If you want only low-volume keywords, you can set 20. For example, to search for blogs and satellites I set a maximum frequency of no more than 80;
  • API token: here you need to enter your token for API access. It can be found on your profile page;
  • Function: this script implements a number of functions.
○ Find drops via WHOIS: a table of unique domains based on the WHOIS data;

○ Get list of domains. You may just copy this list and work with it as you want;

○ Find relevant forums: a slightly improved search for topical forums;

○ Clustering.
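To make the relationship between the search limit and the pagination size concrete, here is a small sketch. The class and field names are hypothetical and do not come from the script or from the Serpstat API; only the 1000-phrase cap per request is taken from the description above.

```python
from dataclasses import dataclass
from math import ceil

MAX_PAGE_SIZE = 1000  # cap per request mentioned in the parameter list


@dataclass
class ScriptSettings:
    """Hypothetical container mirroring the form fields."""
    input_value: str
    input_type: str              # e.g. "phrase", "domain" or "page" (labels assumed)
    search_region: str = "g_us"
    search_limit: int = 1000
    page_size: int = 1000
    max_volume: int = 80
    api_token: str = ""

    def page_requests(self):
        """Return (page number, phrases to request) pairs covering the search limit."""
        size = min(self.page_size, self.search_limit, MAX_PAGE_SIZE)
        pages = ceil(self.search_limit / size)
        return [
            (page, min(size, self.search_limit - (page - 1) * size))
            for page in range(1, pages + 1)
        ]


settings = ScriptSettings("clash of clans", "phrase", search_limit=2500, page_size=1000)
plan = settings.page_requests()
print(plan)
```

When the search limit is below 1000, the plan collapses to a single request of exactly that size, which matches the advice to use the same page size as the search limit.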

The clustering process takes quite a long time. That is one of the reasons why the results are not displayed in Google Sheets.

After a while you'll see the spreadsheet where the yellow lines stand for the clusters' additional parts.

Here you see the result for "Clash of Clans" keyword. If I were writing an article about the Clash of Clans game strategies, I would surely take into account that the keywords "strategy" and "tips" have a significant connection strength. Classical algorithms are unlikely to let you know this.
I prefer to run the keywords from Serpstat database through this script. If you have access to Serpstat's API, you can do the same.


P. S.

As tradition requires, I'm sharing my scripts with you.
The online version is hosted on a regular, low-powered server, and it won't cope with parsing large numbers (>10,000) of keywords. So I recommend downloading the source code and running it on your own server for more reliability.
Note: If the script doesn't return any data, you probably didn't fill in the form correctly or your API token is inactive.

I don't claim that my algorithm and clustering script are perfect. But if you work with Serpstat's database often, they will save you time on processing the data manually. I hope this algorithm will be useful for you. If you have any questions, feel free to ask them in the comments.
