Keyword Clustering and Text Analysis
Keyword clustering and Text analysis consist of two tools:
1. Keyword clustering - a mass grouping of uploaded keywords based on their semantic similarity.
2. Text analysis - URL analysis and on-page SEO recommendations (currently under development).
Keyword clustering is the process of grouping a set of keywords in such a way that keywords in the same group (called a cluster) are more similar to each other than to those in other groups. The level of similarity or differentiation depends on the parameters you set.
Why do you need keyword clustering?
- Grouping of semantically related keywords
- Reliable automatic analysis of a set of keywords
- Collecting the right keywords for specific pages
- Creating a site’s SEO architecture
- Finding keywords that are unrelated to the topics of the clustered keywords
The most fundamental drawback of existing keyword clustering tools is that the resulting clusters may either contain keywords without a strong semantic similarity, or the analyzed
Unlike many competitors’ solutions, Serpstat employs intelligent hierarchical clustering, where clusters are combined into superclusters. No preliminary data collection, such as gathering keyword search volumes, is required: you only need to upload a list of keywords and choose the region and clustering parameters. The Serpstat clustering tool doesn’t set a cluster center (a keyword with the highest search volume that is compared with other keywords to count the matching URLs in the SERP). Instead, Serpstat looks for connections among all clustered keywords.
Let’s look in detail at the main settings of the tool.
In fact, there are only two of them: Weak/Strong and Soft/Hard.
The Weak parameter tells the system that, to be combined into a cluster, keywords must have at least 3 common URLs in the Top 30 search results for a keyword, while Strong requires 7 common URLs for keywords to merge into a single cluster.
The next clustering parameter choice is Soft/Hard.
Soft tells the system that a cluster can be created if at least one pair of keywords
Hard requires all keywords in a cluster to have 3 or 7 common URLs in the Top 30 search results for a keyword (the required number of common URLs is defined in the previous step, where you selected Weak or Strong clustering). The resulting clusters contain synonymous keywords with high semantic similarity. At the same time, this clustering method produces many clusters, since keywords can be merged into a cluster only if they are closely related.
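The interplay of the two settings can be illustrated with a minimal Python sketch. It is not Serpstat's actual algorithm, which is hierarchical and not described here. The sketch assumes that Soft means one qualifying pair is enough to attach a keyword to a cluster, that Hard means every pair in a cluster must qualify, and that `serps` (a hypothetical input) maps each keyword to its ranked list of result URLs.

```python
from itertools import combinations

# Thresholds from the text: Weak = 3 shared URLs, Strong = 7 shared URLs.
THRESHOLDS = {"weak": 3, "strong": 7}


def shared_urls(serps, a, b, top_n=30):
    """Count URLs that appear in the top-N results of both keywords."""
    return len(set(serps[a][:top_n]) & set(serps[b][:top_n]))


def cluster_keywords(serps, strength="weak", grouping="soft"):
    """Group keywords by SERP overlap (illustrative assumption, not the real tool)."""
    need = THRESHOLDS[strength]
    keywords = sorted(serps)
    # Every pair of keywords that shares enough URLs counts as "linked".
    linked = {
        frozenset(pair)
        for pair in combinations(keywords, 2)
        if shared_urls(serps, *pair) >= need
    }
    clusters = []
    for kw in keywords:
        if grouping == "hard":
            # Hard: join the first cluster where kw is linked to EVERY member.
            home = next(
                (c for c in clusters
                 if all(frozenset((kw, m)) in linked for m in c)),
                None,
            )
            if home is None:
                clusters.append([kw])
            else:
                home.append(kw)
        else:
            # Soft: merge all clusters that have AT LEAST ONE member linked to kw.
            touching = [
                c for c in clusters
                if any(frozenset((kw, m)) in linked for m in c)
            ]
            merged = [kw]
            for c in touching:
                merged.extend(c)
                clusters.remove(c)
            clusters.append(merged)
    return sorted(sorted(c) for c in clusters)
```

With the same input, the hard variant can only return the same or smaller groups than the soft one, which matches the description above. Single-keyword groups in the output play the role of unclustered keywords in this sketch.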
Strength shows how closely a keyword is semantically related to the cluster’s topic on a scale from 0 to 1.
Upon clustering completion, a portion of the initial keyword set may appear in the Unsorted directory. These are keywords that were not assigned to any cluster. One reason may be that they have no semantic similarity to the topic of the analyzed keyword set and should be removed from the dataset. Alternatively, you can create separate pages for these keywords or move them to one of the created clusters if you believe they belong there.
Which clustering method is right for you?
The decision should be based on the semantic similarity of the objects from your dataset.
If the keywords are initially closely related, for example, sneakers of different brands, you may want to choose Strong+Hard or Strong+Soft so that only the closest synonyms are combined into a cluster. You’ll get lots of clusters to use for separate pages or specific categories.
In the case of various products and services, for example, a keyword collection for a multi-product store or a medical center with a full range of health-care services, it’s worth selecting Weak+Soft. Choosing Strong+Soft instead will produce more clusters, which may be more topic-specific.
Meta-top is a list of major competitors in SERP for keywords from a cluster. The higher a page’s rank in the meta-top, the more relevant it is to the cluster’s topic.
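The text does not say how pages are ranked within the meta-top. As a rough illustration only, the Python sketch below assumes a page ranks higher the more keywords of the cluster it appears for. The function name `meta_top` and the `serps` mapping (keyword to ranked result URLs) are hypothetical.

```python
from collections import Counter


def meta_top(serps, cluster, top_n=30, limit=10):
    """Rank URLs by how many of the cluster's keywords they appear for.

    Assumed scoring for illustration; the tool's real formula is not documented here.
    """
    counts = Counter()
    for keyword in cluster:
        # set() ensures a URL is counted at most once per keyword.
        counts.update(set(serps[keyword][:top_n]))
    # Highest coverage first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```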
Setting up a clustering project
Go to the Tools section and open Keyword clustering and Text analysis.
Click Create a project.
Name your project and input a domain name (optional).
Input a list of keywords or upload them from a file.
Choose a search engine and region.
Finally, choose Linkage strength, Type of grouping and click Finish.
The resulting clusters will look like this:
Here, 1 is a protocluster, 2 is a supercluster, and 3 is a cluster.
A supercluster is a set of clusters. It combines keywords with high semantic similarity, though slightly lower than that of keywords within a single cluster.
A protocluster is a set of superclusters. Generally, a protocluster is made up of superclusters related to a specific category of objects. For example, if you’re developing SEO architecture for a multi-product store, one protocluster may contain superclusters associated with different types of refrigerators, and another may contain superclusters for microwave ovens of different brands. Protoclusters are designed to streamline work with superclusters.
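The three-level hierarchy can be pictured as nested containers. The Python sketch below is only a mental model: the class and field names are hypothetical and do not reflect Serpstat's internal data model or export format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    name: str
    keywords: List[str] = field(default_factory=list)


@dataclass
class Supercluster:
    name: str
    clusters: List[Cluster] = field(default_factory=list)


@dataclass
class Protocluster:
    name: str
    superclusters: List[Supercluster] = field(default_factory=list)

    def all_keywords(self) -> List[str]:
        """Flatten every keyword stored anywhere under this protocluster."""
        return [
            kw
            for sc in self.superclusters
            for cluster in sc.clusters
            for kw in cluster.keywords
        ]


# Example mirroring the multi-product store case from the text.
fridges = Protocluster(
    "refrigerators",
    [
        Supercluster(
            "two-door refrigerators",
            [Cluster("buy", ["buy two-door fridge", "two-door fridge price"])],
        ),
        Supercluster(
            "mini refrigerators",
            [Cluster("buy", ["buy mini fridge"])],
        ),
    ],
)
```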
Here's the breakdown of the above figure:
1. Every keyword from a cluster has its connection strength. It provides a hint of how close that keyword is to the cluster's topic on a scale from 0 to 1.
2. Homogeneity shows the semantic consistency of a cluster on a scale from 0 to 1.
3. If you specified a domain while creating a project, we'll look at your website's pages and display the page which is the closest to the cluster's topic in the URL field. If you
You can launch Text analysis for any keyword cluster.
Each cluster has a drop-down menu:
1. Add keywords — opens a window where you can add some keywords to the existing cluster.
2. Search keywords — opens a search box where you can look for specific keywords in the cluster.
3. Delete keywords — deletes checked keywords from the cluster.
5. Delete group — deletes the cluster from your project.
Serpstat Text Analysis (hereinafter TA) provides recommendations for improving your on-page SEO: what changes to make on your page to better optimize it for keywords from a cluster, or which keywords to include in the page content if you’re optimizing a page from scratch. It is available for the following languages: Russian, Ukrainian, English, German, and Bulgarian.
TA analyzes the text on the landing page (if a URL has been specified), the list of keywords from a cluster and a set of pages from the Top 15 search results for keywords from the list. We assume that the search engine considers the text on those pages relevant to the researched search queries if the pages are displayed in Top 15 search results.
If a target URL is specified, the TA tool analyzes the text content of your page and suggests lexical items to add to the page. The suggestions are based on the text content of top pages for keywords from the cluster.
If you didn’t specify a URL, recommendations are based on the largest group of related competitors. In this case, Serpstat can’t be sure that the proper group of competitor URLs has been selected. For example, we can’t reliably tell whether informational or commercial pages are your direct competitors, since the search results for keywords from the cluster may contain different types of pages, and the right group can only be selected by analyzing the text on your page. Also, the report won’t display your relevance to keywords in comparison with competitors’ relevance.
Serpstat TA algorithms stand out for their ability to eliminate semantic noise and prevent irrelevant search queries or search results from distorting the analysis. Serpstat splits the set of top pages for keywords from the cluster into groups based on their content (videos, informational pages, e-commerce pages, catalogs, etc.) and identifies which group your landing page belongs to. TA then analyzes the text content of the selected group of URLs and suggests which on-page text units can be added or modified to boost rankings.
This intelligent selection of analysis objects keeps irrelevant URLs out of the researched dataset. Filtering can even go as far as including or excluding pages that contain videos, depending on whether our page’s main content is a video (for which we’re selecting a title, description, page Title, etc.). In contrast, other text analytics tools analyze the whole set of URLs from the SERP without paying attention to the page topic, which hurts the validity of the recommendations; imagine you’re researching the relevance of your e-commerce page to the keyword ‘buy
Serpstat TA splits the text area of a page into three major parts: Title, H1, and Body. SEO recommendations are made for each of these areas. Page relevance to the cluster keywords is reported as a standalone metric for each of the researched keywords. In simple terms, the analysis proceeds as follows: first, Serpstat collects unique keywords from the pages’ respective areas (Title, H1, Body), then creates groups of pages based on their topical similarity, and lastly, provides recommendations for on-page SEO and page-to-keyword relevance scores.
In our TA, we abandoned the common practice of suggesting a number of keyword occurrences on a page in favor of a percentage score that shows a keyword’s relative importance to a particular topic.
Choose a keyword cluster you’d like to analyze, input a URL and click Start analysis.
Upon completion, click See results.
Let’s take a look at the first section of the report. It presents a list of the analyzed keywords, Proximity level, and Relevance.
Proximity level - a score on a scale from 0 to 100% that shows how close a keyword is to the cluster topic in terms of semantic similarity.
Relevance - a relative score that shows how well your page is optimized for the keyword compared with top pages from the SERP. Hover over the color strip to view the minimum, average, and maximum relevance scores for the keyword among your competitors, as well as your own relevance to the keyword.
Below are recommended words for Title, H1 and Body sections of your page.
Recommendation words - words suggested for your Title when the Status column shows Not included. We display recommended words as lemmas, but you can use them in whatever form suits your text.
LSI Rank - a word’s importance for the Title on a scale from 0 to 100%. It’s calculated as the ratio of the recommended word’s occurrences to the occurrences of other words in the “bag of words” we collected from the Titles of competitor URLs.
Chance - the percentage of competitors from the analyzed group that have the word in their Title.
Status - the column may have three values. Included - the word is present in the Title on the landing page. Not included - the page doesn’t have the word in its Title.
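To make these columns more concrete, here is a minimal Python sketch that computes comparable numbers from a list of competitor Titles. It rests on several assumptions: words are only lowercased rather than lemmatized, LSI Rank is read as the word's share of all tokens in the bag of words (so that it stays within 0 to 100%), and the function names are hypothetical. The real tool's formulas may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a real tool would lemmatize, this sketch does not."""
    return re.findall(r"\w+", text.lower())


def title_word_stats(competitor_titles):
    """Return {word: (lsi_rank_pct, chance_pct)} for words in competitor Titles."""
    bags = [tokenize(title) for title in competitor_titles]
    occurrences = Counter(word for bag in bags for word in bag)
    total = sum(occurrences.values())
    stats = {}
    for word, count in occurrences.items():
        # Assumed reading of LSI Rank: share of all bag-of-words tokens.
        lsi_rank = 100 * count / total
        # Chance: percentage of competitor Titles that contain the word.
        chance = 100 * sum(word in bag for bag in bags) / len(bags)
        stats[word] = (round(lsi_rank, 1), round(chance, 1))
    return stats


def status(word, page_title):
    """Report whether the word already appears in the page's own Title."""
    return "Included" if word in tokenize(page_title) else "Not included"
```

A word that appears in every competitor Title gets a Chance of 100%, while its LSI Rank depends on how many other words the Titles contain.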
Recommendations for H1 and
The last part of the report presents the text from the Body area of the landing page. Upon analyzing the Body text on a group of competitor URLs, we provide insights on the minimum, average and maximum number of words in Body area among your competitors and the number of words on your page.
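The word-count comparison can be sketched in a few lines of Python. This assumes words are separated by whitespace and that the Body text has already been extracted from each page; `body_length_report` is a hypothetical helper, not part of the tool.

```python
from statistics import mean


def body_length_report(competitor_bodies, own_body):
    """Compare the page's Body word count with competitors' min, average, and max."""
    counts = [len(text.split()) for text in competitor_bodies]
    return {
        "min": min(counts),
        "average": round(mean(counts)),
        "max": max(counts),
        "yours": len(own_body.split()),
    }
```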