June 27, 2020

How To Use Serpstat API To Analyze A Niche And Hypothesize In SEO: WebX.page Experience

SEO small data: CSE, Serpstat API, R magic

Vladimir Nesterenko

Head of SEO at
WebX.page

Business owners want to see what the available resources can tell them before spending money on deeper research. This article shows how to use such "available" tools to get analytical data and draw insights from it. We will walk through the general workflow and answer the following questions: 'is a large number of backlinks critical for getting into the search top?', 'is domainRank important for the same purpose?' and 'does the number of indexed pages matter?'. With this basic knowledge, you can develop your own solutions to analyze the indicators you need.

Contents

1. What's the point
2. Why this particular choice of technology
3. Data acquisition
3.1 Custom Search Engine
3.2 Serpstat API
4. Analyzing results
5. Instead of conclusion

What's the point

I work for a pretty young startup. However surprising it might seem, we ran into the problem of getting organic traffic. Simply assembling a semantic core, calculating keyword density, and optimizing the character count of an article is not only boring but also unproductive. That's why we decided to gather all available data sources (both free and paid) and turn them into a helpful tool.

In particular, this article describes one of our first developments. It let us look at new niches as a whole, since it provided quite a wide range of data, and it was fast, affordable, and open to enhancement. Let's go through the steps taken and see how it all works.

Why this particular choice of technology

What tools were used in our work?

Custom Search Engine (CSE), a great Google product. Google's terms of use do not allow you to crawl the SERP itself, however much you might want to. CSE is primarily meant for organizing search on external websites (hence the name), but you can set Google as the source of results and work with the data obtained (without creating any extra load on the main search). One important point: results from direct parsing and from CSE can differ, but after running several thousand keywords I haven't noticed any dramatic differences that would distort the data. To avoid distraction later, you can get a Google Developer API key and a Custom Search Engine ID right away (by creating a new search engine).

Serpstat SEO API. Until this winter, the Serpstat API only provided keyword data, which is not relevant to this article. The update released in winter is far more interesting because it brought the Backlink API with it. I should mention that at the moment it has no data on specific URLs (the same as the leading service), so we'll have to take that into account later. You can get your API token on any page of this section or in your personal Serpstat account.

The storage place is a MySQL database. It's simple, convenient, quite universal.

The language to acquire the data is Java. It's just because I know its syntax, and I don't want to study Python or JavaScript for one case (although the correct choice would be Python for most cases these days, as it's a more popular and easier option).

Data is processed and managed with simple scripts written in R. At this point you can start downloading RStudio and R itself, which will execute everything you write in RStudio (both links point to the latest Windows packages; on *nix systems everything is simpler and is done from the terminal).

So, here's a complete set of tools. Let's get started.

Data acquisition

First of all, you should decide on which data you want to get. Using CSE you can get a position on SERP, URL, Title, and a range of other parameters. I suggest that we stick with the ones mentioned above and note the keyword used to run a search.

The Serpstat API will give us data from the getSummaryData report: it returns about 30 parameters (51 in total, but some are no longer active and will be removed in upcoming updates) and costs very few limits (1 per analyzed URL). It's also worth mentioning that very large websites (Google, Wikipedia, YouTube, etc.) are not supported and return errors (we'll talk about it later).

If you want to learn more about my solution, take the following steps:

Install MySQL on the local computer and/or create a new database, then add a user with ALL PRIVILEGES on it to manage it.

Download the Java JDK (version 1.7 or later).

Copy the repository from GitHub and work with it, or extract the files from here and launch the 'start' file.

On first launch, the program will ask for the name of the database created earlier, its address (note! the address must be in the ip:port format, without the database name in the URL) and the user's credentials for further work.

Get a list of keywords prepared for further work. The article goes on to use the semantics connected to 'website builder' :)

As I've described earlier, at this point we're using the Java + MySQL connection.

Custom Search Engine

While working with CSE, we're going to use a REST-based API that lets us emulate Google queries and get responses in the convenient JSON format. The Google Developer API key comes with a restriction: 100 queries a day. This limit can be raised to make your work more comfortable, but it's more than enough to get acquainted with the process.

The basic settings in the console are pretty simple. There are only two main points you should pay attention to:

[Screenshot: the two CSE console settings to pay attention to]

One more important point is that personalized search doesn't work in CSE. Ideally, you should decide in advance if that's crucial for you or not.

Here are the highlights for you if you're going to write your own solution:

You can find free libraries for working with CSE in almost all programming languages.

One query returns 10 search results, so without any special parameters you get the top-10 for each keyword. I was acquiring the top-20, which is possible by repeating the request with the "&start=n" parameter (for instance, "start=1" returns positions 1-10, while "start=11" returns positions 11-20).

Google provides a wide range of parameters to get more precise search results: from the region (google.com by default) to the IP address of the device that sends the query (but still without personalized results).
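To make the "start" paging above concrete, here is a minimal sketch in R (the article's own parser is written in Java, so this is an illustration, not the author's code). It only builds the request URLs for the Custom Search JSON API; the helper name cse_urls and the placeholder credentials are assumptions made up for the example.

```r
# Build Custom Search JSON API request URLs for one keyword.
# cse_urls is a hypothetical helper; api_key and cx are placeholders
# for your own Google Developer API key and Custom Search Engine ID.
cse_urls <- function(keyword, api_key, cx, depth = 20) {
  # One request returns 10 results, so top-20 needs start = 1 and start = 11.
  starts <- seq(1, depth, by = 10)
  sprintf(
    "https://www.googleapis.com/customsearch/v1?key=%s&cx=%s&q=%s&start=%d",
    api_key, cx, URLencode(keyword, reserved = TRUE), starts
  )
}

urls <- cse_urls("website builder", "YOUR_API_KEY", "YOUR_CSE_ID")
print(urls)
```

Each keyword analyzed to a depth of 20 therefore costs two of the 100 free daily queries.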

If you have launched a program from my repository, after entering all the required data you'll be requested to enter a keyword for analysis (one at a time). After that you will see the log of how it was processed. Each query acquires top-20.

Serpstat API

If you're planning to use it, you need at least the minimum Serpstat pricing plan. Sure, there are some restrictions (the Lite plan allows 1500 API queries per month), but that is just enough to analyze the backlink data of 1500 search results while you are getting acquainted with the tool.

At the time of writing the article, getSummaryData report is giving away 51 parameters + some general information via API: how many limits are left, how many pages there are in the current results, what the sorting order is, etc., which are of little importance for us.

Note: at the moment a number of parameters are marked as 'outdated'. That means they currently return 0 and will be removed in one of the following updates. Follow the updates!

Note: at the moment this report is only available for domains; page support will be added later. So you should cut the links received from CSE down to domains (or store only the domains in the first place). That can slightly affect the results of the analysis!
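Cutting a CSE result URL down to its domain can be sketched in base R as follows. This is only an illustration: the helper name to_domain is made up, and dropping a leading "www." is an assumption about how you want to group hosts, not a Serpstat requirement stated in this article.

```r
# Reduce result URLs to bare domains (hypothetical helper).
to_domain <- function(urls) {
  host <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", urls)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                      # drop path, query, fragment
  host <- sub(":[0-9]+$", "", host)                      # drop an explicit port
  host <- tolower(host)
  sub("^www\\.", "", host)  # assumption: treat www.example.com as example.com
}

print(to_domain(c("https://www.Example.com/path?x=1", "http://webx.page:8080/")))
```

Several results from the same site then collapse into one domain, which is exactly why the note above warns that the analysis may be slightly affected.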

In my solution I record and store all the fields, purely for simplicity, but the most important and frequently used ones are:

referringDomains;

referringSubDomains;

referringLinks;

totalIndexed;

noFollowLinks and doFollowLinks;

outlinksTotal;

domainRank;

reports on how the above-mentioned parameters have changed.

You should make a query to API as a POST-query with embedded JSON containing authorization data (API key) and query parameters (there are 3: the used method, id, and the array including the analyzed domain itself).

Query example:

Endpoint: https://api.serpstat.com/v4/?token=YOUR_TOKEN_HERE

POST body (id can be any set of letters and numbers, I usually use the domain of the working project; method is the API method used; query is the analyzed domain received from CSE earlier, shown here as a placeholder):

{
  "id": "webx.page",
  "method": "SerpstatBacklinksProcedure.getSummary",
  "params": {
    "query": "DOMAIN_TO_ANALYZE"
  }
}
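
The request body above can be assembled programmatically. The following is a minimal base-R sketch under stated assumptions: serpstat_body and serpstat_endpoint are hypothetical helper names, and the domain is assumed to contain no characters that need JSON escaping. Only the endpoint and the method name come from the example above.

```r
# Assemble the JSON body for one domain (hypothetical helper).
# Assumption: the domain contains no quotes or backslashes, so no escaping is done.
serpstat_body <- function(domain, id = domain) {
  sprintf(
    '{"id":"%s","method":"SerpstatBacklinksProcedure.getSummary","params":{"query":"%s"}}',
    id, domain
  )
}

# Endpoint with the API token passed as a query parameter.
serpstat_endpoint <- function(token) {
  paste0("https://api.serpstat.com/v4/?token=", token)
}

body <- serpstat_body("webx.page")
cat(body, "\n")
# Sending is left out on purpose; with the httr package it could look like
# httr::POST(serpstat_endpoint("YOUR_TOKEN_HERE"), body = body, httr::content_type_json())
```
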

Here's an example of the response received (outdated fields omitted, only the useful ones are shown):

{"id":"webx.page",
"result":{
"data":{"referringDomains":320,
"referringSubDomains":30,
"referringLinks":153445,
"totalIndexed":1591,
"externalDomains":321,
"noFollowLinks":21219,
"doFollowLinks":219879,
"referringIps":71,
"referringSubnets":57,
"outlinksTotal":50330,
"outlinksUnique":525,
"typeText":240651,
"typeImg":447,
"typeRedirect":0,
"typeAlt":0,
"referringDomainsDynamics":1,
"referringSubDomainsDynamics":0,
"referringLinksDynamics":210,
"totalIndexedDynamics":0,
"externalDomainsDynamics":0,
"noFollowLinksDynamics":2,
"doFollowLinksDynamics":1009,
"referringIpsDynamics":2,
"referringSubnetsDynamics":1,
"typeTextDynamics":1011,
"typeImgDynamics":0,
"typeRedirectDynamics":0,
"typeAltDynamics":0,
"threats":0,
"threatsDynamics":0,
"mainPageLinks":393,
"mainPageLinksDynamics":0,
"domainRank":26.006519999999998}
}
}

As you can see, there is quite a lot of data. The database is growing fast and there is, indeed, a lot to analyze.

For starters, 1500 queries available in the cheapest pricing plan would be perfectly enough. They'll give you detailed information on 75 keywords (20 search results for each).

Note: don't forget the Serpstat API rate limit on queries per second: it is one query per second on the minimum pricing plan. I'm in the habit of pausing for 10 seconds between queries in my solutions, as something may go wrong anytime :)
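The limit arithmetic and the pause between queries can be sketched like this. Both helper names are made up for illustration, and the loop takes any function in place of a real API call, so nothing here touches the network.

```r
# How many keywords fit into a monthly limit at a given SERP depth.
keywords_within_budget <- function(limits, depth = 20) limits %/% depth

# Apply fun to every item, pausing between calls (hypothetical helper).
# pause is in seconds; the article's habit is 10, the plan's minimum is 1.
throttled_lapply <- function(items, fun, pause = 10) {
  out <- vector("list", length(items))
  for (i in seq_along(items)) {
    out[i] <- list(fun(items[[i]]))          # list() keeps NULL results in place
    if (i < length(items)) Sys.sleep(pause)  # no pause after the last item
  }
  out
}

print(keywords_within_budget(1500))  # 1500 limits / 20 results per keyword
res <- throttled_lapply(c("a.com", "b.com"), toupper, pause = 0)
```

With the defaults, 1500 domains take a little over four hours, which is the price of the cautious 10-second pause.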

To sum up

So, here is the starter kit of a beginner in SEO data science:

A parser that acquires information from CSE and records it into the base.

A parser that acquires information on domains mentioned in the previous point from Serpstat and records it into the base.

The base itself.

Analyzing results

It's time for the most interesting part: create a data frame out of the results received and study it inside out, from every angle. Let's use the pre-downloaded R distribution and RStudio, the best IDE for working with R. A big advantage is that R is an open-source project, so it has lots of ready-made solutions, modules and libraries for working with data.

There are, of course, pre-built solutions (Power BI, Tableau, etc.), but they are mostly paid and not fully customizable. After all, it's much more interesting to do everything yourself while keeping the time spent to a minimum.

Let's get to work

So, the table contains the data on 75 keywords for the 'website builder' topic (in my case: excluding duplicates, implicit duplicates and explicit trash) and on 1500 domains from the top-20 for those keywords (actually a bit less, 1420, because Serpstat has no information on the 'large' domains, but it still took 1500 limits).

The RStudio interface may seem a bit bulky at first glance. That feeling should vanish right after the first graph you make on your own :)

For convenient work you'll need to take the data from the database, reformat it into a dataset (a convenient data unit in R) and go on to the analysis.

All the following scripts are also included in the repository with the rest of the code. You can copy them from there.

I store the data in two different tables: google_results and serpstat_results. The column names fully match the field names in the JSON responses of the respective APIs. So first the tables should be brought together into a manageable form, and only then imported.

To avoid struggling with a third table, let's perform a join (join on id, as linking on other columns may cause duplicate-related errors). To use SQL databases from R, you should install the additional RMySQL library. Let's call the combined data set 'Results'.

The first script will look like this:

install.packages('RMySQL')  # needed only once
library(RMySQL)
# Build the query: gr = google_results, sr = serpstat_results.
# To avoid duplicated columns, list explicitly which columns
# are taken from serpstat_results.
Query <- "
select
    gr.*,
    sr.referringDomains,
    sr.referringSubDomains,
    sr.referringLinks,
    sr.totalIndexed,
    sr.externalDomains,
    sr.noFollowLinks,
    sr.doFollowLinks,
    sr.referringIps,
    sr.referringSubnets,
    sr.trustRank,
    sr.citationRank,
    sr.domainZoneEdu,
    sr.domainZoneGov,
    sr.outlinksTotal,
    sr.outlinksUnique,
    sr.facebookLinks,
    sr.pinterestLinks,
    sr.linkedinLinks,
    sr.vkLinks,
    sr.typeText,
    sr.typeImg,
    sr.typeRedirect,
    sr.typeAlt,
    sr.referringDomainsDynamics,
    sr.referringSubDomainsDynamics,
    sr.referringLinksDynamics,
    sr.totalIndexedDynamics,
    sr.externalDomainsDynamics,
    sr.noFollowLinksDynamics,
    sr.doFollowLinksDynamics,
    sr.referringIpsDynamics,
    sr.referringSubnetsDynamics,
    sr.trustRankDynamics,
    sr.citationRankDynamics,
    sr.domainZoneEduDynamics,
    sr.domainZoneGovDynamics,
    sr.outlinksTotalDynamics,
    sr.outlinksUniqueDynamics,
    sr.facebookLinksDynamics,
    sr.pinterestLinksDynamics,
    sr.linkedinLinksDynamics,
    sr.vkLinksDynamics,
    sr.typeTextDynamics,
    sr.typeImgDynamics,
    sr.typeRedirectDynamics,
    sr.typeAltDynamics,
    sr.threats,
    sr.threatsDynamics,
    sr.mainPageLinks,
    sr.mainPageLinksDynamics,
    sr.domainRank 
from google_results gr  join serpstat_results sr on gr.id = sr.id
;
"

To run the query, connect to the database that stores the results. In the call below, replace USER_NAME with your database user, PASSWORD with that user's password, and MY_DB with the database name (the one specified when the main program was initialized). From here on it is assumed that the database runs on localhost:

Conn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB")

Run the query and save the result:

Results <- dbGetQuery(Conn, Query)

Disconnect from the database:

dbDisconnect(Conn)

Then use the simplest functions to check that the data has been obtained and that the analytical tool works overall. After $ specify the name of the column you want to inspect:

summary(Results$totalIndexed)

The expected output is a short summary of the column: minimum, first quartile, median, mean, third quartile and maximum.

One more basic function:

quantile(Results$totalIndexed, probs = c(1, 10, 25, 50, 75, 95, 99, 100)/100)

The expected output is the value of the column at each of the listed percentiles.

[Screenshot: output of summary() and quantile() for totalIndexed]

The data has been received, so let's create a data frame and move on to serious tools. This step isn't strictly necessary, but a data frame is the type of object that a number of packages require. Along with this, I recommend installing the set of packages you will use to process and present the data.

install.packages("ggplot2")

install.packages("plotly")

df <- as.data.frame(Results)

The big advantage of R in this case is that you now have both the source object ('Results', which contains the data imported from the database) and the new one (df, a data frame with the same results), and you can use either of them as needed. So now nothing prevents you from running the analysis.

What is the outcome

R's features are really extensive. Even 10 articles would not be enough to cover all of them (if only because there are already hundreds or maybe thousands of them). That's why I suggest we look at the most basic things that help us visualize particular data.

Also, I'm not going to describe the basic syntax (it can be easily googled or found in the official documentation) or the mathematical models applied (if you have gaps in basic mathematical analysis, it would be good to fill them or to take a relevant course. Or do as I did: ask a Data Scientist to explain the points I failed to understand :) )

For instance, you can build a graph of SERP positions for the selected keywords and look at the number of incoming links to the domain, summed up for each position (remember the Serpstat restriction to domain-level data? This is where the data may be slightly inaccurate).

library(plotly)
pg <- plot_ly(df, x = ~position, y = ~referringLinks, type = 'bar')
pg

We were summing up the values on the previous graph, and we can calculate the average on the following one, which will be more correct:
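
A minimal sketch of the averaged version, assuming the same column names as in df. The small df_demo data frame is made-up toy data standing in for the real results, and the plotly call is guarded so the block also runs where plotly is not installed.

```r
# Toy stand-in for df (invented numbers, only for illustration).
df_demo <- data.frame(
  position       = c(1, 1, 2, 2, 3),
  referringLinks = c(100, 300, 50, 150, 10)
)

# Average number of incoming links per position instead of the sum.
avg_links <- aggregate(referringLinks ~ position, data = df_demo, FUN = mean)
print(avg_links)

# Same bar chart as before, but built on the averaged values.
if (requireNamespace("plotly", quietly = TRUE)) {
  pg_avg <- plotly::plot_ly(avg_links, x = ~position, y = ~referringLinks, type = "bar")
}
```

With the real data, replace df_demo with df; swapping mean for median makes the chart less sensitive to a few domains with huge link counts.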

It is also very interesting to see how data of different types is distributed. For example, a boxplot can show the relationship between the position and the number of incoming links. Here the 'whiskers' of the boxes and the circles beyond them show the extreme positions within the selected range of backlinks, while the horizontal lines inside the boxes show the median SERP position for the corresponding number of backlinks. You can see a wide variation and assume 2 things:

You need a bigger range of data to analyze — definitely.

The number of backlinks in the selected semantics has a limited effect on the position — you should double-check it after receiving a new portion of data.
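
A boxplot of this kind can be sketched in base R as below. The bin boundaries and the df_demo numbers are assumptions chosen only for illustration; the column names follow df.

```r
# Toy stand-in for df (invented numbers, only for illustration).
df_demo <- data.frame(
  position       = c(1, 4, 9, 12, 3, 15, 20, 7),
  referringLinks = c(50, 900, 20000, 300, 7000, 80, 1500, 400)
)

# Group domains into backlink ranges (assumed boundaries).
df_demo$links_bin <- cut(
  df_demo$referringLinks,
  breaks = c(0, 100, 1000, 10000, Inf),
  labels = c("1-100", "101-1k", "1k-10k", "10k+")
)

# plot = FALSE returns the box statistics without drawing;
# drop it to draw the chart itself.
bp <- boxplot(position ~ links_bin, data = df_demo, plot = FALSE)

# Row 3 of bp$stats holds the median position of each backlink range.
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```
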

But since this is a first acquaintance with the technology, I should point out a feature that lets you get important numbers even before rendering anything.

For instance, tapply helps to view the median values of the backlinks count for domains, unique domains and indexed pages that are listed on the first and second SERP:
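
A sketch of such a tapply call, assuming that positions 1-10 form the first SERP and 11-20 the second. df_demo is made-up toy data, so its numbers say nothing about the real niche.

```r
# Toy stand-in for df (invented numbers, only for illustration).
df_demo <- data.frame(
  position       = c(1, 5, 10, 11, 15, 20),
  referringLinks = c(100, 200, 300, 400, 500, 900),
  totalIndexed   = c(50, 70, 90, 10, 20, 30)
)

# Assumption: positions 1-10 are page 1, positions 11-20 are page 2.
df_demo$page <- ifelse(df_demo$position <= 10, "page 1", "page 2")

# Median of each indicator per SERP page.
links_by_page   <- tapply(df_demo$referringLinks, df_demo$page, median)
indexed_by_page <- tapply(df_demo$totalIndexed, df_demo$page, median)
print(links_by_page)
print(indexed_by_page)
```

The same call works for referringDomains or any other numeric column of df.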

As you can see, with backlinks and referring domains things are not that simple: you need fewer of them to get to the first page than to the second one. Indexed pages, however, should definitely be more numerous. So it can be assumed that the quantity and quality of the pages seen by search engines matter much more than the number of domains referring to them.

But links are not the only important thing. Serpstat has an interesting parameter — domainRank (we might never learn how it is calculated), which can be used to hypothesize in different ways.

First, let's check if the data have been received correctly and we can deal with them:

A manual check confirms that the maximum value in the data set really is 50.00.

And let's check the basic information with tapply:

Here we can also see that the domains on the first page tend to have a lower domainRank than those on the second one.

It means that there are many active competitors outside the top, and they are raising their rank to be ready to get into the top-10.

Let's check the same thing but in terms of positions, not the whole pages. It'll help to make sure the calculations are right.
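
Grouping by position instead of by page is a one-line change. The sketch below uses made-up toy data in df_demo and takes the median as the statistic, which is an assumption carried over from the earlier backlink check.

```r
# Toy stand-in for df (invented numbers, only for illustration).
df_demo <- data.frame(
  position   = c(1, 1, 2, 2, 11, 11),
  domainRank = c(20, 30, 10, 40, 35, 45)
)

# Median domainRank for every single SERP position.
rank_by_position <- tapply(df_demo$domainRank, df_demo$position, median)
print(rank_by_position)
```

The result is named by position, so rank_by_position[["11"]] gives the value for position 11.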

As a result, we've used two different Serpstat API indicators without investigating them manually or trying to squeeze into Excel something that cannot fit there. Both helped us draw interesting conclusions and realize that in the niche under study we should pay more attention to the quality of content on individual pages than to popular indicators and ratings.

Once the data becomes available per specific URL, you can make more accurate predictions (e.g., calculate the median number of backlinks needed to reach the first position), which will have a good impact on your work. Or you can study the rest of the indicators to detect other patterns.

Instead of conclusion

This example of acquiring and working with data is just a small part of the features you can take advantage of. In our work we use datasets collected by analysts from different departments, but that's a topic for another article :)

What's important is that you get to use powerful SEO tools with a minimum of time and resources spent (you only pay for the Serpstat API plan). And this is the base on which other, more extensive data can be layered gradually.

It's worth remembering that search engine ranking algorithms change all the time, both in general and in individual niches (just remember the Medic update). So you should be able to keep up with the changes. Analyzing data from different sources lets you do that in time and even in advance.

May you all have high positions, no sanctions and nice user behavior on your websites! :)