SEO 18 min read November 4, 2020

Search Quality Metrics: What They Mean And How They Work

Stacy Mine
Editorial Head at Serpstat
Learning to Rank is used in information retrieval, natural language processing, and data mining. Since the inception of search engines, significant progress has been made in this area: from naive search to the most complex algorithms and machine learning.
This article covers what we know about the metrics search engines use, the fundamental problems, and the existing learning approaches.
What is Learning to Rank
Let's start with the central concept in the question: ranking. Ranking is a way of ordering items (sites, videos, images, news posts, etc.) by their relevance.

By relevance, we mean the degree to which an object relates to a specific query. Suppose we have a query and several objects that match it to one degree or another. The better an object matches the query, the higher its relevance. The task of ranking is to return the most relevant objects in response to the query. The higher the relevance, the more likely the user is to take the targeted action (visit the page, buy a product, watch a video, etc.).

As information retrieval systems develop, ranking becomes more and more important. The problem arises everywhere: in ordering search results pages and in recommending videos, news, music, goods, and more. Learning to Rank exists for this purpose.

Learning to Rank, or machine-learned ranking (MLR), is a branch of machine learning that studies and develops self-learning ranking algorithms. Its main task is to determine the most effective algorithms and approaches based on their qualitative and quantitative assessment. Why did the problem of learning to rank arise in the first place?

For example, let's take a page of an information resource: an article. The user enters a query into a search engine whose index already contains a collection of documents. In accordance with the query, the system retrieves the matching documents from the collection, ranks them, and returns the most relevant ones first.

Ranking is performed with a scoring model f(q, d), where q is the user's query and d is a document. Classical f(q, d) models work without self-learning and don't consider relationships between words (for example, Okapi BM25, the vector space model, and BIR models). They calculate a document's relevance to the query based on the occurrence of the query's words in each document. Obviously, with the current volume of documents on the Internet, search results based on such simple models may not be accurate enough.
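As a toy illustration of such an occurrence-based model (a deliberately naive stand-in, not BM25 or any real engine's scorer), f(q, d) can be sketched as:

```python
# Toy classical f(q, d): score a document by how often the query's words
# occur in it, ignoring any relationships between the words.
def f(q, d):
    words = d.lower().split()
    return sum(words.count(term) for term in q.lower().split())

docs = ["the cat sat on the mat", "dogs and cats", "a treatise on ranking"]
query = "cat on mat"
ranked = sorted(docs, key=lambda d: f(query, d), reverse=True)
print(ranked[0])  # "the cat sat on the mat"
```

A scorer this simple is trivially gamed: repeating the query's words in a document inflates its score, which is exactly the weakness discussed below.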
Such a model calculates the relevance of a document to a query based on the occurrence of query words in each document. Under ideal conditions, when the documents being analyzed were written to solve the task at hand, such algorithms do their job well, and this worked successfully for a long time.

However, such algorithms have one significant drawback: the input data must strictly follow the rule, and the rule must strictly follow the author's task.

Suppose we set ourselves the task of manipulating the results of the algorithm's work. We can solve it easily, since the algorithm's design assumes from the outset that no one will try to manipulate it.

So the problem arose when search began to be monetized. As soon as it did, there was an incentive not just to submit documents for analysis but to present them in a way that gains an advantage over competitors. That is why, today, search results based on simple models cannot be sufficiently accurate.
Demi Murych, Reverse Engineering and Technical SEO Specialist
Therefore, the trend has changed: machine learning has replaced the simple classical model to improve search quality. Machine learning methods made it possible to build a ranking model automatically. Such a model considers many relevance factors that previously could not be taken into account, for example, anchor texts, page authority, natural-language analysis, and page user experience.
Learning to Rank is currently one of the key tasks of modern web search. Over time, the most common metrics for assessing search quality have become established in this area.
How Learning to Rank is implemented
Learning to Rank is a full-fledged machine learning task that includes training and testing.

Training data includes queries, documents, and the degree of relevance of each document to its query.
Each query is associated with several documents. It is not feasible to judge the relevance of all documents, so pooling is usually used: only the top few documents retrieved by existing ranking models are judged. Training data can also be obtained automatically, for example from Google SearchWiki, or by analyzing click logs and query chains.

The degree to which a document matches a query is determined in several ways. The most common approach assumes that a document's relevance is scored against several criteria: the better the match on a given criterion, the higher the score for it. Relevance labels come from search engine assessors and take values from 0 (irrelevant) to 5 (completely relevant). The scores for all criteria are then summed.

As a result, the document with the highest total score across all criteria is considered the most relevant. This training data is used to build ranking algorithms that calculate the relevance of documents to real queries.

However, there is an important nuance here: user queries must be processed at high speed, so complex scoring schemes cannot be applied to every document for every query. Therefore, evaluation is carried out in two stages:
1
Using simpler, cheaper algorithms, a small sample of potentially relevant documents is selected. This allows queries to be evaluated quickly.
2
This sample is then re-ranked using more complex (and more resource-intensive) machine learning models.
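The two stages can be sketched as follows. Both scoring functions here are illustrative stand-ins: in a real system, the first stage would be an inverted-index lookup with a cheap score, and the second a trained ranking model.

```python
# Stage 1: a cheap filter picks the top-k candidates.
def cheap_score(query, doc):
    # Toy score: number of query words present in the document.
    return len(set(query.split()) & set(doc.split()))

# Stage 2: a costlier scorer re-ranks only the surviving candidates.
def expensive_score(query, doc):
    # Stand-in for an ML model: overlap, lightly penalized by length mismatch.
    return cheap_score(query, doc) / (1 + abs(len(doc.split()) - len(query.split())))

def two_stage_rank(query, docs, k=100):
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)

docs = ["learning to rank", "rank the learning long document about many things", "cooking"]
print(two_stage_rank("learning to rank", docs, k=2))  # the exact match comes first
```

The point of the split is cost: the expensive model runs only on k candidates instead of the whole collection.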
Ranking attributes
During the training and operation of MLR, each query-document pair is translated into a numerical vector of ranking factors and other signals. These characterize the relationship between the query and the document, as well as their own properties.

These attributes fall into three groups:
1
Static attributes are independent of the query and refer to the document itself. For example, its length or the authority of the page (PageRank). These attributes are computed during indexing and can be used for a static assessment of document quality, which helps speed up the evaluation of a search query.
2
Query attributes are those that depend only on the query. For example, its length or what it is about.
3
Dynamic attributes are those that depend on both the document and the query. For example, how well the document matches the query.
This approach allows you to provide the user with the most accurate results on the SERP. Ranking features are collected in LETOR, a collection of benchmarks for research in learning to rank in information retrieval.
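A hypothetical sketch of building such a vector, with one attribute from each group. The attribute choices and field names here are illustrative, not any particular engine's feature set:

```python
# Turn a query-document pair into a small feature vector:
# one static, one query-only, and one dynamic attribute.
def features(query, doc):
    terms = query.lower().split()
    words = doc["text"].lower().split()
    return [
        doc["pagerank"],                     # static: depends on the document only
        len(terms),                          # query attribute: query length
        sum(words.count(t) for t in terms),  # dynamic: query-document term overlap
    ]

doc = {"pagerank": 0.42, "text": "Learning to rank tutorial"}
print(features("learning to rank", doc))  # [0.42, 3, 3]
```

A real MLR vector would hold hundreds of such components, but the shape is the same: numbers derived from the document, the query, or the pair.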
For a result to be considered relevant for a given search query, it must (1) provide a satisfactory amount of high-quality content, (2) in a straightforward and organized manner, (3) that addresses the correct or most likely intent(s) of the query.

To learn more about content quality, I suggest reading Google's search quality evaluator guidelines.

It's also important that the result has few or no relevance issues. An example of such an issue occurs when a page loses its helpfulness for the query with the passage of time. This is very common, for example, for queries with news intent, whose results can quickly become stale if they don't include the latest developments in the target story.

I believe that understanding the qualitative aspect of how search relevance works is much more important for marketers than trying to understand the actual metrics and science behind information retrieval systems. To achieve that objective, here is what I suggest:

Get into the habit of putting yourself in the search engine's shoes – and, yes, I know that everybody talks about the user's shoes, but I'm trying to offer another perspective. If you were in charge of Search at Google, how would you assess the level of Expertise, Authority and Trust of a given website? How would you determine the characteristics of the most relevant results for a given query? Does your content possess these characteristics for your target queries? Reading Google's guidelines and staying up to date with what's happening in SEO and Search can definitely be of help.
Danilo Godoy, founder of Search Evaluator
Search ranking metrics
Both binary (relevant/irrelevant) and multilevel (for example, relevance from 0 to 5) scales are used to evaluate each document returned in response to a query. In practice, queries can be ambiguous and have different shades of relevance. For example, the query "dog" is ambiguous: the machine doesn't know whether the user is looking for the planet Dog, an album by Blink-182, or the rapper Snoop Dogg.

In information retrieval theory, there are many metrics for assessing an algorithm's performance on training data and for comparing different learning-to-rank algorithms. Public sources state that they are produced in relevance scoring sessions, where judges evaluate the quality of search results. However, common sense tells us that such an option is hardly possible, and here's why:
There are no judges in machine learning, because even 100.5 million people could not answer enough times to collect the data pool needed for a high-quality forecast.
That is why we have systems that can:

Recognize cats: because over the Internet's existence, people have produced billions of ready-labeled images of cats.

Determine the naturalness of a language: because we have digitized many books in that language, and we know for sure that it is natural.

But we cannot assess the relevance of a query to a site, because we do not have such data. Even if you simply sit assessors down to click, they will not cope with the task, since relevance is not just a match between text and query; it is hundreds of other factors that a person cannot yet assess in a reasonable time.
Demi Murych, Reverse Engineering and Technical SEO Specialist
Nevertheless, in information retrieval theory, there are well-established indicators of assessment described by authoritative sources. Here are some of them.
MRR: Mean Reciprocal Rank
It is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness. The mean reciprocal rank is defined as the average of the reciprocal ranks over all Q queries:

MRR = (1 / |Q|) × Σ (1 / rank_i), summed over i = 1 … |Q|,

where rank_i is the position of the first relevant document for query i.

This is the simplest metric of the three: it measures where the first relevant item appears. It is closely linked to the binary relevance family of metrics.

This method is simple to compute and easy to interpret, and it focuses on the first relevant element of the list. It is best suited for targeted searches, such as a user asking for the "best item for me," and for known-item search, such as navigational queries or looking up a fact.

The MRR metric doesn't evaluate the rest of the list of recommended items; it focuses on a single item from the list.

It gives a list with a single relevant item just as much weight as a list with many relevant items. That is fine if that is the target of the evaluation.

However, it might not be a fair evaluation metric for users who want a list of related items to browse, since their goal may be to compare multiple associated items.
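A minimal sketch of computing MRR, assuming each query's results are given as a list of binary relevance labels in rank order:

```python
def mean_reciprocal_rank(rankings):
    """rankings: one list of 0/1 relevance labels per query, in rank order."""
    total = 0.0
    for ranking in rankings:
        for pos, rel in enumerate(ranking, start=1):
            if rel:
                total += 1.0 / pos  # only the first relevant item counts
                break
    return total / len(rankings)

# Query 1: first hit at position 2 (1/2); query 2: first hit at position 1 (1/1).
# The second relevant item in query 2 is ignored, illustrating the limitation above.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 1]]))  # 0.75
```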
MAP: Mean Average Precision
MAP, or mean average precision, averages precision over an entire query set, rewarding rankings that place relevant results near the top. Rather than assuming there is only one winner, MAP asks of each returned result whether it is relevant, and gives credit for every "yes."

MAP is well suited to ranking tasks where you care about five or more results at once. That makes it ideal for evaluating related recommendations, such as on an eCommerce platform.
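A sketch of MAP for binary labels: average precision is computed per query (precision at each relevant position, divided here by the number of relevant items retrieved) and then averaged over queries.

```python
def average_precision(ranking):
    """ranking: 0/1 relevance labels in rank order for one query."""
    hits, score = 0, 0.0
    for pos, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            score += hits / pos  # precision at each relevant position
    return score / hits if hits else 0.0

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

# [1, 0, 1]: precision 1/1 at position 1 and 2/3 at position 3 -> AP = 5/6
print(round(average_precision([1, 0, 1]), 3))  # 0.833
```

Note that some definitions normalize by the total number of relevant documents in the collection rather than the number retrieved; the version above uses the retrieved count.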
DCG (Discounted cumulative gain) and NDCG (Normalized Discounted Cumulative Gain)
To understand the NDCG metric, we must first understand CG (Cumulative Gain) and DCG (Discounted Cumulative Gain), as well as the two assumptions we make when using DCG and related metrics:
Highly relevant documents are more useful when they appear earlier in the search results list (have higher ranks).
Highly relevant documents are more useful than marginally relevant documents, which in turn are more useful than non-relevant documents.
If each recommendation has a relevance score associated with it, the CG is the sum of the relevance scores of all results in the list:

CG_p = rel_1 + rel_2 + … + rel_p

This is the cumulative gain at position p in the ranking, where rel_i is the graded relevance of the result at position i. Each relevance score is associated with a document.

The problem with CG is that it doesn't consider the position of each result when determining the usefulness of the result set. In other words, if we reorder the relevance scores, the CG stays the same, so it cannot tell a good ordering from a bad one.

For example:
Metric Set A: [3, 1, 2, 3, 2, 0] → CG = 11
Metric Set B: [3, 3, 2, 2, 1, 0] → CG = 11
Obviously, Metric Set B returns a much more useful result than Metric Set A, but the CG measure says they return equally good results.

To overcome this, DCG is introduced. DCG penalizes highly relevant documents that appear lower in the search results by discounting their relevance scores logarithmically with the position of the result:

DCG_p = Σ rel_i / log2(i + 1), summed over i = 1 … p
But with DCG, a problem arises when we want to compare search engines' performance from one query to another, because the list of search results can vary in length depending on the query. Therefore, by normalizing the cumulative gain at each position by the ideal value for the query, we arrive at NDCG.

We accomplish this by sorting all the relevant documents in the corpus by their relative relevance, producing the largest possible DCG through position p (the ideal DCG, or IDCG):

NDCG_p = DCG_p / IDCG_p

where the ideal discounted cumulative gain IDCG_p is the DCG_p of the same results re-sorted in decreasing order of relevance.
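Putting DCG and NDCG together in a short sketch, using the log2(i + 1) discount and the two metric sets from the example above:

```python
import math

def dcg(scores):
    # Discount each graded relevance score by log2(position + 1).
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(scores, start=1))

def ndcg(scores):
    ideal = dcg(sorted(scores, reverse=True))  # IDCG: best possible ordering
    return dcg(scores) / ideal if ideal else 0.0

set_a = [3, 1, 2, 3, 2, 0]
set_b = [3, 3, 2, 2, 1, 0]
print(sum(set_a), sum(set_b))      # CG is 11 for both
print(ndcg(set_b))                 # 1.0: already ideally ordered
print(ndcg(set_a) < ndcg(set_b))   # True: NDCG sees the difference CG misses
```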
Conclusion
The topic of metrics for assessing search quality is relevant and important today. But, unfortunately, we can talk about it only from a theoretical point of view. In our case, metrics for evaluating search quality are mostly speculation, and here's why: we cannot have all the data that Google has.

This means that choosing a methodology that would let us pin down the instrument is closer to fortune-telling, since we cannot verify anything here.

Yes, you can try to rely on patents or on statements published by this or that official. But 90% of all patents are rubbish that has nothing to do with the running code. And such phrases are only part of the puzzle, which is further complicated by the fact that all these people are bound by such a strict NDA that even the phrases supposedly thrown out by accident are written into the contract.

And all we can do is analyze and combine data from various sources, resulting in a document that describes the theoretical foundations of how search quality can be assessed.

Good luck and high positions to everyone!