
Metrics in Information retrieval

08 Nov 2019

Reading time ~8 minutes

Information Retrieval

Information Retrieval is a field of computer science that, as the name suggests, is about retrieving information from a corpus based on some context. The context can include a query, the user's location, the time, etc. The result can either be members of the corpus or some new document that combines information from the corpus. This post uses the problem of ad-hoc retrieval to cover different metrics that can be used in a variety of classification tasks.

Ad-Hoc retrieval

Figure: Example of ad-hoc retrieval. We present a query to the IR system, which then retrieves documents from the corpus related to the query.

Ad-hoc retrieval is the task of taking some query and returning a ranked list of relevant documents. The first step of this process is to convert the query into features based on its text and perhaps also the location, time, and date of the query. We must then rank the documents according to the query; this ranking can be influenced both by the contents of a document and by its metadata, for instance how popular or how recent the document is.

The purpose of this post is not to describe how to design such a system (that post can be found here: Implementing ad-hoc retrieval), but to describe how such a system should perform.

In general I find it helpful to reason about properties and metrics used for a problem before diving into the implementation.

A good IR system

How do we construct a good IR system? Below we list a series of properties that are desirable for an IR system.

  • Semantic Understanding: The ability to understand the meaning of and connections between terms. If we search for "Data-science programming languages" we expect the results to contain Python and R, although they are not explicitly mentioned in the query.

  • Robustness to rare inputs: Some queries contain terms that have only been seen a few times, or perhaps never before. The system must be able to handle this, either by focusing on the terms in the query it has seen before or by using the metadata.

  • Robustness to corpus variance: The underlying corpus can change over time. For instance, the answer to "Who is the president of the US" changes every 4 to 8 years.

  • Robustness to variable length inputs: The length of documents in the corpus can vary; a good IR system handles this in such a way that document length alone doesn't influence relevance. For instance, if a 5000-word document mentions word x 10 times and a 100-word document mentions it 5 times, how do we judge relevance? If we only count word occurrences, the first document is chosen. If we normalize for document length, we get the second document (see the sketch after this list).

  • Robustness to input errors: The system must be able to handle spelling errors, missing words such as "the" and "a", etc.

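To make the length-normalization trade-off described in the list above concrete, here is a small sketch; the document sizes and counts are the ones from the example, and the scoring choices are mine, not from any particular system.

```python
# Toy illustration of the document-length trade-off: a raw term count favours
# the long document, while a length-normalized term frequency favours the
# short one.
docs = {
    "long_doc": {"length": 5000, "occurrences_of_x": 10},
    "short_doc": {"length": 100, "occurrences_of_x": 5},
}

for name, d in docs.items():
    raw_count = d["occurrences_of_x"]
    normalized = d["occurrences_of_x"] / d["length"]
    print(f"{name}: raw count = {raw_count}, normalized frequency = {normalized:.4f}")

# Output:
# long_doc: raw count = 10, normalized frequency = 0.0020
# short_doc: raw count = 5, normalized frequency = 0.0500
```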
Metrics of information retrieval systems

The properties above, while desirable, are hard to quantify. Our goal is to find numeric metrics with which we can compare systems. Let's formulate the problem explicitly:

We are given a corpus of documents, which we call the set $D$, and from it we construct a model. This model takes a query $q$ and must then return a ranked list of documents, which we call $R_q$.

What does it mean to perform well in the above definition? If we use $+$ to denote a relevant document and $-$ to denote an irrelevant one, we can look at two rankings, for example $R_1 = [+, +, -, +]$ and $R_2 = [-, +, -, -]$.

Precision & Recall

Rank 1 has 3 relevant documents while rank 2 has 1. Using the number of relevant documents in the ranking seems a good starting point for measuring performance. To make it a proper metric we would like a general formula whose value lies in a known interval such as $[0, 1]$. We can achieve this by normalizing the sum, dividing by the ideal solution, which is equal to the length of the list since an ideal list contains only relevant documents. This metric is called the precision, and for the query $q$ it is formulated as:

$$\text{precision}(q) = \frac{1}{|R_q|} \sum_{d \in R_q} \text{rel}(d, q)$$

The function $\text{rel}(d, q)$ is new; it is a binary indicator function that returns $1$ if the document $d$ is relevant to the query $q$ and $0$ otherwise.
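As a minimal sketch of the formula above (the boolean-list encoding of relevance judgments is my own choice for illustration), precision can be computed like this:

```python
def precision(ranking: list[bool]) -> float:
    """Fraction of the returned documents that are relevant."""
    return sum(ranking) / len(ranking)

# The two example rankings, with True marking a relevant document.
rank_1 = [True, True, False, True]
rank_2 = [False, True, False, False]

print(precision(rank_1))  # 0.75
print(precision(rank_2))  # 0.25
```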

We now have our first metric. Before we jump in and code a system, we should take a step back and see if our metric can be abused. If I'm confident that one document is relevant to the query, I can get a precision score of 1 by returning only that document. This is equivalent to Google, when given the query "Berlin", returning only the Wikipedia article for Berlin. The result has a perfect score according to our metric, but it does not feel correct.

We also want our system to be rewarded for returning multiple relevant documents. As with the precision, we take a ratio, this time between the relevant documents in the ranking and the relevant documents in the corpus. For our example with the two rankings above, if we say that the corpus has 4 relevant documents for the query, the ratios are Rank 1: $3/4$ and Rank 2: $1/4$. This metric is called the recall and can be formulated as:

$$\text{recall}(q) = \frac{\sum_{d \in R_q} \text{rel}(d, q)}{|\{d \in D : \text{rel}(d, q) = 1\}|}$$
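Continuing the boolean-list sketch, recall only needs one extra piece of information, the number of relevant documents in the corpus (assumed to be 4 here, as in the example):

```python
def recall(ranking: list[bool], num_relevant_in_corpus: int) -> float:
    """Fraction of all relevant documents in the corpus that were returned."""
    return sum(ranking) / num_relevant_in_corpus

rank_1 = [True, True, False, True]
rank_2 = [False, True, False, False]

print(recall(rank_1, num_relevant_in_corpus=4))  # 0.75
print(recall(rank_2, num_relevant_in_corpus=4))  # 0.25
```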

The recall of a system can also be gamed: when given a query we need not even read it; instead we return the entire corpus as our ranking. We know that every relevant document is in it and thus we have perfect recall.

F1 scores

As seen from their formulas and flaws, precision and recall are two sides of the same coin. On their own they cannot guarantee an accurate measure of a system's performance, but in combination the two metrics give a good indicator: if both the recall and the precision are close to 1, we know that most of our ranking contained relevant documents, and that most of the relevant documents in the corpus were in the ranking.

The F1 score takes the harmonic mean of precision and recall, which is computed as:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

The F1 score is not as easy to interpret as precision or recall, but it matches the actual performance of the system better.
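A minimal sketch of the harmonic mean; the helper name is my own:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.75, 0.75))  # 0.75
print(f1_score(1.0, 0.1))    # ~0.18: high precision cannot hide poor recall
```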

Order-based metrics

Reciprocal Ranking

Precision and recall do not take the order of the ranking into account. The ranking $[+, +, -, -]$ has a recall and a precision of $1/2$ (still assuming 4 relevant documents in the corpus). If we reverse the ranking to $[-, -, +, +]$ the scores do not change. This seems wrong: we would like our metric to reward placing relevant documents first in the ranking. We let $\text{rank}(d)$ be the placement of a document $d$ in the ranking; $\text{rank}(d) = 1$ means that it is the first document in the ranking. We compute the so-called reciprocal ranking (RR) for the query as:

$$\text{RR}(q) = \frac{1}{\min\{\text{rank}(d) : d \in R_q,\ \text{rel}(d, q) = 1\}}$$

that is, one over the position of the first relevant document.

The RR scores for the two rankings $[+, +, -, -]$ and $[-, -, +, +]$ are $1$ for the first case and $1/3$ for the second.
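A sketch of the reciprocal ranking in the same boolean-list encoding; it only looks at the position of the first relevant document:

```python
def reciprocal_rank(ranking: list[bool]) -> float:
    """1 / position of the first relevant document (0 if none is relevant)."""
    for position, relevant in enumerate(ranking, start=1):
        if relevant:
            return 1 / position
    return 0.0

print(reciprocal_rank([True, True, False, False]))  # 1.0
print(reciprocal_rank([False, False, True, True]))  # 0.333...
```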

While it does add extra information to our two other metrics, RR has the problem that the two rankings $[+, -, -, -]$ and $[+, +, +, +]$ still have the same score, although the second is clearly better. We improve our metric once again.

Average precision

We want each relevant document to contribute to the score in a way proportional to its place in the ranking. We turn back to our first metric, the precision: if we compute it at each step in the ranking and average the results, we get the desired effect. We compute the average precision (AP) for the query by:

$$\text{AP}(q) = \frac{1}{|\{d \in D : \text{rel}(d, q) = 1\}|} \sum_{k=1}^{|R_q|} \text{precision}(R_q^k) \cdot \text{rel}(d_k, q)$$

We let $R_q^k$ be the first $k$ elements of the ranking $R_q$ and $d_k$ the document at position $k$. The formula is becoming a bit more complex and more cumbersome to write. In the following example we omit the cases where $\text{rel}(d_k, q) = 0$ since they won't contribute to the sum. Given the two rankings from before, $[+, -, -, -]$ and $[+, +, +, +]$, and still assuming 4 relevant documents in the corpus, we compute their average precision:

$$\text{AP}_1 = \frac{1}{4} \cdot \frac{1}{1} = 0.25 \qquad \text{AP}_2 = \frac{1}{4} \left( \frac{1}{1} + \frac{2}{2} + \frac{3}{3} + \frac{4}{4} \right) = 1$$

Thus we have a metric that takes the order of the entire ranking into account.
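A sketch of average precision under the same assumptions (boolean rankings and 4 relevant documents in the corpus):

```python
def average_precision(ranking: list[bool], num_relevant_in_corpus: int) -> float:
    """Average of precision@k over the positions that hold relevant documents."""
    total, relevant_seen = 0.0, 0
    for k, relevant in enumerate(ranking, start=1):
        if relevant:
            relevant_seen += 1
            total += relevant_seen / k  # precision of the first k elements
    return total / num_relevant_in_corpus

print(average_precision([True, False, False, False], 4))  # 0.25
print(average_precision([True, True, True, True], 4))     # 1.0
```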

Average-based metrics

In the above examples we only took a single query into account, which doesn't represent the system's general performance. To measure the performance of an entire system we should take the mean of the metric over many queries. The mean average precision, or MAP, of a system is a popular metric when the system has a binary relevance function.
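MAP is then just the mean of AP over a set of evaluation queries. A sketch, reusing the hypothetical `average_precision` helper from above on two made-up queries:

```python
def mean_average_precision(queries: list[tuple[list[bool], int]]) -> float:
    """Mean of AP over (ranking, num_relevant_in_corpus) pairs."""
    return sum(average_precision(ranking, n) for ranking, n in queries) / len(queries)

# Two hypothetical evaluation queries.
evaluation = [
    ([True, False, False, False], 4),
    ([True, True, True, True], 4),
]
print(mean_average_precision(evaluation))  # 0.625
```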

Non-binary relevance functions

Modeling a ranking as a series of $1$'s and $0$'s has its limits; we can instead model relevance on a scale. Let's say that instead of a binary relevance function we get a score in some interval $[0, n]$, where $0$ indicates not relevant and $n$ indicates most relevant. We can then use the discounted cumulative gain (DCG):

$$\text{DCG}(q) = \sum_{k=1}^{|R_q|} \frac{\text{rel}(d_k, q)}{\log_2(k + 1)}$$

Let's do an example again and compare two rankings whose documents have relevance scores on a scale from $0$ to $3$, say $[3, 2, 0, 1]$ and $[1, 2, 3, 0]$:

$$\text{DCG}_1 = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5} \approx 4.69$$

$$\text{DCG}_2 = \frac{1}{\log_2 2} + \frac{2}{\log_2 3} + \frac{3}{\log_2 4} + \frac{0}{\log_2 5} \approx 3.76$$

We see that the first solution, which put the most relevant document first, got the higher score. But the result is no longer a metric between $0$ and $1$. To achieve this we divide by the ideal DCG (IDCG), the score of the ranking sorted by relevance; in our case the ideal ordering $[3, 2, 1, 0]$ would score

$$\text{IDCG} = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{1}{\log_2 4} + \frac{0}{\log_2 5} \approx 4.76$$

We divide our results by this value and get the normalized scores $\text{NDCG}_1 \approx 0.985$ and $\text{NDCG}_2 \approx 0.790$.
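A sketch of DCG and NDCG for graded relevance scores, matching the illustrative numbers above; note that other discount and gain variants of DCG exist, this uses the $\log_2(k + 1)$ formulation from the formula:

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain with a log2(k + 1) position discount."""
    return sum(rel / math.log2(k + 1) for k, rel in enumerate(relevances, start=1))

def ndcg(relevances: list[float]) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(round(ndcg([3, 2, 0, 1]), 3))  # 0.985
print(round(ndcg([1, 2, 3, 0]), 3))  # 0.79
```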

Conclusion

We now know how to compare and measure the performance of different ad-hoc retrieval models. This should make the models themselves easier to understand, since we now know the metrics they will be evaluated against.

The inspiration for this post came from the book Neural Information Retrieval; check it out if you want to read more.


