I run a query A on Elasticsearch and take the first 50 results. I also run a query B that contains 30% of the terms of query A. Each result of query A has a similarity score scoreA, and each result of B has a similarity score scoreB. What I am trying to achieve is to combine the results of A and B to improve on the Mean Average Precision of each individual query. One way I found is to reorder the results based on this formula:

SIMnew = λ*scoreA + (1-λ)*scoreB

where λ is a hyperparameter that I should tune. I noticed that this formula is very similar to Jelinek-Mercer smoothing, which is implemented in Elasticsearch (https://www.elastic.co/blog/language-models-in-elasticsearch).

Is there a built-in way to do this reordering in Elasticsearch, or is a custom implementation the only option?

(Given that I searched a lot for this formula and didn't find anything useful, it would be great if someone could give me an intuition of how and why this works.)

  • May I ask you how exactly do you compute the scores scoreA and scoreB? Similarity score to which you refer is confusing to me, ES has got a notion of relevance score, is it what you are referring to? Or these scores are something external? Are they computed in the query or are stored inside the documents? Thank you. – Nikolay Vasiliev Jun 7 at 17:08
  • I refer to the similarity score that Elasticsearch computes between a query and a document (elastic.co/guide/en/elasticsearch/reference/current/…). In this case I use the default similarity, BM25. – user11559048 Jun 8 at 9:56

In Elasticsearch, the results of different queries are commonly combined with a bool query. The way their scores are combined can be adjusted with a function_score query.

If you need to combine different per-field scoring functions (also known as similarities), for instance to run the same query with BM25 and with DFR and combine the results, indexing the same field several times via multi-fields (the fields mapping parameter) can help.

Now let me explain how this thing works.

Find official website of David Gilmour

Let's imagine we have an index with the following mapping and example documents:

PUT mysim
{
  "mappings": {
    "_doc": {
      "properties": {
        "url": {
          "type": "keyword"
        },
        "title": {
          "type": "text"
        },
        "abstract": {
          "type": "text"
        }
      }
    }
  }
}

PUT mysim/_doc/1
{
  "url": "https://en.wikipedia.org/wiki/David_Bowie",
  "title": "David Bowie - Wikipedia",
  "abstract": "David Robert Jones (8 January 1947 – 10 January 2016), known professionally as David Bowie was an English singer-songwriter and actor. He was a leading ..."
}

PUT mysim/_doc/2
{
  "url": "https://www.davidbowie.com/",
  "title": "David Bowie | The official website of David Bowie | Out Now ...",
  "abstract": "David Bowie | The official website of David Bowie | Out Now Glastonbury 2000."
}

PUT mysim/_doc/3
{
  "url": "https://www.youtube.com/channel/UC8YgWcDKi1rLbQ1OtrOHeDw",
  "title": "David Bowie - YouTube",
  "abstract": "This is the official David Bowie channel. Features official music videos and live videos from throughout David's career, including Space Oddity, Changes, Ash..."
}

PUT mysim/_doc/4
{
  "url": "www.davidgilmour.com/",
  "title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",
  "abstract": "David Gilmour is a guitarist and vocalist with British rock band Pink Floyd, and was voted No. 1 in Fender's Greatest Players poll in the February 2006 Guitarist ..."
}

Practically speaking, we have the official website of David Gilmour, the official website of David Bowie, and two other pages about David Bowie.

Let's try to search for David Gilmour's official website:

POST mysim/_search
{
  "query": {
    "match": {
      "abstract": "david gilmour official"
    }  
  }
}

On my machine this returns the following results:

    "hits": [
...
        "_score": 1.111233,
        "_source": {
          "title": "David Bowie | The official website of David Bowie | Out Now ...",
...
        "_score": 0.752356,
        "_source": {
          "title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",
...
        "_score": 0.68324494,
        "_source": {
          "title": "David Bowie - YouTube",
...

For some reason, David Gilmour's page is not the first one.

If we take 30% of the terms from the first query, as the original post asks (let's cunningly select gilmour to make our example shine), we should see an improvement:

POST mysim/_search
{
  "query": {
    "match": {
      "abstract": "gilmour"
    }  
  }
}

Now Elasticsearch only returns one hit:

    "hits": [
...
        "_score": 0.5956734,
        "_source": {
          "title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",

Let's say we don't want to discard all the other results; we just want to reorder them so that David Gilmour's website ranks higher. What can we do?

Use simple bool query

The purpose of the bool query is to combine the results of several queries in an OR, AND or NOT fashion. In our case we could go with OR:

POST mysim/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "abstract": "david gilmour official"
          }
        },
        {
          "match": {
            "abstract": "gilmour"
          }
        }
      ]
    }
  }
}

This seems to do the job (on my machine):

    "hits": [
...
        "_score": 1.3480294,
        "_source": {
          "title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",
...
        "_score": 1.111233,
        "_source": {
          "title": "David Bowie | The official website of David Bowie | Out Now ...",
...
        "_score": 0.68324494,
        "_source": {
          "title": "David Bowie - YouTube",
...

Under the hood, the bool query simply sums the scores of the subqueries that a document matches. In this case the top hit's score, 1.3480294, is the sum of that document's scores for the two stand-alone queries we ran above:

>>> 0.752356 + 0.5956734
1.3480294000000002
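The same summation can be replayed for the whole result list. The following is only a sketch: the scores are copied from the hits shown above, the short document labels are made up for illustration, and any hit hidden behind the "..." in the outputs is left out. A document that does not match a subquery simply gets no contribution from it.

```python
# Scores copied from the two stand-alone searches above.
# The labels are illustrative names, not Elasticsearch ids.
scores_full = {   # "david gilmour official"
    "bowie_official": 1.111233,
    "gilmour_official": 0.752356,
    "bowie_youtube": 0.68324494,
}
scores_short = {  # "gilmour"
    "gilmour_official": 0.5956734,
}


def sum_scores(*per_query_scores):
    """Add up each document's score over all queries.

    A query the document did not match contributes nothing,
    which mimics how the should clauses of a bool query add up.
    """
    combined = {}
    for scores in per_query_scores:
        for doc, score in scores.items():
            combined[doc] = combined.get(doc, 0.0) + score
    return combined


combined = sum_scores(scores_full, scores_short)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Only the David Gilmour page matches both subqueries, so it is the only one whose score changes, and that is enough to move it to the top.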

But this might not be good enough. What if we want to combine these scores with different coefficients?

Combine queries with different coefficients

To achieve this we can use function_score query.

POST mysim/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "match": {
                "abstract": "david gilmour official"
              }
            },
            "boost": 0.8
          }
        },
        {
          "function_score": {
            "query": {
              "match": {
                "abstract": "gilmour"
              }
            },
            "boost": 0.2
          }
        }
      ]
    }
  }
}

Here we implement the formula from the original post with a weight of λ = 0.8 on the first query and 1 - λ = 0.2 on the second.

    "hits": [
...
        "_score": 0.8889864,
        "_source": {
          "title": "David Bowie | The official website of David Bowie | Out Now ...",
...
        "_score": 0.7210195,
        "_source": {
          "title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",
...

On my machine this still produces "wrong" ordering.

But changing λ to 0.4 (and therefore the second boost to 0.6) seems to do the job! Hooray!
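To see why 0.8 fails and 0.4 works, the weighted sum can be recomputed offline from the stand-alone scores reported above. This is a sketch under the assumption that the boosts simply scale each subquery's score before the bool query adds them up; the function names, the variable lam (the weight on the first query) and the document labels are illustrative.

```python
def interpolate(scores_a, scores_b, lam):
    """Return lam * scoreA + (1 - lam) * scoreB per document.

    A document missing from one result list gets a score of 0 for it.
    """
    docs = set(scores_a) | set(scores_b)
    return {
        doc: lam * scores_a.get(doc, 0.0) + (1.0 - lam) * scores_b.get(doc, 0.0)
        for doc in docs
    }


def rank(scores):
    """Document labels ordered by descending combined score."""
    return sorted(scores, key=scores.get, reverse=True)


# Scores copied from the stand-alone searches above (labels are illustrative).
scores_a = {
    "bowie_official": 1.111233,
    "gilmour_official": 0.752356,
    "bowie_youtube": 0.68324494,
}
scores_b = {"gilmour_official": 0.5956734}

results = {lam: interpolate(scores_a, scores_b, lam) for lam in (0.8, 0.4)}
for lam, combined in results.items():
    print(lam, rank(combined))

# The Gilmour page overtakes the Bowie page when
#   lam * a_gilmour + (1 - lam) * b_gilmour > lam * a_bowie,
# i.e. when lam is below this crossover value.
a_bowie = scores_a["bowie_official"]
a_gilmour = scores_a["gilmour_official"]
b_gilmour = scores_b["gilmour_official"]
crossover = b_gilmour / (a_bowie - a_gilmour + b_gilmour)
print(crossover)
```

For these four documents the crossover is roughly 0.62, so any weight below that on the first query puts David Gilmour's website first. On real data the useful range has to be found by tuning against relevance judgments.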

What if I want to combine different similarities?

If you need to go deeper and change how Elasticsearch computes relevance per field (the scoring model, which Elasticsearch calls similarity), you can do so by defining a custom similarity.

In a case that I can hardly imagine, you may want to combine, say, BM25 and DFR scoring. Elasticsearch only permits one scoring model per field, but it also allows indexing the same field several times via multi-fields.

The mapping might look like this:

PUT mysim
{
  "mappings": {
    "_doc": {
      "properties": {
        "url": {
          "type": "keyword"
        },
        "title": {
          "type": "text"
        },
        "abstract": {
          "type": "text",
          "similarity": "BM25",
          "fields": {
            "dfr": {
              "type": "text",
              "similarity": "my_similarity"
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}

Notice that here we defined a new similarity called my_similarity which effectively computes DFR (example taken from the documentation).

Now we will be able to do a bool query with a combination of similarities in the following way:

POST mysim/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "abstract": "david gilmour official"
          }
        },
        {
          "match": {
            "abstract.dfr": "david gilmour official"
          }
        }
      ]
    }
  }
}

Notice that we do the same query to two different fields. Here abstract.dfr is a "virtual" field with scoring model set to DFR.
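If the two similarities should not count equally, the function_score wrapper with a boost from the earlier section can be reused here. The helper below is a hypothetical sketch, not part of any Elasticsearch client: it only builds the request body as a Python dict, and the 0.5 weights are arbitrary placeholders. Scores from different similarity models may not be on the same scale, so the weights would need tuning.

```python
import json


def weighted_should_query(clauses):
    """Build a bool/should request body from (field, text, weight) triples.

    Each clause is a match query wrapped in function_score, whose boost
    scales that clause's score before the bool query adds the clauses up.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "function_score": {
                            "query": {"match": {field: text}},
                            "boost": weight,
                        }
                    }
                    for field, text, weight in clauses
                ]
            }
        }
    }


body = weighted_should_query([
    ("abstract", "david gilmour official", 0.5),      # scored with BM25
    ("abstract.dfr", "david gilmour official", 0.5),  # scored with DFR
])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a POST mysim/_search request.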

What else should I consider?

In Elasticsearch scores are computed per-shard, which can lead to unexpected results. For example, IDF is computed not on the whole index, but only on the subset of documents that are in the same shard.
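A rough illustration of the effect, using the IDF formula of Lucene's BM25 similarity, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number of documents containing the term. The shard layout is hypothetical, and the rest of BM25 (term frequency, length normalization) is ignored.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF part of BM25 as computed by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# "gilmour" occurs in 1 of the 4 example documents.
idf_single_shard = bm25_idf(doc_count=4, doc_freq=1)

# Hypothetical layout: the same index split into two shards of 2 documents,
# so the shard holding the Gilmour page only sees 1 match out of 2 documents.
idf_small_shard = bm25_idf(doc_count=2, doc_freq=1)

print(idf_single_shard, idf_small_shard)
```

The term looks less rare inside the small shard, so the same document would get a lower score there. With many documents spread evenly across shards the difference usually becomes small, and Elasticsearch also offers search_type=dfs_query_then_fetch to score with index-wide term statistics.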

The relevance scores themselves are computed by Lucene, Elasticsearch's backbone, and its documentation describes how.


Hope that helps!

  • Thanks for the detailed answer. It surely helped a lot. – user11559048 Jun 10 at 9:09
