12 jul 2024

What? A Synonyms API for Elasticsearch?

One often (over)used feature of lexical search is synonyms. If something does not work, we add a synonym. When used correctly, synonyms are a critical feature. Still, they used to be annoying to manage. I wrote multiple synonym management components that stored synonyms in a database and then provided them to Elasticsearch as a file or in the index settings. Little did I know there was a better way, until someone pointed me to the Synonyms API in Elasticsearch.

In this blog post, I demo the synonym API, explain its limitations, and show you how cool it is now.

What you need to know about Synonyms

Let us ask our friend “ChatGPT” what a synonym is.

A synonym is a word or phrase that has the same or nearly the same meaning as another word or phrase in the same language. For example, “happy” is a synonym of “joyful.” ~ ChatGPT

The Elasticsearch docs describe three things that synonyms allow you to do:

Improve search relevance by finding documents with words that have the same meaning as the word you are looking for.
Make domain-specific vocabulary more user-friendly, so non-experts can still find documents using the terms they know.
Correct common misspellings or typos to improve the results.

There are multiple ways to describe a synonym rule. Some rules allow replacing each word with the other supplied word(s).

happy, joyful

Other rules have a clear direction in replacement. The words on the left become the words on the right, not the other way around. In the example below, a text containing the word ipod is found when searching for i-pod, but not vice versa.

i-pod, i pod => ipod

These are two simple rules. Rules can become hard to understand when multiple rules are combined or when multi-word synonyms are used. Therefore, it is good to be able to play around with the synonyms before you are ready to use them for real.
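To make the difference between the two rule types concrete, here is a minimal sketch in plain Python. It is a simplified model of the rule syntax, not how Elasticsearch implements synonyms, and the function name parse_rule is made up for this illustration. It assumes equivalent rules expand every term to all terms in the rule.

```python
def parse_rule(rule: str) -> dict[str, list[str]]:
    """Map each input term to the terms it is rewritten to, for a single rule."""
    if "=>" in rule:
        # Explicit rule: terms on the left are replaced by the terms on the right.
        left, right = rule.split("=>", 1)
        targets = [t.strip() for t in right.split(",") if t.strip()]
        return {s.strip(): targets for s in left.split(",") if s.strip()}
    # Equivalent rule: every term expands to all terms in the rule.
    terms = [t.strip() for t in rule.split(",") if t.strip()]
    return {t: terms for t in terms}


print(parse_rule("happy, joyful"))
# {'happy': ['happy', 'joyful'], 'joyful': ['happy', 'joyful']}
print(parse_rule("i-pod, i pod => ipod"))
# {'i-pod': ['ipod'], 'i pod': ['ipod']}
```

Notice that ipod is not a key in the second result: the explicit rule only works from left to right.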

In Elasticsearch, synonyms are applied to a field through an analyzer. You can use a search analyzer to apply the synonyms to the query. You can also use an index analyzer, where the synonyms are applied while you index your documents. The advantage of index-time synonyms is query speed; the advantage of query-time synonyms is flexibility, because index-time synonyms need a re-index before a changed rule takes effect.
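One way to play around with an analyzer before touching an index is the _analyze API, which accepts an inline filter definition. The sketch below only builds the request; the helper name build_analyze_request is made up for this example, and the commented call assumes a connected client from the elasticsearch Python package and a running node.

```python
def build_analyze_request(text: str, rules: list[str]) -> dict:
    """Build keyword arguments for the _analyze API with an inline synonym filter."""
    return {
        "tokenizer": "standard",
        "filter": [
            "lowercase",
            # Inline filter definition, so no index or synonym set is needed
            {"type": "synonym_graph", "synonyms": rules},
        ],
        "text": text,
    }


request = build_analyze_request("huge burito", ["burito => burrito"])
print(request["filter"][1])

# With a connected client named es (an assumption), the request could be sent like this:
#   response = es.indices.analyze(**request)
#   print([token["token"] for token in response["tokens"]])
```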

The synonyms API

I cannot write a blog post without a demo, so I created a demo in Python to play around with the synonyms API. You can find the project here:

https://github.com/jettro/demo-es-synonyms

We create an index with settings and mappings that use a synonym set with synonym rules in the search analyzer. But before we do that, we have to create the synonym set. The next code block creates a synonym set.

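from elasticsearch import Elasticsearch, NotFoundError

# SYNONYM_SET_NAME is a module-level constant holding the id of the synonym set.


class EsClient:

  def __init__(self):
    self.es = self._connection()

  @staticmethod
  def _connection():
    # Connect to the Elasticsearch node
    es = Elasticsearch(hosts=['http://localhost:9200'])

    # Check if the connection is successful
    if es.ping():
      print("Connected to Elasticsearch")
    else:
      print("Could not connect to Elasticsearch")

    return es

  def create_synonym_set(self):
    # The client raises NotFoundError when the set does not exist yet
    try:
      self.es.synonyms.get_synonym(id=SYNONYM_SET_NAME)
      return "Synonym set already exists"
    except NotFoundError:
      # Create an empty set; rules are added later
      return self.es.synonyms.put_synonym(
        id=SYNONYM_SET_NAME, synonyms_set=[])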
With the synonym set in place, we can create the index. I show you the settings in the next code block. Notice that we create a filter that uses the synonyms_set we created in the previous step. We also create an analyzer named synonym_analyzer, which we use in the mappings.

settings = {
  "analysis": {
    "filter": {
      "synonym_filter": {
        "type": "synonym_graph",
        "synonyms_set": SYNONYM_SET_NAME,
        "updateable": True
      }
    },
    "analyzer": {
      "synonym_analyzer": {
        "tokenizer": "standard",
        "filter": ["lowercase", "synonym_filter"]
      }
    }
  },
  "number_of_shards": 1,
  "number_of_replicas": 0
}

We create an index with just one field called remarks. Below is the mapping configuration of the index.

mappings = {
  "properties": {
    "remarks": {
      "type": "text",
      "analyzer": "standard",
      "search_analyzer": "synonym_analyzer"
    }
  }
}
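With settings and mappings defined, creating the index is one call. The snippet below is a sketch of how that could look, not necessarily the code from the demo project. The function create_index is a hypothetical helper, and es is assumed to be a connected Elasticsearch client like the one created in EsClient.

```python
def create_index(es, index_name: str, settings: dict, mappings: dict) -> str:
    """Create the index unless it already exists; es is an Elasticsearch client."""
    if es.indices.exists(index=index_name):
        return f"Index {index_name} already exists"
    # The synonym set referenced in the settings must exist before this call
    es.indices.create(index=index_name, settings=settings, mappings=mappings)
    return f"Index {index_name} created"


# Usage, assuming a running node and the settings and mappings shown above:
#   print(create_index(client.es, "remarks", settings, mappings))
```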

Interacting with synonym rules using the API and the Python client is easy. Check the next code block.

def add_synonym_rule(self, rule_id, synonym):
  return self.es.synonyms.put_synonym_rule(
    set_id=SYNONYM_SET_NAME, 
    rule_id=rule_id, 
    synonyms=synonym)

def get_synonym_rules(self):
  return self.es.synonyms.get_synonym(id=SYNONYM_SET_NAME)

def delete_synonym_rule(self, rule_id):
  return self.es.synonyms.delete_synonym_rule(
    set_id=SYNONYM_SET_NAME, 
    rule_id=rule_id)

Notice that a synonym set has an ID. With that ID, we can add synonym rules. Each synonym rule also has an ID, which you can use to delete the rule.
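If you already keep your rules somewhere else, for instance in a database, you can also write a complete synonym set in one call instead of rule by rule. The sketch below converts a dictionary of rules into the list of objects that a synonym set expects. The helper to_synonyms_set is made up for this example, and the commented call assumes a connected client; note that writing a set this way replaces the existing rules in that set.

```python
def to_synonyms_set(rules: dict[str, str]) -> list[dict[str, str]]:
    """Convert {rule_id: rule} into the list-of-objects shape of a synonym set."""
    return [{"id": rule_id, "synonyms": rule} for rule_id, rule in rules.items()]


rules = {
    "burrito": "burito => burrito",
    "enormous": "huge, gigantic, colossal, enormous",
}
synonyms_set = to_synonyms_set(rules)
print(synonyms_set)

# With a connected client named es (an assumption), the whole set could be written at once:
#   es.synonyms.put_synonym(id=SYNONYM_SET_NAME, synonyms_set=synonyms_set)
```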

Working with synonyms

We index the following text:

After eating that enormous burrito, I felt so stuffed, packed, crammed, and jam-packed that I considered moving to a new zip code just to accommodate my bloated belly! ~ ChatGPT

First, I want to add a rule to accommodate a typo.

client.add_synonym_rule("burrito", "burito => burrito")

Next, I want to match words other than enormous.

client.add_synonym_rule("enormous", 
                        "huge, gigantic, colossal, enormous")

The following query should now return a response:

print(client.search_remark("huge burito"))

{
  'took': 7, 
  'timed_out': False, 
  '_shards': {
    'total': 1, 
    'successful': 1, 
    'skipped': 0, 
    'failed': 0}, 
  'hits': {
    'total': {
      'value': 1, 
      'relation': 'eq'
    }, 
    'max_score': 0.5753642, 
    'hits': [
      {
        '_index': 'remarks', 
        '_id': 'KSn8opABu_eifJwgH4rK', 
        '_score': 0.5753642, 
        '_source': {
          'remarks': 'After eating that enormous burrito, I 
          felt so stuffed, packed, crammed, and jam-packed 
          that I considered moving to a new zip code 
          just to accommodate my bloated belly!'
        }
      }
    ]
  }
}
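The search_remark method itself is not shown in this post. A minimal sketch of what such a search could look like is a match query on the remarks field, which uses the search analyzer with the synonym filter. Written here as a standalone function that takes the client as an argument; the signature and the default index name are assumptions, so check the demo project for the real implementation.

```python
def search_remark(es, query: str, index_name: str = "remarks"):
    """Run a match query against the remarks field and return the raw response."""
    # The match query analyzes the query text with the field's search_analyzer,
    # so the synonym rules are applied to the query terms.
    return es.search(index=index_name, query={"match": {"remarks": query}})


# Usage, assuming a running node with the index and rules from this post:
#   print(search_remark(client.es, "huge burito"))
```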

You can run the demo yourself; a screenshot of the demo is at the top of the article.

Limitations

Each synonym set can have a maximum of 10,000 synonym rules. I’m not sure that is really a limitation. If you need more than 10,000, I wonder whether you are misusing synonyms.

The API is not available in versions before 8.10.

References

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/synonyms-apis.html
