Write a custom Lucene.net TokenFilter

In an e-commerce site, search is one of the most important features, not only from the UX perspective (you can find many articles about that on the net), but also in terms of the results returned to users. They must be able to quickly find what they are looking for, so in my opinion it is important to tolerate misspelled words and to return the same results for singulars and plurals. In this project I used RavenDB for the read model (everything is based on CQRS+ES), which includes the powerful text search engine Lucene. Lucene alone does a lot of the work, but the .NET port is a bit behind the original Java version (at least at the time of writing), so I had to write or convert from Java the English, Italian and Spanish implementations. I also tweaked the French one slightly before using it with RavenDB. Even so, one problem remained unsolved, and it turned out to be crucial.

The problem with brand names

When the beta phase of this e-commerce site started, it emerged immediately that the majority of users, when looking for something, started with the brand name. The problem was that only a minority of them wrote the brand name correctly.

For example, for the brand K-Force, we saw searches for kforce, k force, just force and, rarely, k-force.

This caused erratic results because of how Lucene indexes words. In fact, none of these searches returned results except k force and force, because the StandardTokenizer splits k-force into just the tokens k and force.

Obviously it was not acceptable that k-force was not indexed at all.

What we wanted instead was to get the same results for all the search variants shown above.

Implementing the snowball analyzer

As a first attempt, I tried implementing the Snowball analyzer, creating one each for English, Spanish, French and Italian.

The implementation is straightforward: create a new class library project, import the Lucene.net NuGet package and create a class like the following:

using Lucene.Net.Analysis.Snowball;
using Lucene.Net.Util;

namespace Evoltel.Lucene.Net.Analyzers.Snowball
{
  public class EnglishSnowballAnalyzer : SnowballAnalyzer
  {
    public EnglishSnowballAnalyzer()
      : base(Version.LUCENE_30, "English")
    {

    }
  }
}

Once you have compiled the project, copy the necessary DLLs into the Analyzers folder of the RavenDB installation, restart the service and you are ready to go. Also reference the DLLs in your RavenDB project, so that you can create indexes based on your analyzer in your application or website.

Unfortunately the results were far from acceptable, because the Snowball analyzer, a stemming analyzer based on the project at snowball.tartarus.org, turned out to be too aggressive for our needs.

For example, in the case of Italian, many words went missing during an incremental search. Looking for “pantaloncini” (shorts in English) returned results until the user had typed “pantalon”, but on reaching “pantalonc” the results were suddenly empty.

Plurals and singulars were handled almost perfectly but, as you can imagine, an incremental search that suddenly stops returning results was a big problem. Moreover, the problem with brand names was not solved.

Implementing a custom token filter

After some research and some hours of trial and error, I was able to write a custom token filter from scratch and insert it into my custom language analyzers.

In combination with the correct language analyzer, the search can return consistent results for brand names, singulars and plurals.

To obtain this, every analyzer must include my new filter in its TokenStream chain. Here, as an example, is my EnglishAnalyzer:

    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {      
      TokenStream result = new WhitespaceTokenizer(@reader);
      result = new StandardFilter(result);
      result = new EnglishPossessiveFilter(result);
      result = new LowerCaseFilter(result);
      //My use case wants all the words, so no skipping of stopwords
      //result = new StopFilter(StopFilter.GetEnablePositionIncrementsVersionDefault(matchVersion), result, stopTable);
      result = new PorterStemFilter(result);
      //Create multiple tokens and removing symbols
      result = new SymbolsFilter(result);
      return result;
    }

The other analyzers are the same, except for language-specific steps. One worth mentioning is the FrenchAnalyzer: I had to recompile it because its standard behavior was problematic for our use case. Accents were indexed, meaning that if a user searched without typing the correct accents, no results were shown.

To solve this, I edited the sequence by inserting the ASCIIFoldingFilter before my SymbolsFilter, like the following:

    public override TokenStream TokenStream(String fieldName, TextReader reader)
    {
      TokenStream result = new WhitespaceTokenizer(@reader);
      result = new StandardFilter(result);
      result = new FrenchStemFilter(result, excltable);
      result = new LowerCaseFilter(result);
      //Remove accents
      result = new ASCIIFoldingFilter(result);
      result = new SymbolsFilter(result);
      return result;
    }
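To make the accent problem concrete, here is a minimal Lucene-free sketch of the kind of folding involved. It is not the ASCIIFoldingFilter implementation: the class and method names are hypothetical, and it only strips combining accent marks using the standard .NET Unicode normalization APIs, whereas the real filter also maps other non-ASCII characters.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentFoldingSketch
{
    // Hypothetical helper, not part of Lucene.Net.
    // Decomposes each character (e.g. "é" into "e" plus a combining accent),
    // drops the combining marks and recomposes what is left.
    public static string RemoveAccents(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // NonSpacingMark is the Unicode category of combining accents.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                builder.Append(c);
        }
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this sketch, RemoveAccents("crème brûlée") gives "creme brulee". Because the analyzer runs at both index time and query time, a search typed with or without accents ends up on the same folded term.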

Creating the TokenFilter

Now that we have an idea of what I wanted to achieve, let us see how to do it. The first step is to create a new TokenFilter that, for every word, will create one or more tokens.

Using k-force as our example, this filter will generate tokens for kforce, k and force, thus returning the correct results even if the user misspells the name. The token k-force itself comes from the WhitespaceTokenizer; as stated before, the StandardTokenizer is not suitable because it creates only the k and force tokens.

In the constructor of our new class, we initialize the ITermAttribute that will hold our tokens and the IPositionIncrementAttribute that tracks the position of each token inside the TokenStream. We obtain both attributes with the AddAttribute method.

    public class SymbolsFilter : TokenFilter
    {
      private readonly ITermAttribute termAtt;
      private readonly IPositionIncrementAttribute posAtt;

      public SymbolsFilter(TokenStream input)
        : base(input)
      {
        termAtt = AddAttribute<ITermAttribute>();
        posAtt = AddAttribute<IPositionIncrementAttribute>();
      }
    }  

What we need to do now is override the IncrementToken method. This method tells us whether there are more tokens to process.

As you can read in the documentation, to preserve the state we need to call CaptureState, which creates a copy of the current attributes.

We also need to maintain a queue of the tokens that we create by analyzing each input token. So just add these two fields to the class:

  private State currentState;
  private Queue<string> splittedQueue = new Queue<string>();

Now we are ready to override IncrementToken. We need to analyze the token and, if it is the first time we are processing it, fill our queue of tokens. In my case, I only need to split it if it contains one or more ‘-’ characters, nothing more.

Our method can be split into four phases. The first checks whether the queue holds any tokens; if so, we dequeue one, call RestoreState, set it as the current term and set PositionIncrement to zero. We restore the state and zero the increment because the position in the TokenStream should not advance until we are done with our queue.

    public override bool IncrementToken()
    {      
      if (splittedQueue.Count > 0)
      {
        string splitted = splittedQueue.Dequeue();
        RestoreState(currentState);
        termAtt.SetTermBuffer(splitted);
        posAtt.PositionIncrement = 0;
        return true;
      }
      // Phases two to four (shown below) go here, before the method's closing brace.
    }

The second step is to check whether there are more tokens to process; if not, we exit.

      if (!input.IncrementToken())
        return false;  

The third phase verifies whether this is the first time we are processing a term and, if so, populates our queue with the resulting tokens. Obviously, the method always returns at least the token being processed.

    var currentTerm = new string(termAtt.TermBuffer(), 0, termAtt.TermLength());
    if (!string.IsNullOrEmpty(currentTerm))
    {
      var splittedWords = GetSplittedWord(currentTerm);
      //If there are no words, we exit and go to the next token
      if (splittedWords == null || splittedWords.Length == 0)
        return true;
      foreach (var splittedWord in splittedWords)
        splittedQueue.Enqueue(splittedWord);
    }
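GetSplittedWord is called above, but its body is not shown in this article. The following is only a guess at a minimal version, based on the k-force example: it assumes that '-' is the only symbol to handle and that the extra tokens are the joined form followed by the individual parts. The class name SymbolsSplitter is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class SymbolsSplitter
{
    // Assumed behavior: "k-force" gives { "kforce", "k", "force" };
    // a term without '-' gives an empty array, so the caller emits it unchanged.
    public static string[] GetSplittedWord(string term)
    {
        if (string.IsNullOrEmpty(term) || term.IndexOf('-') < 0)
            return new string[0];

        // Ignore empty pieces produced by leading, trailing or doubled hyphens.
        string[] parts = term.Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries);

        var result = new List<string>();
        string joined = string.Concat(parts);
        if (joined.Length > 0)
            result.Add(joined);

        foreach (string part in parts)
        {
            // Skip duplicates, e.g. "force-" would otherwise yield "force" twice.
            if (!result.Contains(part))
                result.Add(part);
        }
        return result.ToArray();
    }
}
```

Inside the filter, the same method body could live as a private method of SymbolsFilter, so that the call in the third phase resolves without a class prefix.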

The fourth and last step is to save the current state in a field using CaptureState:

    currentState = CaptureState();
    return true;
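To see what the four phases produce together, here is a small simulation that does not depend on Lucene. It mirrors the control flow of IncrementToken with a plain iterator and yields each term with the position increment it would carry. The class name, the Split stand-in and the increment of 1 for tokens coming from the tokenizer are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SymbolsFilterSimulation
{
    // Simplified stand-in for the splitting helper; assumes well-formed terms
    // such as "k-force" and only handles '-'.
    private static string[] Split(string term)
    {
        if (term.IndexOf('-') < 0)
            return new string[0];
        string[] parts = term.Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries);
        var result = new List<string> { string.Concat(parts) };
        result.AddRange(parts);
        return result.ToArray();
    }

    // Yields (term, positionIncrement) pairs in the order the filter would emit them.
    public static IEnumerable<KeyValuePair<string, int>> Emit(IEnumerable<string> inputTokens)
    {
        var queue = new Queue<string>();
        using (IEnumerator<string> input = inputTokens.GetEnumerator())
        {
            while (true)
            {
                // Phase 1: drain queued variants without advancing the position.
                if (queue.Count > 0)
                {
                    yield return new KeyValuePair<string, int>(queue.Dequeue(), 0);
                    continue;
                }

                // Phase 2: stop when the upstream tokens are exhausted.
                if (!input.MoveNext())
                    yield break;

                // Phase 3: queue the variants of the new token.
                foreach (string variant in Split(input.Current))
                    queue.Enqueue(variant);

                // Phase 4: emit the original token itself, advancing the position.
                yield return new KeyValuePair<string, int>(input.Current, 1);
            }
        }
    }
}
```

Feeding it the tokens k-force and shoes yields k-force (+1), kforce (+0), k (+0), force (+0) and shoes (+1). All the variants share the position of k-force, which is why a phrase query still sees them as a single word.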

That is all; now you should have an understanding of how a TokenFilter works. As a further extension, instead of using a method to process the token, you could inject into the constructor a service that does some more advanced processing and returns the filled queue.

  • John Smith

    i have spent days trying to figure out how to use the tokenizer and tokenfilter and i am yet to find anything useful. nobody explains absolutely anything. docs on lucene are worthless, every blog post about this is worthless.

    • Le Zuero

      I agree. I never could understanding only reading the lucene docs.