How to index mixed-language content with Lunr.js

August 9, 2015, revised April 11, 2017 Lunr.js search JavaScript

Lunr.js is a JavaScript full-text search engine that is typically used on static websites. An additional library called lunr-languages provides stemmers and stopwords for different languages.

Note: the method described in this article is now included in the lunr-languages library: indexing multi-language content.

(Stemmers are the part of the engine that reduces words to a “base” form by removing declensions, suffixes and so forth, so that “reduces” and “reducing” both become “reduce”. Stopwords are words so common that they add no value to the index, only bloat, like “and”, “a” or “the” in English.)

However, the way lunr-languages is implemented suggests a single-language content set, which my blog is not - I have articles in both English and Russian. What happens is: when you enable, for example, the Russian stemmer, the English stemmer is completely disabled and English texts are improperly indexed.

The good news is, you can reconfigure Lunr to properly index two languages - or more. The bad news is, if your other language uses the Latin alphabet, the stemmers might - in theory - get confused and produce bad results. (I don’t think that this is a problem, though - I’d love to see a real example.)

Anyways, for non-Latin languages you will have no problems whatsoever.

Implementation

Lunr processes source text through a pipeline of functions. The default pipeline looks like this:

index.pipeline.reset();
index.pipeline.add(
   lunr.trimmer,
   lunr.stopWordFilter,
   lunr.stemmer
);

If you use the Russian language plugin, it is reset to look like this:

index.pipeline.reset();
index.pipeline.add(
   lunr.ru.stopWordFilter,
   lunr.ru.stemmer
);

Let’s make them work together.

Trimming tokens correctly

Trimming is removing punctuation from the beginning and end of words.

The original trimmer looks like this:

function trimmer(token) {
  return token
    .replace(/^\W+/, '')
    .replace(/\W+$/, '');
};

As you can see, it trims everything that’s not a regex word character. And regex word characters in JavaScript (so far) only include the Latin letters, the digits and the underscore. So in fact, by default Russian words are completely wiped out while indexing:

trimmer("(hedgehog,") === "hedgehog"
trimmer("(ёжик,") === ""
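The two results above can be checked in any JavaScript console. This is a small sketch that copies the original trimmer quoted earlier and probes the `\w` class directly; the variable names are only for illustration.

```javascript
// The original trimmer, as quoted above.
function trimmer(token) {
  return token
    .replace(/^\W+/, '')
    .replace(/\W+$/, '');
}

// \w is ASCII-only ([A-Za-z0-9_]), so every Cyrillic letter counts as \W.
var latinIsWord = /^\w+$/.test('hedgehog');
var cyrillicIsWord = /\w/.test('ёжик');

// The Latin token survives, the Cyrillic one is trimmed away entirely.
var trimmedLatin = trimmer('(hedgehog,');
var trimmedCyrillic = trimmer('(ёжик,');

console.log(latinIsWord, cyrillicIsWord);   // true false
console.log(JSON.stringify(trimmedLatin));    // "hedgehog"
console.log(JSON.stringify(trimmedCyrillic)); // ""
```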

So that’s why in the Russian pipeline the trimmer is absent. But we’re going to do better and write our own trimmer, with both English and Russian letters.

function trimmerEnRu(token) {
  return token
    .replace(/^[^\wа-яёА-ЯЁ]+/, '')
    .replace(/[^\wа-яёА-ЯЁ]+$/, '');
};

I’d like to point out that the characters in this regex are, in fact, Unicode, and the ranges are properly processed by the regex engine. It’s just that the character classes don’t know about them. (Also, if you’re familiar with the Cyrillic Unicode points, you’ll know that the letter “Ё” is not contained in the main alphabet range and has to be mentioned separately.)

trimmerEnRu("(hedgehog,") === "hedgehog"
trimmerEnRu("(ёжик,") === "ёжик"

Better! By the way, on my data set, adding a Russian trimmer reduces the index size by 27%.
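As an alternative to listing alphabet ranges by hand, engines that support ES2018 Unicode property escapes can match "any letter or digit" directly. The sketch below is not the method used in this article, and the name `trimmerUnicode` is made up for the example; it assumes a runtime where `\p{L}` and `\p{N}` work with the `u` flag.

```javascript
// Alternative sketch (assumes ES2018 support for \p{...} with the u flag).
// \p{L} matches a letter in any script, \p{N} any numeric character.
function trimmerUnicode(token) {
  return token
    .replace(/^[^\p{L}\p{N}_]+/u, '')
    .replace(/[^\p{L}\p{N}_]+$/u, '');
}

console.log(trimmerUnicode('(hedgehog,')); // hedgehog
console.log(trimmerUnicode('(ёжик,'));     // ёжик
console.log(trimmerUnicode('«ёжик»'));     // ёжик
```

The trade-off is that this version accepts letters from every script, while the explicit ranges keep the trimmer limited to the two alphabets you actually index.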

Unifying the stopword filters

The stopword filter is a simple dictionary-based direct matching filter. The English and Russian implementations are exactly the same and differ only in their dictionaries, so we only need to combine the dictionaries:

lunr.stopWordFilter.stopWords =
  lunr.stopWordFilter.stopWords.union(
    lunr.ru.stopWordFilter.stopWords);

By the way, you can add more custom stop words to reduce your own data set.
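To make the idea concrete without depending on Lunr internals, here is a self-contained sketch of a dictionary-based filter built from several word lists. `makeStopWordFilter` and the three tiny lists are hypothetical and only illustrate the principle; the real dictionaries are much longer.

```javascript
// Hypothetical helper, not part of Lunr: merges any number of word lists
// into one dictionary and returns a filter function over single tokens.
function makeStopWordFilter() {
  var stopWords = new Set();
  for (var i = 0; i < arguments.length; i++) {
    arguments[i].forEach(function (word) { stopWords.add(word); });
  }
  return function (token) {
    // Returning undefined drops the token, mirroring the pipeline convention.
    if (!stopWords.has(token)) return token;
  };
}

// Tiny illustrative lists (assumptions made for this example).
var english = ['a', 'and', 'the'];
var russian = ['и', 'в', 'на'];
var custom = ['lunr']; // e.g. a word that appears on almost every page

var stopWordFilterEnRu = makeStopWordFilter(english, russian, custom);

var kept = ['the', 'hedgehog', 'и', 'ёжик', 'lunr']
  .map(function (token) { return stopWordFilterEnRu(token); })
  .filter(function (token) { return token !== undefined; });

console.log(kept); // [ 'hedgehog', 'ёжик' ]
```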

Calling all stemmers

As both English and Russian stemmers will only process words of the corresponding language, I simply used both.
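Why chaining works can be shown with a toy model. The two "stemmers" below are deliberately crude stand-ins, not the real algorithms: each strips a single suffix and, as an assumption made explicit for the illustration, checks the alphabet first. `runPipeline` is likewise a hypothetical, minimal runner.

```javascript
// Toy stand-ins, NOT the real stemmers: one suffix each, own alphabet only.
function toyStemmerEn(token) {
  return /^[a-zA-Z]+$/.test(token) ? token.replace(/ing$/, '') : token;
}

function toyStemmerRu(token) {
  return /^[а-яёА-ЯЁ]+$/.test(token) ? token.replace(/ами$/, '') : token;
}

// Minimal pipeline runner: every token passes through every stage in order;
// a stage that returns undefined removes the token.
function runPipeline(stages, tokens) {
  return tokens
    .map(function (token) {
      return stages.reduce(function (current, stage) {
        return current === undefined ? undefined : stage(current);
      }, token);
    })
    .filter(function (token) { return token !== undefined; });
}

var stemmed = runPipeline(
  [toyStemmerEn, toyStemmerRu],
  ['indexing', 'ёжиками']
);

console.log(stemmed); // [ 'index', 'ёжик' ]
```

Each token is changed by at most one of the two stages, so the order in which the stemmers are added does not matter in this model.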

Bringing it all together

Make sure you declare and register the trimmer in both your index and your search script, and patch the stopwords dictionary, too:

var trimmerEnRu = function (token) {
  return token
    .replace(/^[^\wа-яёА-ЯЁ]+/, '')
    .replace(/[^\wа-яёА-ЯЁ]+$/, '');
};

lunr.Pipeline.registerFunction(trimmerEnRu, 'trimmer-enru');

lunr.stopWordFilter.stopWords =
  lunr.stopWordFilter.stopWords.union(
    lunr.ru.stopWordFilter.stopWords);

Now, in the index script, you build the pipeline thusly:

var index = lunr(function () {
  // DON'T do this as the manual suggests
  // this.use(lunr.ru);
  this.pipeline.reset();
  this.pipeline.add(
    trimmerEnRu,
    lunr.stopWordFilter,
    lunr.stemmer,
    lunr.ru.stemmer
  );
});

And in the search script the pipeline is recreated automatically, so don’t worry, all the custom functions will be there:

var index = lunr.Index.load(serializedIndex);
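The reason the trimmer has to be registered in both scripts is easier to see with a simplified model of a label registry. This is a sketch of the idea only, not Lunr's source: the serialized pipeline stores labels, and loading looks each label up again. All function names except `trimmerEnRu` are invented for the example.

```javascript
// Simplified model (not Lunr's actual code): labels map to functions.
var registry = {};

function registerFunction(fn, label) {
  fn.label = label;
  registry[label] = fn;
}

// Serializing a pipeline keeps only the labels of its stages...
function serializePipeline(stages) {
  return stages.map(function (fn) { return fn.label; });
}

// ...so loading must find every label in the registry again.
function loadPipeline(labels) {
  return labels.map(function (label) {
    var fn = registry[label];
    if (!fn) throw new Error('Cannot load unregistered function: ' + label);
    return fn;
  });
}

function trimmerEnRu(token) {
  return token
    .replace(/^[^\wа-яёА-ЯЁ]+/, '')
    .replace(/[^\wа-яёА-ЯЁ]+$/, '');
}

registerFunction(trimmerEnRu, 'trimmer-enru');

var saved = serializePipeline([trimmerEnRu]);
var restored = loadPipeline(saved);

// A label that was never registered cannot be restored.
var failed = false;
try {
  loadPipeline(['never-registered']);
} catch (e) {
  failed = true;
}

console.log(saved);                        // [ 'trimmer-enru' ]
console.log(restored[0] === trimmerEnRu);  // true
console.log(failed);                       // true
```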

Try it out

By the way, all of this is implemented in my site-wide search - go try it out at the top of the page.
