
MIT Proves that Sitecore Search is Optimal by Derek Hunziker

The title of this post is a bit of a stretch. Ok, ok – so maybe it's a shameless attempt to get your attention. Nevertheless, there is a shred of truth to it, and it did get your attention, right? Allow me to explain.

Sometimes, old ways are the best

Yesterday, researchers from MIT announced that, in all likelihood, the basic algorithm for determining similarity between two strings is pretty much as good as it will ever get. The algorithm, originally developed in 1965 by Vladimir Levenshtein, is commonly referred to as the Levenshtein edit distance.

Levenshtein distance is obtained by finding the cheapest way to transform one string into another. In simplest terms, it is a count of how many edits (i.e. insertions, deletions or substitutions) are needed to get the job done. For example, the distance between “elude” and “allude” is 2.

  1. elude -> aelude (insertion of “a”)
  2. aelude -> allude (substitution of “l” for “e”)

Edit distance is used in many real-world applications, such as comparing human genomes, processing natural language, and correcting spelling errors. There are also many variations on the algorithm with varying degrees of efficiency, such as Wagner-Fischer and Damerau-Levenshtein; however, the underlying concept has remained the same for half a century.
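
If you'd like to see the concept in code, here is a minimal sketch of the Wagner-Fischer dynamic-programming approach in C#. This isn't Lucene.Net's internal implementation, just an illustration of how the edit count is computed:

using System;

static class EditDistance
{
    // Wagner-Fischer: dist[i, j] holds the Levenshtein distance between
    // the first i characters of a and the first j characters of b.
    public static int Levenshtein(string a, string b)
    {
        var dist = new int[a.Length + 1, b.Length + 1];

        for (var i = 0; i <= a.Length; i++) dist[i, 0] = i; // i deletions
        for (var j = 0; j <= b.Length; j++) dist[0, j] = j; // j insertions

        for (var i = 1; i <= a.Length; i++)
        {
            for (var j = 1; j <= b.Length; j++)
            {
                var cost = a[i - 1] == b[j - 1] ? 0 : 1;
                dist[i, j] = Math.Min(
                    Math.Min(dist[i - 1, j] + 1,   // deletion
                             dist[i, j - 1] + 1),  // insertion
                    dist[i - 1, j - 1] + cost);    // substitution (or match)
            }
        }

        return dist[a.Length, b.Length];
    }
}

Calling EditDistance.Levenshtein("elude", "allude") returns 2, matching the example above.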

Are things getting fuzzy?

So how does this relate to Sitecore, you ask? The answer lies in Sitecore’s ability to perform “fuzzy” content searches. A fuzzy search tolerates some degree of mistakes in the search term while still returning the intended results. This is especially handy in the context of online search applications, where mistakes are frequently made. Take, for example, a misspelled search for “knowlege base”. Under the hood, Sitecore’s default search provider (Lucene.Net) will use Levenshtein distance to find matches above a minimum level of similarity of your choosing. In the example below, I’ve specified a minimum of 0.7 on a scale where 1.0 is an exact match and 0.0 is the farthest from one.

using System;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.Linq;
using Sitecore.ContentSearch.SearchTypes;

// Open a search context against the web index and run a fuzzy query.
var index = ContentSearchManager.GetIndex("sitecore_web_index");
using (var context = index.CreateSearchContext())
{
    // Like() performs a fuzzy match; 0.7f is the minimum similarity,
    // so the misspelled "Knowlege" can still match "Knowledge".
    var result = context.GetQueryable<SearchResultItem>()
        .Where(sri => sri.Name.Like("Knowlege base", 0.7f))
        .FirstOrDefault();

    Console.WriteLine(result);
}
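
How does a 0.7 threshold translate into allowed mistakes? Roughly speaking, classic Lucene fuzzy matching derives similarity from the edit distance by scaling it against the length of the shorter term. The sketch below shows the idea, reusing the Levenshtein method from earlier; it's an approximation for intuition, not Lucene.Net's exact internals:

using System;

// Approximate, classic Lucene-style similarity: 1 minus the edit distance
// scaled by the length of the shorter string (a sketch for intuition only).
static float Similarity(string term, string target)
{
    var distance = EditDistance.Levenshtein(term, target);
    return 1f - distance / (float)Math.Min(term.Length, target.Length);
}

Here, Similarity("Knowlege", "Knowledge") is 1 - 1/8 = 0.875, comfortably above the 0.7 minimum, so the misspelling still matches.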

Something’s fishy about all this

One of my recent projects was implementing search for SeafoodWatch.org. In retrospect, it was one of the most fulfilling projects I’ve ever worked on. The Monterey Bay Aquarium’s SeafoodWatch program is not only a great organization with great people and a great cause; they also care deeply about your ability to find and learn about sustainable seafood.

At the start of the project, we reviewed the top searches from their existing website and found that many of the search terms were, in fact, misspelled. It was therefore crucial that we return the best possible matches to our visitors, not just results for exact queries.

This was all made possible by a 50-year-old concept that still holds up to this day. Have you used fuzzy search in any of your projects? If so, I’d love to hear about it in the comments. Cheers!