Advent Tip 20 – Sitecore & Lucene search auto-complete using NGram

This article looks at how to implement a clean auto-complete search with Sitecore.ContentSearch in combination with Lucene. It relies on n-grams, which ensure good performance. Because there is hardly any helpful material on this topic online, the rest of the article is written in English. Let’s get started.

This article is for all of you who are stuck with Lucene and want to implement a proper auto-complete search with Sitecore. A quick note before we start: I still highly recommend using Solr as your search engine. It makes life much easier, especially in terms of the maintainability of your application and the additional features it brings to the table. Ehab ElGindy wrote a nice and simple tutorial on how to implement an auto-complete search with Sitecore and Solr. Most of the following content is based on his writing, so please check it out.

As you probably all know, wildcard queries are evil. This applies to good old SQL as well as to Lucene and Solr (which is based on Lucene). Besides the fact that you might get odd search results, the more important point is that, depending on what your index looks like, you will almost always pay a performance penalty. This is where n-grams come into play: they let you perform wildcard-like searches at a much better speed. For more information on how n-grams work, please follow this article. With Solr you can define a new field type in the schema.xml and set your preferred tokenizers and filters as you like. As you might have noticed in Ehab ElGindy’s article, you always define one analyzer that is executed at index time and one that is executed at query time. In most cases both analyzers use the same tokenizers and filters. In our case, however, we only want to apply n-gramming at index time. If you ran Ehab ElGindy’s search with the term "sitecore rocks" as your query parameter, Lucene would perform the following query:

smartsearch_ac:sitecore smartsearch_ac:rocks

If we enabled n-gramming at query time under the same circumstances, the resulting Lucene query would look like this:

smartsearch_ac:"sit ite site tec itec sitec eco teco iteco siteco cor ecor tecor itecor sitecor ore core ecore tecore itecore sitecore roc ock rock cks ocks rocks"
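To make the blow-up concrete, here is a minimal, language-neutral sketch in Python (not Lucene or Sitecore code) that reproduces the expansion shown above. It assumes a minimum gram length of 3 and no upper limit, which is what the example query suggests; the function names are made up for illustration.

```python
def ngrams(token, min_gram=3):
    """All substrings of at least min_gram characters, ordered by end position."""
    grams = []
    for end in range(min_gram, len(token) + 1):
        # For a fixed end position, emit the shortest gram first, then grow to the left.
        for start in range(end - min_gram, -1, -1):
            grams.append(token[start:end])
    return grams


def expand_query(field_name, text):
    """Hypothetical helper: what an n-gramming query analyzer would turn the input into."""
    grams = []
    for word in text.lower().split():
        grams.extend(ngrams(word))
    joined = " ".join(grams)
    return f'{field_name}:"{joined}"'


print(expand_query("smartsearch_ac", "sitecore rocks"))
print(len(ngrams("sitecore")) + len(ngrams("rocks")), "terms for a two-word input")
```

Two short words already produce 27 terms, and the count grows roughly quadratically with word length.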

This is essentially why we never want to use n-gramming at query time. The problem is that, with Lucene as our search provider, we don’t have a nice schema.xml where we can configure indexing and querying behaviour separately. Sitecore.ContentSearch has quite decent configuration options, but we can only assign one analyzer per field, and it makes no distinction between indexing and querying. Therefore we need to use some tricks in order to achieve our goal.

1. Create your n-gram analyzer
We need to create an analyzer which builds those n-grams. Do not use the Sitecore.ContentSearch.LuceneProvider.Analyzers.NGramAnalyzer for auto-complete search, as it creates shingles, which are needed in other use cases. Please see the following article if you are interested in that.
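The difference between the two techniques can be sketched in a few lines of Python. This is an illustration of the concepts only, not the analyzer itself: the class name EdgeNgramAnalyzer used in step 3 suggests edge n-grams (prefixes of each word), and the gram sizes and function names below are assumptions.

```python
def edge_ngrams(token, min_gram=3, max_gram=15):
    """Prefixes of a token, from min_gram up to max_gram characters (sizes are assumed)."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def analyze_for_index(text):
    """Index-time analysis sketch: lowercase, split on whitespace, emit edge n-grams."""
    terms = []
    for token in text.lower().split():
        terms.extend(edge_ngrams(token))
    return terms


def shingles(text, size=2):
    """Word-level n-grams (shingles): runs of adjacent words, not character prefixes."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


print(analyze_for_index("Sitecore rocks"))
print(shingles("Sitecore rocks with Lucene"))
```

Edge n-grams let a partially typed word such as "sitec" match an indexed term directly, whereas shingles only help with matching sequences of whole words.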

2. Use a Computed field to create a copy of the title using the new data type
Our computed field is mostly the same as Ehab ElGindy’s. I updated it for Sitecore 8.x. This step is optional: you can also reference a Sitecore field directly in step 3.

<fields hint="raw:AddComputedIndexField">
    <field fieldName="smartsearch">YourNamespace.AutoCompleteTitle, YourAssembly</field>
</fields>

Add your computed field to the Sitecore configuration.

3. Configure your n-gram analyzer

<fieldNames hint="raw:AddFieldByFieldName">
    <field fieldName="smartsearch" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">
        <analyzer type="YourNamespace.EdgeNgramAnalyzer, YourAssembly" />
    </field>
</fieldNames>

Add a field with the same name as your computed field to the AddFieldByFieldName section of the configuration file and assign our EdgeNgramAnalyzer to that field.

4. Prevent n-gramming in your search
Up to this point we have used well-known features of Sitecore.ContentSearch, and the indexing part is done. Now we need to prevent n-gramming in our searches. To do that, we are going to work with the search provider itself. Back when Sitecore 7 was released, Sitecore talked about how you can get access to their search provider. It is a powerful way to enhance the capabilities of Sitecore.ContentSearch. I took this approach and created a generic extension method which you can use in your search code.

The interesting thing to examine here is the parameter searchBehaviour. It lets you pass a delegate which takes a Lucene query object. This object is created by Sitecore’s search provider, and you have the chance to extend or alter it using native Lucene API code. In the next step we will see what such a delegate might look like.
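The hook pattern itself is independent of Sitecore and can be sketched as follows. This is a toy Python model, not the extension method: Query, get_results and add_autocomplete_clauses are hypothetical stand-ins, and the "_language" clause is only an illustrative example of something the provider might have generated.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Query:
    """Toy stand-in for a Lucene boolean query: a list of (field, term) clauses."""
    clauses: List[Tuple[str, str]] = field(default_factory=list)


def get_results(provider_query: Query,
                search_behaviour: Callable[[Query], Query]) -> Query:
    """Hypothetical counterpart of the extension method: the caller may alter the
    provider's query before it is executed (execution is omitted in this sketch)."""
    return search_behaviour(provider_query)


def add_autocomplete_clauses(user_input: str) -> Callable[[Query], Query]:
    """Build a delegate that appends plain (not n-grammed) terms for the smartsearch field."""
    def behaviour(query: Query) -> Query:
        extra = [("smartsearch", word) for word in user_input.lower().split()]
        return Query(query.clauses + extra)
    return behaviour


provider_query = Query([("_language", "en")])  # illustrative provider-generated clause
final = get_results(provider_query, add_autocomplete_clauses("Sitecore rocks"))
print(final.clauses)
```

The point of the design is that the provider keeps building the query as usual, while the n-gram field is handled by code that bypasses the field's configured analyzer.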

5. Write the search query
The Search method looks like ordinary code we would write to implement a simple search. However, instead of the parameterless GetResults method we use our extension method from step 4. In the CreateQuery method we implement a function which creates a query for our computed field "smartsearch". Notice that we use the StandardAnalyzer here in order to tokenize multi-word strings. Afterwards we add our query to the one that was created by Sitecore’s search provider. In practice you want to use the standard LINQ methods in combination with the PredicateBuilder for non-n-gram fields and the extension method’s function parameter for n-gram fields.
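The overall effect of steps 1 to 5 (n-grams at index time, plain tokens at query time) can be illustrated with a tiny in-memory index. This is a Python sketch under stated assumptions, not the Sitecore implementation: the gram sizes, the sample titles and the choice to require every typed word to match (AND) are all made up for the example.

```python
from collections import defaultdict


def edge_ngrams(token, min_gram=3, max_gram=15):
    """Prefixes of a token between min_gram and max_gram characters (assumed sizes)."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def build_index(docs):
    """Index time: every edge n-gram of every word becomes a term pointing at the document."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for word in title.lower().split():
            for gram in edge_ngrams(word):
                index[gram].add(doc_id)
    return index


def search(index, user_input):
    """Query time: only lowercase and split; each typed word is looked up as an exact term."""
    words = user_input.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)  # assumption: all words must match


docs = {1: "Sitecore rocks", 2: "Sitecore and Lucene", 3: "Rocket science"}
index = build_index(docs)
print(search(index, "site roc"))  # documents whose title has words starting with both inputs
print(search(index, "roc"))
```

Each typed word costs one exact term lookup, which is why this approach avoids the performance penalty of wildcard queries.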
