Search on djangosnippets

Users of djangosnippets may have noticed the
addition of a few search-related features over the past several months. I’d
like to highlight some of the additions that have been made and show how you
can implement similar functionality on your sites. All of djangosnippets’
search leans on Apache Solr, a powerful search
engine built on top of Apache Lucene, and on Haystack. Haystack is the search
solution for Django apps: it provides a querying interface similar to Django’s
ORM, handles indexing your models for you, and supports advanced features like
“more-like-this” and faceting.

Getting set up (angle brackets, anyone?)

I’ve actually written another post
on setting up multi-core Solr on Ubuntu 10.04. I got a bit of flak for using
tomcat6 as the server – you can definitely go with jetty
instead. Jetty is bundled with Solr; check out the examples/README.txt to get
started quickly.

When setting up search with haystack, there are two important configuration
files to be aware of:

  • schema.xml
  • solrconfig.xml


The Solr schema (schema.xml) is only superficially analogous to a database
schema (if your database were just one big freaking table). It does a whole lot
more than a database schema, allowing you to configure how individual fields are
tokenized, filtered, stored, and searched. There is a high degree of
configurability, so if your needs go beyond a basic “site search” I’d recommend
the book Solr 1.4 Enterprise Search Server. I’m only 5 chapters in and it’s
already pretty much blown my mind. Luckily, haystack will generate this file
automatically, allowing you to get up and running quickly.


I have not gone too deep into solrconfig.xml, but it is where you can configure
things like caching, more-like-this support, spell check, and highlighting. It
also gives you a whole bunch of knobs for configuring the inner workings of the
indexing and querying facilities.

a final word on getting solr running

The Seven Deadly Sins of Solr

I am still very much a n00b when it comes to Search and am probably doing more
than a few things wrong. Any helpful suggestions would be appreciated!

(in fact, the search engine for djangosnippets is running on a 10-year-old
pentium iii laptop. the three hours last week where search was down? I was
rearranging my room)

Site Search

The first search-related feature I’ll discuss is the site search. The first
step was getting haystack installed and creating a SearchIndex for the snippets.
The SearchIndex usually mirrors the model to some extent, although if you
plan on indexing more than a couple models you may want to pick some field-naming
conventions to keep the number of different fields in your Solr index small.

from haystack.indexes import *
from haystack import site
from cab.models import Snippet

class SnippetIndex(SearchIndex):
    text = CharField(document=True, use_template=True)
    author = CharField(model_attr='author__username')
    title = CharField(model_attr='title')
    tags = CharField()
    tag_list = MultiValueField()
    language = CharField(model_attr='language__name')
    pub_date = DateTimeField(model_attr='pub_date')
    django_version = FloatField(model_attr='django_version')
    bookmark_count = IntegerField(model_attr='bookmark_count')
    rating_score = IntegerField(model_attr='rating_score')
    url = CharField(indexed=False)

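    def prepare_tags(self, obj):
        # Space-separated tag names, searchable as a single text value.
        return ' '.join([tag.name for tag in obj.tags.all()])

    def prepare_tag_list(self, obj):
        # One entry per tag for the multi-valued field.
        return [tag.name for tag in obj.tags.all()]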

    def prepare_url(self, obj):
        return obj.get_absolute_url()

    def get_updated_field(self):
        return 'updated_date'

site.register(Snippet, SnippetIndex)

There’s a lot of stuff in there, but the two more interesting bits are the first
and last fields. The text field is the default search field and is generated
by rendering the search/indexes/cab/snippet_text.txt template. Peeking at
the schema.xml, this field is getting tokenized and filtered:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="text" type="text" indexed="true" stored="true" multiValued="false" />

This field is the heart of the index and is queried whenever a field is not
explicitly specified. Check out Analyzers, tokenizers and filters
if you’re interested in reading up on what these various bits do.
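
To get a feel for what the index-time chain does to a piece of text, here is a rough pure-Python imitation of three of its steps: the whitespace tokenizer, the stop filter, and a much-simplified word delimiter followed by lowercasing. This is only a sketch and not how Solr implements any of it. The `analyze` function and the tiny `STOPWORDS` set are made up for illustration, and the concatenated forms that catenateWords produces, the Porter stemmer, and duplicate removal are left out entirely.

```python
import re

# Hypothetical stand-in for stopwords.txt; the real list lives in the Solr conf dir.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def analyze(text, stopwords=STOPWORDS):
    """Approximate the index analyzer: tokenize, drop stopwords, split, lowercase."""
    # WhitespaceTokenizerFactory: split on whitespace only.
    tokens = text.split()
    # StopFilterFactory with ignoreCase="true": drop stopwords regardless of case.
    tokens = [t for t in tokens if t.lower() not in stopwords]
    # Simplified WordDelimiterFilterFactory: split on non-alphanumeric characters.
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[^A-Za-z0-9]+", token) if p)
    # LowerCaseFilterFactory.
    return [p.lower() for p in parts]


print(analyze("The Django get_absolute_url helper"))
```

The point is that a search for "absolute" can hit a snippet whose text only contains get_absolute_url, because the term that ends up in the index is the processed one, not the raw one.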

The last field is unique because, as you can see in the field definition, it is
setting indexed=False. This yields the following line in the autogenerated
schema.xml:

<field name="url" type="string" indexed="false" stored="true" multiValued="false" />

Because the field is not indexed it cannot be queried directly, but it will be
returned as a part of the search results, effectively allowing me to save a database
query when generating a link.

Indexing and Searching

Once the schema and SearchIndex are in place, I can index all the snippets
in the database by running rebuild_index. When I want to
update the index, I run update_index --age=[age in hours].
To get closer to real-time results try the Real-Time SearchIndex
that comes with haystack.
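
As far as I understand it, the age option works together with get_updated_field: only rows whose updated_date falls inside the last N hours get reindexed. Here is a plain-Python sketch of that time window. The `updated_within` helper and the dict-based records are hypothetical and are not haystack code.

```python
from datetime import datetime, timedelta


def updated_within(records, age_hours, now=None, field="updated_date"):
    """Return the records whose timestamp falls inside the last `age_hours` hours."""
    now = now or datetime.now()
    cutoff = now - timedelta(hours=age_hours)
    return [r for r in records if r[field] >= cutoff]


now = datetime(2010, 8, 1, 12, 0)
records = [
    {"id": 1, "updated_date": datetime(2010, 8, 1, 11, 0)},
    {"id": 2, "updated_date": datetime(2010, 7, 30, 9, 0)},
]
# Only the snippet touched within the last 24 hours would be reindexed.
print([r["id"] for r in updated_within(records, 24, now=now)])
```

Running the update from cron with an age slightly larger than the cron interval keeps the window from missing rows between runs.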

A basic search view can lean on haystack’s default. Here is the line from my urlconf:

url(r'^search/$', 'haystack.views.basic_search', name='cab_search'),

Advanced Search

To get advanced search going, I subclassed SearchForm, added the fields I needed
and then basically did a shitload of filtering.

from django import forms
from django.contrib import admin
from haystack.forms import SearchForm

# Language and DJANGO_VERSIONS are defined elsewhere in the app.
class AdvancedSearchForm(SearchForm):
    language = forms.ModelChoiceField(queryset=Language.objects.all(), required=False)
    django_version = forms.MultipleChoiceField(choices=DJANGO_VERSIONS, required=False)
    minimum_pub_date = forms.DateTimeField(widget=admin.widgets.AdminDateWidget,
        required=False)
    minimum_bookmark_count = forms.IntegerField(required=False)
    minimum_rating_score = forms.IntegerField(required=False)

    def search(self):
        # First, store the SearchQuerySet received from other processing.
        sqs = super(AdvancedSearchForm, self).search()

        if self.cleaned_data['language']:
            sqs = sqs.filter(language=self.cleaned_data['language'].name)

        if self.cleaned_data['django_version']:
            sqs = sqs.filter(django_version__in=self.cleaned_data['django_version'])

        if self.cleaned_data['minimum_pub_date']:
            sqs = sqs.filter(pub_date__gte=self.cleaned_data['minimum_pub_date'])

        if self.cleaned_data['minimum_bookmark_count']:
            sqs = sqs.filter(bookmark_count__gte=self.cleaned_data['minimum_bookmark_count'])

        if self.cleaned_data['minimum_rating_score']:
            sqs = sqs.filter(rating_score__gte=self.cleaned_data['minimum_rating_score'])

        return sqs

The relevant line in the urlconf looks like:

from haystack.views import SearchView, search_view_factory

from cab.forms import AdvancedSearchForm

url(r'^search/advanced/$', search_view_factory(
    view_class=SearchView,
    form_class=AdvancedSearchForm,
), name='cab_search_advanced'),

More like this

Haystack provides support for more-like-this out of the box. I needed to add
one line to my solrconfig.xml to enable MLT:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />

Make sure that line is present (or uncommented) and you’re good to go. I wrote
a short filter for use in the template:

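from django import template
from haystack.query import SearchQuerySet

register = template.Library()

@register.filter
def more_like_this(snippet, limit=None):
    # Ask the search backend for documents similar to this snippet.
    sqs = SearchQuerySet().more_like_this(snippet)
    if limit is not None:
        sqs = sqs[:limit]
    return sqs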

Haystack ships with a templatetag that offers a good deal more options.

This definitely qualifies as low-hanging fruit once you’ve got the initial pieces
in place and can really add a lot of value to your site. One of the problems
I often have with djangosnippets is that I get a lot of old content that’s been
upvoted to hell but there’s actually a newer, cooler version out there. MLT
is pretty good at finding these newer snippets.


Autocomplete

Arguably, the feature I’m most excited about is Solr’s ability to do
autocompletion. Out of the box it’s possible to do wildcard searches but this
approach does not scale. It’s better to use the NGram filter, which I’ve
wrapped up as a custom fieldType in my schema.xml:

<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100"
  stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="3"
      maxGramSize="15" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
Then, I declare a title_ngram field and copy in the value of the title field:

<field name="title_ngram" type="ngram" />


<copyField source="title" dest="title_ngram" />
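
If it helps to see why this works, here is a small pure-Python sketch of what the NGram filter stores at index time and why a partial phrase then matches with a plain term lookup. The `ngrams` and `matches` helpers are made up for illustration, handle a single-word query only, and are not part of Solr or haystack. The sizes mirror the minGramSize and maxGramSize values used above.

```python
def ngrams(token, min_size=3, max_size=15):
    """All substrings of `token` whose length is between min_size and max_size."""
    token = token.lower()
    grams = set()
    for size in range(min_size, min(max_size, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


def matches(title, partial):
    """True if the lowercased partial query equals one of the title's grams."""
    partial = partial.lower()
    return any(partial in ngrams(word) for word in title.split())


print(sorted(ngrams("Django")))
print(matches("Django pagination helper", "pagin"))
```

Because the query analyzer only lowercases and does not split the input into grams, the user's partial phrase has to equal a stored gram exactly, which is also why anything shorter than three characters can never match.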

To get the results back out, it’s just a matter of querying the title_ngram
field with the user’s partial phrase:

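import json

from django.http import HttpResponse
from haystack.query import SearchQuerySet

def autocomplete(request):
    q = request.GET.get('q') or ''
    results = []
    # Queries shorter than the 3-character minGramSize cannot match any gram.
    if len(q) > 2:
        sqs = SearchQuerySet()
        result_set = sqs.filter(title_ngram=q)[:10]
        for obj in result_set:
            results.append({
                'title': obj.title,
                'url': obj.url
            })
    # Django 1.7 and later spell this argument content_type instead of mimetype.
    return HttpResponse(json.dumps(results), mimetype='application/json')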


Hope you found this post informative! There’s a ton of interesting things that
Solr can do and Haystack provides a nice wrapper around the most common features.
As always, any comments, feedback, suggestions, errata, etc are appreciated.
