Generating aggregate data across generic relations

Edit: I’ve created a github repo for performing generic aggregation and annotation based on the code from this entry

Aggregation support was added to django’s ORM in version 1.1, allowing you to generate Sums, Counts, and more without having to write any SQL.

According to the docs aggregation is not supported for generic relations. This entry describes how to work around this using the .extra() method.

The state of the art

To take an example from the docs, it is possible to span relationships when performing aggregations:

>>> Store.objects.annotate(min_price=Min('books__price'), max_price=Max('books__price'))

Here we are querying the Store object and annotating the result set with two extra attributes, ‘min_price’ and ‘max_price’, which contain the minimum and maximum price of books that are sold at that store. Conversely, if we want to find the minimum and maximum book price over the entire queryset, you would write:

>>> Store.objects.aggregate(min_price=Min('books__price'), max_price=Max('books__price'))

The aggregate() method returns a dictionary as opposed to a queryset. This is an incredibly clean API!

Suppose you want to aggregate across a GFK

This is tricky. A generic foreign key is comprised of two attributes, a ContentType and a foreign key. The models that are GFKed to do not contain a reverse relationship to the GFK model by default. This is a little obscure, but basically it means that you can create a django Comment on any object since it supports generic relations but you can’t go from any model to their associated comments (without creating a reverse generic relation).

So assume you have a simple weblog and would like to sort entries by which have the most comments. Unfortunately, this does not work:

class Entry(models.Model):
    ...
    comments = generic.GenericRelation(Comment) # reverse generic relation
    ...


Entry.objects.annotate(count=Count('comments')) # does not work!

There’s a ticket for this, but it is marked for the 1.3 milestone. So how I solved this in the meantime was to use the extra method. Here’s what it might look like:

from django.db.models import Count
from django.contrib.comments.models import Comment
from django.contrib.contenttypes.models import ContentType
from blog.models import Entry

ctype = ContentType.objects.get_for_model(Entry)
qs = Entry.objects.extra(select={
    'count': """
        SELECT COUNT(*) AS comment_count
        FROM django_comments
        WHERE
            content_type_id=%s AND
            object_pk=CAST(blog_entries.id as text)
        """
    },
    select_params=[ctype.pk],
    order_by=['-count'])

This code essentially performs a subquery for every entry returned, which calculates the number of comments associated with that entry. Two ‘magical’ things are happening here:

  1. Inside the CAST() we’re referring to blog_entries.id — this is a field that is being retrieved by default, outside of the inner query.
  2. The order_by is inside the extra() — note how the COUNT() function is returning its result as “comment_count”, but the select dictionary is keyed using “count”. It is the key we specify that can be used for custom ordering, as opposed to whatever is used in the query.

Genericizing it

def generic_annotate(queryset, gfk_field, aggregate_field, aggregator=models.Sum, desc=True):
    ordering = desc and '-score' or 'score'
    content_type = ContentType.objects.get_for_model(queryset.model)

    # collect the params we'll be using
    params = (
        aggregator.name, # the function that's doing the aggregation
        aggregate_field, # the field containing the value to aggregate
        gfk_field.model._meta.db_table, # table holding gfk'd item info
        gfk_field.ct_field, # the content_type field on the GFK
        content_type.pk, # the content_type id we need to match
        gfk_field.fk_field, # the object_id field on the GFK
        queryset.model._meta.db_table, # the table and pk from the main
        queryset.model._meta.pk.name   # part of the query
    )

    queryset = queryset.extra(select={
        'score': """
            SELECT %s(%s) AS aggregate_score
            FROM %s
            WHERE
                %s_id=%s AND
                %s=%s.%s
        """ % params
    },
    order_by=[ordering])

    return queryset
>>> from misc import generic_annotate
>>> from blog.models import Entry
>>> from tagging.models import TaggedItem
>>> from django.db.models import Count
>>> qs = generic_annotate(Entry.objects.all(), TaggedItem.object, 'id', Count)
>>> qs[0].score
5L

>>> qs[1].score
4L

>>> qs[1].tags
u'databases django many-to-many python'

Note that this example works for situations in which you’d use annotate(), but it doesn’t allow you to mimic aggregate(), which is useful for generating summary-type data about items in a queryset. This is not too difficult to accomplish – simply invert the logic a bit so that we generate the aggregate on the outside and the result set on the inside:

def generic_aggregate(queryset, gfk_field, aggregate_field, aggregator=models.Sum):
    content_type = ContentType.objects.get_for_model(queryset.model)

    queryset = queryset.values_list('pk') # just the pks
    sql = queryset.query.as_sql()[0] % queryset.query.as_sql()[1]

    # collect the params we'll be using
    params = (
        aggregator.name, # the function that's doing the aggregation
        aggregate_field, # the field containing the value to aggregate
        gfk_field.model._meta.db_table, # table holding gfk'd item info
        gfk_field.ct_field, # the content_type field on the GFK
        content_type.pk, # the content_type id we need to match
        gfk_field.fk_field, # the object_id field on the GFK
        sql
    )

    query = """
            SELECT %s(%s) AS aggregate_score
            FROM %s
            WHERE
                %s_id=%s AND
                %s IN (
                    %s
                )
        """ % params

    cursor = connection.cursor()
    cursor.execute(query)
    row = cursor.fetchone()

    return row[0]
>>> from misc import generic_aggregate
>>> from blog.models import Entry
>>> from tagging.models import TaggedItem
>>> from django.db.models import Count
>>> qs = generic_aggregate(Entry.objects.all(), TaggedItem.object, 'id', Count)
106L # the total number of times a tag was added to an entry

More!

Original Source

Leave a comment