Sam Hooke

Django REST framework performance (part 2: techniques)

All notes in this series:

This is part 2 in a series of notes about how to improve the performance of Django REST framework. I plan to publish a new note each Friday.

If you haven’t already, I’d suggest starting by first reading part 1 on profiling.

Performance improvement techniques §

Let’s run through some things we can change that might help improve performance! 🔧

Take time to understand your database §

I’ve said it before and I’ll say it again: regardless of whether your database is MySQL, PostgreSQL, SQLite, or a sack of potatoes, your ability to optimise the performance of Django REST Framework will always be limited by your understanding of your database.

For databases in general, I cannot recommend enough Designing Data-Intensitve Applications by Martin Kleppmann. It introduces many concepts relating to databases in a good amount of detail, and provides a huge amount of references for further reading1.

As well as covering database concepts in general, it’s worth digging deeper into the specific database you are using.

For MySQL specifically, I found the book High Performance MySQL invaluable in improving my understanding of how MySQL server operates, and how to improve performance for MySQL specifically. Much of what I learnt is from this book. The book actually cautions against following certain advice that is often repeated online, on Q&A sites, and on blogs. So don’t take my word for it!

Avoid unnecessary counts §

Counting large tables is slow. Anecdotally, it can take multiple seconds to count a table with a few million rows.

Django REST Framework pagination includes a count field by default. If you do not need this, removing it can improve performance. To remove the count field, use a custom paginator, but do not include "count" in the response.

The Django REST Framework paginator inherits from the Django pagniator class, which has a small but important note mentioning how the self.count attribute works. It’s one of these sneaky attributes that has the potential to perform a lot of work under-the-hood. Here’s a starting point on how to remove the count.

If you want the count field, but do not need it to be precise, consider using an approximate count, which is much faster. MySQL keeps an approximate count for each table in the information_schema table. Remember, even Google display an approximate count for their search results. Here’s an example of doing an approximate count with PostgreSQL.

Avoid “N+1” queries §

N+1 queries are when you execute a single query (the 1), then for each result in that query you execute another query (the N). Typically the N queries are all quite similar, or even identical, and so can be reduced.

When using the Django Debug Toolbar N+1 queries are quite obvious. Helpful, Django Debug Toolbar is quite good at spotting them automatically, and will mark them with bold text saying “X similar queries”, “Duplicated Y times”. At the very top it will provide a summary saying “N queries including X similar and Y duplicates”.

In general, duplicate queries are a bad thing, and it should be possible to eliminate them entirely. The “similar” queries can be a little tricker, but are a good candidate for focusing effort on simplifying and combining.

Django’s prefetch_related() can be used to help solve these N+1 queries. Specifically, in Django REST Framework this is done by modifying the ViewSet classes so that the queryset uses prefect_related() to list all foreign keys that view uses. For example:

queryset = MyModel.objects.prefetch_related("foreign_model_1", "foreign_model_2", "foreign_model_3")

It will be obvious if this is done correctly, since in Django Debug Toolbar it the number of similar and/or duplicate queries will be reduced or eliminated entirely.

Note that prefetch_related() is subtly different to the similarly named method selected_related(). Crucially, prefect_related() performs the joining in Python while select_related() performs the joining in MySQL.

A common rule of thumb is to do as much work as possible in the database, rather than pulling lots of data into the application and performing the work there. This is often true, but not always! As a recent example, I compared using prefetch_related() with selected_related() on a particular endpoint, and prefetch_related() was significantly faster even though it resulted in a handful of queries rather than just one. This underlines the importance of profiling.2

Implement nested fields efficiently (or, avoid “N+1” queries, again) §

A common source of N+1 queries can be custom Django REST Framework serializers which override the to_representation() method. Since to_representation() is called for each row being converted from an object to JSON, any queryset requests here become an N+1 issue, because it causes a query to be fired off for each row.

Frequently, the reason for overriding to_representation() is to include other related models in the JSON. In most cases this can be much more efficiently implemented by using nested serializers, and making sure to use prefetch_related() in the corresponding ViewSet.

Enable MySQL to use Index Scan for Sorting §

This is a good example of a counter-intuitive performance improvement: some queries can be made faster by doing “more” ordering. In short, adding order_by() calls to the ViewSet queryset chain can have a big performance improvement, so long as it is all done in the right way. If done correctly, this allows MySQL to fetch the data directly from the index, rather than using the index to lookup corresponding rows in the database.

It’s hard to explain this succinctly better than the book3:

MySQL can use the same index for both sorting and finding rows. If possible, it’s a good idea to design your indexes so that they’re useful for both tasks at once.

The additional ORDER BY queries do not actually cause MySQL to do more ordering, but allow MySQL to take full advantage of the existing ordering of the index, and to fetch the requested rows at the same time. So long as all requested rows are in the index, and the ordering is all the same (e.g. all DESC), and the order of the ORDER BY queries matches the order of the indexes (assuming you have multiple indexes), then

You can use SHOW CREATE TABLE <table>\G to confirm the index order.

As an example, given the model:

from django.db import models

class MyModel(models.Model)
	# NOTE: Also has implicit "id" field with index enabled.
    timestamp = models.DateTimeField(db_index=True)

To make use of index scan for sorting, on a Django REST Framework ViewSet this might look like:

from rest_framework import viewsets

class MyViewSet(viewsets.ModelViewSet):
	queryset = MyModel.objects.order_by("-timestamp", "-id")
	#serializer_class = MySerializer
	#filterset_class = MyFilter
	#http_method_names = ["head", "options", "get"]

This change can have a surprisingly big performance improvement, but it requires careful collaboration between setting up the indexes and listing the necessary order_by() arguments in the right order. For more information, see “covering index” and this answer.

Avoid case insensitive searching §

In Django REST Framework filters, avoid using case insensitive string matching such as lookup_expr="iexact". This causes MySQL to use LIKE. If possible, use lookup_expr="exact", which uses the significantly faster = comparison in MySQL, but is case sensitive.

Filter by integer, not string §

Ideally, avoid doing string comparison at all, and prefer doing integer comparison when filtering.

This does not mean the REST API endpoint filters must all take integers instead of strings. The endpoints can accept filters as strings, but if the string is effectively an enumeration, then internally this should be mapped to an integer for the database query.

For example, storing integers which map onto strings (by using Django’s models.IntegerChoices enumeration types), instead storing the strings themselves (by using models.CharField), may be more efficient.

Remove unnecessary indexes and filtering §

Naively creating more and more indexes in an effort to make things faster is not the best idea; this results in a larger index. A smaller index allows MySQL to keep more of the index in memory, and reduces the amount of work required during a write.

This is done by removing db_index=True on the relevant fields the Django model.

What next? §

Consider these techniques as a starting point. For some techniques I gave more detail than others, so if they sound applicable, it’s probably worth doing further research.

The next note will give more details on a very specific and commonplace feature… a feature of practically every single Django REST Framework application[citation needed], and one that will appear to be working just fine at the start of a project, yet can easily end up becoming very slow and inefficient as your database grows: pagination!

  1. For example, I was working on a web application which required calculating percentiles. In Designing Data-Intensive Applications (Chapter 1, section “Percentiles in Practice”), there are a few paragraphs explaining how the naïve implementation requires sorting all the data to calculate the percentiles (which can become quite slow and inefficient), but points towards some papers on implementing good approximations of percentiles (which are faster). ↩︎

  2. In High Performance MySQL (Chapter 8, section “Complex Queries Verses Many Queries”) there are more details on why fewer queries is not always necessarily better/faster. ↩︎

  3. In High Performance MySQL (Chapter 7, section “Using Index Scans for Sorts”) there are more details on how this works, and more examples of queries that do and do not work with this technique. ↩︎

All notes in this series: