As your Django database schema changes over time, inevitably one day you will need to normalize a model. Typically this consists of pulling out a column into its own table, and using foreign keys to store the relationship.
Applying this change involves writing a migration which does two main things: updates the schema, and migrates the data. In Django it’s quite easy to make the schema change, but it’s also quite easy to write a very inefficient data migration.
These notes focus on the problem of writing an efficient data migration.
Example Scenario §
Say you are storing a map of servers to locations. The database table might look like the following:
```
   id | location_name
------+---------------
    1 | FLOOR-3
    2 | FLOOR-3
    3 | FLOOR-2
  ...
29873 | FLOOR-2
29874 | FLOOR-1
29875 | FLOOR-3
```
Initially you modelled this in Django with a simple `Server` model which contains a `location_name` field:

```python
class Server(models.Model):
    location_name = models.CharField(max_length=16)
    # etc...
```
However, now you would like to normalize the schema by storing the location name in a separate table. The new tables should end up looking like this:
The server table:

```
   id | location_id
------+-------------
    1 | 1
    2 | 1
    3 | 2
  ...
29873 | 2
29874 | 3
29875 | 1
```

The location table:

```
   id | name
------+---------
    1 | FLOOR-3
    2 | FLOOR-2
    3 | FLOOR-1
  ...
```
First, update your Django `models.py` to introduce a separate `Location` model, and replace the `location_name` field with a foreign key:

```python
class Location(models.Model):
    name = models.CharField(max_length=16, unique=True)


class Server(models.Model):
    location = models.ForeignKey(Location, on_delete=models.CASCADE)
    # etc...
```
Then run `makemigrations` to auto-generate a migration that handles the schema change, and follow the data migration documentation to add a custom function that is called during the migration by `migrations.RunPython`. What do we put in this custom function to migrate the data efficiently?
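For context, the wiring around `migrations.RunPython` might look roughly like the sketch below. This is illustrative only: the app label `inventory` and the migration file names are hypothetical, and the body of the custom function is the subject of the rest of these notes.

```python
# migrations/0002_normalize_location.py -- hypothetical sketch
from django.db import migrations


def migrate_location_data(apps, schema_editor):
    # Use the historical models, as the data migration docs recommend
    Server = apps.get_model("inventory", "Server")
    Location = apps.get_model("inventory", "Location")
    # ... data migration logic goes here ...


class Migration(migrations.Migration):
    dependencies = [
        ("inventory", "0001_initial"),
    ]
    operations = [
        # schema operations auto-generated by makemigrations, then:
        migrations.RunPython(migrate_location_data, migrations.RunPython.noop),
    ]
```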
Naïve approach §
The naïve approach is to write the data migration from the perspective of the `Server` model, and loop through each row in the server table:

```python
# Not recommended
for server in Server.objects.all():
    location_fk, _ = Location.objects.get_or_create(name=server.location_name)
    server.location = location_fk
    server.save()
```
This will fire off two or three queries for each row in the server table:

- `get_or_create()` will perform a `SELECT` query to get the location, if it already exists.
- Additionally, if the location does not exist, `get_or_create()` will perform an `INSERT` query to create the location.
- `server.save()` will perform an `UPDATE` of a single row in the server table.
Taking `N_s` as the number of servers:

```
naive_approach__num_of_queries__best_case  = N_s * 2
naive_approach__num_of_queries__worst_case = N_s * 3
```
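The arithmetic can be sanity-checked with a toy, ORM-free simulation: plain dicts stand in for table rows, and a counter stands in for the database. None of this is Django code; it only mimics the queries the naïve loop would issue.

```python
# Toy simulation of the naive loop, counting "queries".
servers = [{"id": i, "location_name": name, "location_id": None}
           for i, name in enumerate(
               ["FLOOR-3", "FLOOR-3", "FLOOR-2", "FLOOR-2", "FLOOR-1"], start=1)]

queries = 0
locations = {}  # name -> id

for s in servers:
    queries += 1                        # SELECT, from get_or_create()
    if s["location_name"] not in locations:
        queries += 1                    # INSERT, from get_or_create()
        locations[s["location_name"]] = len(locations) + 1
    s["location_id"] = locations[s["location_name"]]
    queries += 1                        # UPDATE, from server.save()

print(queries)  # 5 servers * 2 + 3 new locations = 13
```

With 5 servers and 3 distinct locations, the count lands at 13, between the best case (`5 * 2 = 10`) and the worst case (`5 * 3 = 15`).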
If the model is small, this will not be a problem, but if it is large (e.g. a million rows) the migration will likely be time-consuming. This is a classic case of the 1+N problem (better known as the N+1 problem).
Better approach §
A better approach is to write the data migration from the perspective of the `Location` model: the field we are trying to normalize. By turning the approach on its head we actually run fewer queries!
First, run one query to get a list of all unique location names, then loop through that list, and “bulk” update each server row:
```python
names = Server.objects.all().values_list("location_name", flat=True).distinct()
for name in names:
    location_fk = Location.objects.create(name=name)
    Server.objects.filter(location_name=name).update(location=location_fk)
```
This will fire off one query up front, then two more queries for each location name:

- `Server.objects.all().values_list(...).distinct()` will perform a `SELECT DISTINCT` query, which returns a list such as `["FLOOR-1", "FLOOR-2", "FLOOR-3", ...]`.
- `create()` will perform an `INSERT`. Since each location name is guaranteed to be unique, we do not need to use `get_or_create()`.
- `filter().update()` will perform an `UPDATE` of all rows in the server table with that location name.
Taking `N_l` as the number of locations:

```
better_approach__num_of_queries = 1 + (N_l * 2)
```
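This formula can likewise be checked with a toy, ORM-free simulation of the better approach, using plain dicts as table rows and a counter as the database (illustrative only; not Django code):

```python
# Toy simulation of the better approach, counting "queries".
servers = [{"id": i, "location_name": name, "location_id": None}
           for i, name in enumerate(
               ["FLOOR-3", "FLOOR-3", "FLOOR-2", "FLOOR-2", "FLOOR-1"], start=1)]

queries = 1  # one SELECT DISTINCT to fetch the list of names
names = sorted({s["location_name"] for s in servers})

locations = {}  # name -> id
for name in names:
    queries += 1  # INSERT INTO location ...
    locations[name] = len(locations) + 1
    queries += 1  # UPDATE server SET location_id = ... WHERE location_name = ...
    for s in servers:
        if s["location_name"] == name:
            s["location_id"] = locations[name]

print(queries)  # 1 + (3 locations * 2) = 7
```

The same 5 servers and 3 locations now cost 7 queries, matching `1 + (N_l * 2)`.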
Every server has a unique location §
If every server has a unique location, then `N_s == N_l`:

```
# Assume N_s = N_l = 1,000,000
naive_approach__num_of_queries__best_case  = N_s * 2       = 2,000,000
naive_approach__num_of_queries__worst_case = N_s * 3       = 3,000,000
better_approach__num_of_queries            = 1 + (N_l * 2) = 2,000,001
```
So if every server has a unique location, and every location already exists, then the naïve approach wins by a single query. Though in this case I would question why normalization is being done at all: if each location is unique, there is no need to normalize!
There are 100 servers per location §
In a more typical case, where there are say 100 servers per location, the numbers would look quite different:
```
# Assume N_s = 1,000,000 and N_l = 10,000
naive_approach__num_of_queries__best_case  = N_s * 2       = 2,000,000
naive_approach__num_of_queries__worst_case = N_s * 3       = 3,000,000
better_approach__num_of_queries            = 1 + (N_l * 2) = 20,001
```
So in this more typical case, the better approach performs only ~1% of the number of queries of the naïve approach.
Key points §
- When writing Django migrations, take time to consider alternative perspectives that may result in fewer queries.
- Calling `.save()` in a loop is a code smell indicating that a different perspective would probably be more efficient.
- Fewer queries being faster is only a rule of thumb. If speed matters, make sure to profile your migrations.
- If the amount of data being migrated is sufficiently small, perhaps the naïve approach is good enough.
- If the amount of data being migrated is sufficiently big, perhaps the call to `.all().values_list(...).distinct()` will pull too much data into Python’s memory. Consider chunking the call.
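On that last point, one option is to stream the names a batch at a time, e.g. with Django's `QuerySet.iterator()` on the database side plus a small generic helper like the following sketch (plain Python, not a Django API):

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


# e.g. process 1,000 location names at a time:
#   for batch in chunked(names_queryset.iterator(), 1000): ...
print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```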