I recently found a way to speed up a large data import far more than I expected.
The task was to read data from a text file and create data records in Django. The naive implementation was managing to import about 55 records per second, which was going to take far too long given the amount of data that needed to be imported.
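For reference, the slow version looked roughly like this (a minimal sketch; the Book model and data_source iterable here are stand-ins for the real project code, which appears in the example further down):

from our_models import Book

for data in data_source:
    # One save() - and one database round trip - per record
    Book.objects.create(**data)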
My co-worker Karen Tracey suggested changing to bulk inserts. Instead of creating and saving one Django record at a time, we'd create a whole batch of Django objects, then save them all in one SQL operation. I figured reducing the number of database round-trips would speed things up somewhat, but was not prepared for the actual numbers - I'm consistently getting around two orders of magnitude improvement compared to single record inserts.
As I scaled up, I made one more change - instead of doing the insert in one batch, I limited each batch to a few hundred records. I didn't want to store an unlimited number of Django objects in memory at once, and some benchmarking showed that the benefit of batching the inserts leveled off at a few hundred records.
Caveats
There are a few differences from normal object creation. First, save() is not called on the instances, post_save signals are not sent, and the instances' primary keys are not set after the insert. If you're doing anything more complicated than dumping a bunch of data into the database, you'll probably need to stick with creating objects individually.
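For instance, after a bulk insert the in-memory objects still don't know their database ids (a tiny sketch using the insert_many helper from the example below, with a hypothetical title field on Book):

book = Book(title="Bulk example")
insert_many([book])
book.pk    # still None - the primary key is not set on the instance

book = Book.objects.create(title="Single example")
book.pk    # set, because create() calls save()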
Also, the code we're using to do the bulk insert does not handle ForeignKeys properly. The workaround is to set the value of any ForeignKey field to the primary key of the object it refers to when creating the Django objects.
Example
Here's what code for a bulk insert might look like.
from bulkops import insert_many
from our_models import Book

objects = []
for data in data_source:
    # Assume data['foreign_key'] is a reference to another model;
    # change that to its primary key
    data['foreign_key'] = data['foreign_key'].pk
    objects.append(Book(**data))
    # Keep our batch size from getting too big
    if len(objects) > 200:
        insert_many(objects)
        objects = []
# Insert whatever is left over from the last partial batch
insert_many(objects)
Django 1.4
The current development branch of Django has added a bulk insert feature, which seems likely to be included in Django 1.4. It's very similar to the code we're using here - just change "insert_many(objects)" to "Book.objects.bulk_create(objects)". That's subject to change before Django 1.4 is released, of course.
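In other words, the batching loop from the example stays the same; only the two insert calls change (a sketch against the development branch as of this writing, so the exact API may still shift before the release):

# instead of insert_many(objects):
Book.objects.bulk_create(objects)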
Credit
Credit goes to Karen for suggesting the approach, and to Ole Laursen's blog post for the original idea and the implementation we're using.
Links
Ole Laursen's blog post: http://ole-laursen.blogspot.com/2010/11/bulk-inserting-django-objects.html
Implementation: http://people.iola.dk/olau/python/bulkops.py
Original commit to Django development: https://code.djangoproject.com/changeset/16739