8

Optimizing Data Aggregation in Django for Many Records

 3 years ago
source link: https://snakeycode.wordpress.com/2020/02/07/optimizing-data-aggregation-in-django-for-many-records/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Optimizing Data Aggregation in Django for Many Records

If you are just working with a few records, it probably does not matter how you do things. But when you have to do calculations on lots of records, you can get dramatic differences in speed by seemingly small changes in how you do things.

In my case, I was working with more than 10,000 tickets. Each ticket could have zero or more labor records and zero or more equipment records. Here are the models:

class Ticket(models.Model):
work_date = models.DateField()
field1 = models.IntegerField()
field2 = models.IntegerField()
etc = models.IntegerField()
class Labor(models.Manager):
ticket = models.ForeignKey(Ticket)
revenue = models.FloatField()
etc = models.IntegerField()
class Equipment(models.Manager):
ticket = models.ForeignKey(Ticket)
revenue = models.FloatField()
etc = models.IntegerField()

Here is my first attempt at getting the total revenue for each ticket. It uses some of the optimization techniques recommended in the Django docs, most notably the “values_list” method and doing aggregation in the database.

def get_revenue(start_date, stop_date):
# If you do not need most of the functionality of a Ticket instance, then using values_list is much faster.
ticket_ids = Ticket.objects.filter(work_date__range=[start_date, stop_date]).values_list('id', flat=True)
tickets = {}
for ticket_id in ticket_ids:
# Doing aggregation in the database also speeds things up
labor = Labor.objects.filter(ticket_id=ticket_id).aggregate(Sum('revenue'))
eq = Equipment.objects.filter(ticket_id=ticket_id).aggregate(Sum('revenue'))
tickets[ticket_id] = labor['revenue__sum'] + eq['revenue__sum']
return tickets

One possible thing slowing this code down is that it hits the database twice for each ticket. Here is an alternative version:

def get_revenue2(start_date, stop_date):
# If you do not need most of the functionality of a Ticket instance, then using values_list is much faster.
ticket_ids = Ticket.objects.filter(work_date__range=[start_date, stop_date]).values_list('id', flat=True)
tickets = {}
for model in [Labor, Equipment]:
items = model.objects.filter(ticket_id__in=ticket_ids).values_list('id', 'revenue')
for item in items:
tickets[item[0]] = ticket_ids.setdefault(item[0], 0.0) + item[1]
return tickets

This code dramatically reduces the number of database hits at the cost of aggregating the results in python. It turns out that for 10,000 tickets, the second approach is X3 faster.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK