Optimizing Data Aggregation in Django for Many Records
source link: https://snakeycode.wordpress.com/2020/02/07/optimizing-data-aggregation-in-django-for-many-records/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Optimizing Data Aggregation in Django for Many Records
If you are just working with a few records, it probably does not matter how you do things. But when you have to do calculations on lots of records, you can get dramatic differences in speed by seemingly small changes in how you do things.
In my case, I was working with more than 10,000 tickets. Each ticket could have zero or more labor records and zero or more equipment records. Here are the models:
class
Ticket(models.Model):
work_date
=
models.DateField()
field1
=
models.IntegerField()
field2
=
models.IntegerField()
etc
=
models.IntegerField()
class
Labor(models.Manager):
ticket
=
models.ForeignKey(Ticket)
revenue
=
models.FloatField()
etc
=
models.IntegerField()
class
Equipment(models.Manager):
ticket
=
models.ForeignKey(Ticket)
revenue
=
models.FloatField()
etc
=
models.IntegerField()
Here is my first attempt at getting the total revenue for each ticket. It uses some of the optimization techniques recommended in the Django docs, most notably the “values_list” method and doing aggregation in the database.
def
get_revenue(start_date, stop_date):
# If you do not need most of the functionality of a Ticket instance, then using values_list is much faster.
ticket_ids
=
Ticket.objects.
filter
(work_date__range
=
[start_date, stop_date]).values_list(
'id'
, flat
=
True
)
tickets
=
{}
for
ticket_id
in
ticket_ids:
# Doing aggregation in the database also speeds things up
labor
=
Labor.objects.
filter
(ticket_id
=
ticket_id).aggregate(
Sum
(
'revenue'
))
eq
=
Equipment.objects.
filter
(ticket_id
=
ticket_id).aggregate(
Sum
(
'revenue'
))
tickets[ticket_id]
=
labor[
'revenue__sum'
]
+
eq[
'revenue__sum'
]
return
tickets
One possible thing slowing this code down is that it hits the database twice for each ticket. Here is an alternative version:
def
get_revenue2(start_date, stop_date):
# If you do not need most of the functionality of a Ticket instance, then using values_list is much faster.
ticket_ids
=
Ticket.objects.
filter
(work_date__range
=
[start_date, stop_date]).values_list(
'id'
, flat
=
True
)
tickets
=
{}
for
model
in
[Labor, Equipment]:
items
=
model.objects.
filter
(ticket_id__in
=
ticket_ids).values_list(
'id'
,
'revenue'
)
for
item
in
items:
tickets[item[
0
]]
=
ticket_ids.setdefault(item[
0
],
0.0
)
+
item[
1
]
return
tickets
This code dramatically reduces the number of database hits at the cost of aggregating the results in python. It turns out that for 10,000 tickets, the second approach is X3 faster.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK