34

Introduction to dtplyr

 5 years ago
source link: https://www.tuicool.com/articles/2EZn2mi
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Learn how to easily combine dplyr ’s readability with data.table ’s performance!

Jul 25 ·4min read

I recently saw a Tweet by Hadley Wickham about the release of dtplyr . It is a package that enables working with dplyr syntax on data.table objects. dtplyr automatically translates the dplyr syntax to the data.table equivalent, which in the end results in a performance boost.

Marvel: Infinity War is the most ambitious crossover event in history. Hadley Wickham: Hold my beer.

I always liked the ease and readability of dplyr and was eager to compare the performance of the package. Let’s see how it works in practice!

Loading libraries

For this article, we need to install dtplyr from GitHub by running devtools::install_github(“tidyverse/dtplyr”) and we use microbenchmark for performance comparison.

Generating the dataset

We generate an artificial dataset. The first thing that came to my mind is an order registry, in which we store:

id
name
date
amount
price

As this is only a toy example, we do not dive deeply into the logic behind the dataset. We can agree that it vaguely resembles a real-life scenario. For testing the performance of different approaches, we generate 10 million rows of data.

By using lazy_dt() we trigger the lazy evaluation — no computation is performed until we explicitly request it by using as.data.table() , as.data.frame() or as_tibble() . For the sake of comparison, we store one data.frame , one data.table and one “lazy” data.table .

We can preview the transformation, as well as the generated data.table code by printing the result:

Source: local data table [?? x 3]
Call: `_DT3`[date < as.Date("2019-02-01"), .(id, product, date)][order(date)]
id product date
<chr> <chr> <date>
1 DHQ GVF 2019-01-01
2 NUB ZIU 2019-01-01
3 CKW LJH 2019-01-01
4 AZO VIQ 2019-01-01
5 AQW AGD 2019-01-01
6 OBL NPC 2019-01-01

Generally, this should be used for debugging. We should indicate what kind of object we want to receive at the end of the pipeline to clearly show that we are done with the transformations.

Use-case 1: Filtering, Selecting and Sorting

Let’s say we want to have a list of transactions that happened before 2019–02–01, sorted by date, and we do not care about either the amount or price.

We see that dtplyr is slightly slower than data.table , but by looking at the median time it is ~4x faster than dplyr.

Use-case 2: Adding new variables after filtering

In this example, we want to filter orders with a number of products over 5000 and calculate the order value, which is amount * price .

Most of the expressions using mutate() must make a copy (do not modify in-place), which would not be necessary when using data.table directly. To counter for that, we can specify immutable = FALSE in lazy_dt() to opt-out of the mentioned behavior.

This time the difference is not so pronounced. This, of course, depends on the complexity of operations done to the tables.

Use-case 3: Aggregation on top

Let’s say we want to:

  • Filter all orders on amount <= 4000
  • Calculate the average order value per customer

This time we get ~3x improvement in median execution time.

Conclusions

dtplyr is (and always will be) slightly slower than data.table . That is because:

1. Each dplyr verb must be converted to a data.table equivalent. For large datasets, this should be negligible, as these translation operations take time proportional to the complexity of the input code, rather than the amount of data.

2. Some data.table expressions have no direct dplyr equivalent.

3. Immutability issue mentioned in use-case 2.

Summing up, I believe that dtplyr is a valuable addition to the tidyverse , as with only small changes to the dplyr code, we can achieve significant performance improvements.

As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments. You can find the code used for this article on my GitHub .


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK