Introduction to dtplyr

Learn how to easily combine `dplyr`'s readability with `data.table`'s performance!

Jul 25 · 4 min read

I recently saw a Tweet by Hadley Wickham about the release of dtplyr. It is a package that lets us use dplyr syntax on data.table objects. dtplyr automatically translates the dplyr code into its data.table equivalent, which results in a performance boost.

Marvel: Infinity War is the most ambitious crossover event in history. Hadley Wickham: Hold my beer.

I always liked the ease and readability of dplyr and was eager to see how the new package performs. Let's see how it works in practice!

Loading libraries

For this article, we need to install dtplyr from GitHub by running devtools::install_github("tidyverse/dtplyr"). We also use microbenchmark for the performance comparison.
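A minimal sketch of the setup, assuming all four packages are already installed. The install command is left commented out because it only needs to run once.

```r
# One-time installation from GitHub, as described above (requires devtools):
# devtools::install_github("tidyverse/dtplyr")

library(data.table)      # the backend that does the actual work
library(dplyr)           # the verbs we write the pipelines in
library(dtplyr)          # translates dplyr verbs into data.table calls
library(microbenchmark)  # timing of the different approaches
```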

Generating the dataset

We generate an artificial dataset. The first thing that came to my mind is an order registry, in which we store:

id
product name
date
amount
price

As this is only a toy example, we do not dive deeply into the logic behind the dataset. We can agree that it vaguely resembles a real-life scenario. For testing the performance of different approaches, we generate 10 million rows of data.
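One way such a dataset could be generated is sketched below. The helper generate_orders(), the value ranges and the three-letter codes are assumptions made for illustration. The column is called product to match the printed output later in the article.

```r
# Hypothetical helper: builds a toy order registry with n rows.
generate_orders <- function(n, seed = 42) {
  set.seed(seed)
  # Random three-letter codes stand in for customer ids and product names.
  random_code <- function(k) {
    paste0(sample(LETTERS, k, replace = TRUE),
           sample(LETTERS, k, replace = TRUE),
           sample(LETTERS, k, replace = TRUE))
  }
  days <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")
  data.frame(
    id      = random_code(n),
    product = random_code(n),
    date    = sample(days, n, replace = TRUE),
    amount  = sample(1:10000, n, replace = TRUE),    # assumed range
    price   = round(runif(n, min = 1, max = 100), 2), # assumed range
    stringsAsFactors = FALSE
  )
}

# 1e5 rows keeps the example fast; use 1e7 for the 10 million rows used here.
orders_df <- generate_orders(1e5)
```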

Wrapping the data in lazy_dt() turns on lazy evaluation: no computation is performed until we explicitly request it with as.data.table(), as.data.frame() or as_tibble(). For the sake of comparison, we store one data.frame, one data.table and one "lazy" data.table.
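The three objects and the lazy behavior can be sketched as follows. The tiny data frame and the object names are placeholders for illustration, not the 10-million-row registry.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Small stand-in for the order registry.
orders_df <- data.frame(
  id     = c("A", "B", "C"),
  amount = c(10L, 20L, 30L),
  price  = c(1.5, 2.0, 2.5),
  stringsAsFactors = FALSE
)

orders_dt   <- as.data.table(orders_df)  # plain data.table
orders_lazy <- lazy_dt(orders_df)        # "lazy" data.table for dtplyr

# Nothing is computed here: the pipeline only records the steps.
pipeline <- orders_lazy %>% filter(amount > 10)

# The computation happens once we ask for a concrete object.
result <- as_tibble(pipeline)
```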

We can preview the transformation, as well as the generated data.table code by printing the result:

Source: local data table [?? x 3]
Call:   `_DT3`[date < as.Date("2019-02-01"), .(id, product, date)][order(date)]

  id    product date      
  <chr> <chr>   <date>    
1 DHQ   GVF     2019-01-01
2 NUB   ZIU     2019-01-01
3 CKW   LJH     2019-01-01
4 AZO   VIQ     2019-01-01
5 AQW   AGD     2019-01-01
6 OBL   NPC     2019-01-01

Generally, printing the pipeline like this should be reserved for debugging. In real code, we should end the pipeline with as.data.table(), as.data.frame() or as_tibble() to state what kind of object we want and to show clearly that we are done with the transformations.
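A small sketch of both habits, using a made-up three-row table: show_query() displays the generated data.table call for debugging, and as.data.table() closes the pipeline.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Made-up rows, for illustration only.
orders_lazy <- lazy_dt(data.frame(
  id      = c("A", "B", "C"),
  product = c("X", "Y", "Z"),
  date    = as.Date(c("2019-03-05", "2019-01-01", "2019-01-15")),
  amount  = c(100L, 200L, 300L),
  price   = c(9.99, 19.99, 4.5),
  stringsAsFactors = FALSE
))

pipeline <- orders_lazy %>%
  filter(date < as.Date("2019-02-01")) %>%
  select(id, product, date) %>%
  arrange(date)

# Debugging: inspect the generated data.table call.
show_query(pipeline)

# Final step: state the output type explicitly to trigger the computation.
result <- as.data.table(pipeline)
```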

Use-case 1: Filtering, Selecting and Sorting

Let's say we want a list of transactions that happened before 2019-02-01, sorted by date, and we do not care about either the amount or the price.
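The three variants of this query might look like the sketch below. The dataset is a small random stand-in with assumed value ranges, so the timings it produces are not comparable to the ones reported in this article.

```r
library(data.table)
library(dplyr)
library(dtplyr)
library(microbenchmark)

set.seed(1)
n <- 1000  # small for illustration; the article uses 10 million rows
orders_df <- data.frame(
  id      = sample(LETTERS, n, replace = TRUE),
  product = sample(letters, n, replace = TRUE),
  date    = as.Date("2019-01-01") + sample(0:364, n, replace = TRUE),
  amount  = sample(1:10000, n, replace = TRUE),
  price   = round(runif(n, 1, 100), 2),
  stringsAsFactors = FALSE
)
orders_dt   <- as.data.table(orders_df)
orders_lazy <- lazy_dt(orders_df)

# dplyr on a data.frame
res_dplyr <- orders_df %>%
  filter(date < as.Date("2019-02-01")) %>%
  select(-amount, -price) %>%
  arrange(date)

# data.table
res_dt <- orders_dt[date < as.Date("2019-02-01"), .(id, product, date)][order(date)]

# dtplyr: same verbs as dplyr, evaluated by data.table
res_dtplyr <- orders_lazy %>%
  filter(date < as.Date("2019-02-01")) %>%
  select(-amount, -price) %>%
  arrange(date) %>%
  as.data.table()

# Timing comparison (times = 10 keeps the run short)
timings <- microbenchmark(
  dplyr = orders_df %>%
    filter(date < as.Date("2019-02-01")) %>%
    select(-amount, -price) %>%
    arrange(date),
  data.table = orders_dt[date < as.Date("2019-02-01"),
                         .(id, product, date)][order(date)],
  dtplyr = orders_lazy %>%
    filter(date < as.Date("2019-02-01")) %>%
    select(-amount, -price) %>%
    arrange(date) %>%
    as.data.table(),
  times = 10
)
```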

We see that dtplyr is slightly slower than data.table, but judging by the median time it is ~4x faster than dplyr.

Use-case 2: Adding new variables after filtering

In this example, we want to keep only the orders with an amount over 5000 and calculate the order value, which is amount * price.

Most expressions using mutate() must make a copy of the data (they do not modify it in place), which would not be necessary when using data.table directly. To counter that, we can specify immutable = FALSE in lazy_dt() to opt out of this behavior.
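A sketch of this use case in all three styles, with immutable = FALSE set on the lazy table. The data and the column name order_value are assumptions for illustration.

```r
library(data.table)
library(dplyr)
library(dtplyr)

set.seed(1)
n <- 1000  # small stand-in for the 10 million rows
orders_df <- data.frame(
  id     = sample(LETTERS, n, replace = TRUE),
  amount = sample(1:10000, n, replace = TRUE),
  price  = round(runif(n, 1, 100), 2),
  stringsAsFactors = FALSE
)

# dplyr
res_dplyr <- orders_df %>%
  filter(amount > 5000) %>%
  mutate(order_value = amount * price)

# data.table: := adds the column by reference to the filtered result
res_dt <- as.data.table(orders_df)[amount > 5000][, order_value := amount * price]

# dtplyr, opting out of the defensive copy
orders_lazy <- lazy_dt(as.data.table(orders_df), immutable = FALSE)
res_dtplyr <- orders_lazy %>%
  filter(amount > 5000) %>%
  mutate(order_value = amount * price) %>%
  as.data.table()
```

Keep in mind that with immutable = FALSE a mutate() applied directly to the lazy table can modify the underlying data.table, so it is safest on data we do not need to keep unchanged.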

This time the difference is not so pronounced. This, of course, depends on the complexity of operations done to the tables.

Use-case 3: Aggregation on top

Let’s say we want to:

Filter all orders on amount <= 4000
Calculate the average order value per customer
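The two steps above can be sketched as follows. We assume that id identifies the customer and that the order value is amount * price, as in use-case 2. The column name avg_order_value is made up for this example.

```r
library(data.table)
library(dplyr)
library(dtplyr)

set.seed(1)
n <- 1000  # small stand-in for the 10 million rows
orders_df <- data.frame(
  id     = sample(LETTERS, n, replace = TRUE),
  amount = sample(1:10000, n, replace = TRUE),
  price  = round(runif(n, 1, 100), 2),
  stringsAsFactors = FALSE
)
orders_dt   <- as.data.table(orders_df)
orders_lazy <- lazy_dt(orders_df)

# dplyr
res_dplyr <- orders_df %>%
  filter(amount <= 4000) %>%
  group_by(id) %>%
  summarise(avg_order_value = mean(amount * price))

# data.table: keyby groups and sorts by id
res_dt <- orders_dt[amount <= 4000,
                    .(avg_order_value = mean(amount * price)),
                    keyby = id]

# dtplyr
res_dtplyr <- orders_lazy %>%
  filter(amount <= 4000) %>%
  group_by(id) %>%
  summarise(avg_order_value = mean(amount * price)) %>%
  as_tibble()
```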

This time we get ~3x improvement in median execution time.

Conclusions

dtplyr is (and always will be) slightly slower than data.table. That is because:

1. Each dplyr verb must be converted to a data.table equivalent. For large datasets, this should be negligible, as these translation operations take time proportional to the complexity of the input code, rather than the amount of data.

2. Some data.table expressions have no direct dplyr equivalent.

3. The immutability issue mentioned in use-case 2: dtplyr copies the data before mutate() unless immutable = FALSE is set.

Summing up, I believe that dtplyr is a valuable addition to the tidyverse, as with only small changes to the dplyr code, we can achieve significant performance improvements.

As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments. You can find the code used for this article on my GitHub.
