
Making Pandas fast with Dask parallel computing.



Source: https://unsplash.com/photos/U66avewmxJk

So you, my dear Python enthusiast, have been learning Pandas and Matplotlib for a while and have written some really cool code to analyze and visualize your data. You are ready to run the script that reads a huge file when, all of a sudden, your laptop starts making an ugly noise and burning like hell. Sounds familiar?

Well, I have some good news for you: this doesn’t need to happen anymore, and no, you don’t need to upgrade your laptop or your server.

Introducing Dask:

Dask is a flexible library for parallel computing with Python. It provides multi-core and distributed parallel execution on larger-than-memory datasets. It figures out how to break up large computations and route parts of them efficiently onto distributed hardware.
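
To make the idea concrete, here is a minimal sketch (not from the original article) of how Dask turns ordinary Python calls into a task graph and runs it across your cores; the square function and the numbers are invented purely for illustration:

```python
import dask

# Dask wraps ordinary Python calls into lazy tasks; nothing runs yet.
@dask.delayed
def square(x):
    return x ** 2

# Build a graph of eight independent tasks.
tasks = [square(i) for i in range(8)]

# compute() hands the graph to the scheduler, which runs the tasks in parallel
# on the available cores and returns the concrete results.
results = dask.compute(*tasks)
print(results)  # (0, 1, 4, 9, 16, 25, 36, 49)
```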

A massive cluster is not always the right choice

Today’s laptops and workstations are surprisingly powerful and, if used correctly, can handle datasets and computations for which we previously depended on clusters. A modern laptop has a multi-core CPU, 32 GB of RAM, and flash storage that can stream through data several times faster than the HDDs or SSDs of even a year or two ago.

As a result, Dask can empower analysts to manipulate 100GB+ datasets on their laptop or 1TB+ datasets on a workstation without bothering with the cluster at all.

The project has been a massive plus for the Python machine learning ecosystem because it democratizes big data analysis. Not only can you save money on bigger servers, but it also mirrors the Pandas API, so you can run your Pandas script after changing very few lines of code.

Comparing Dask to Spark and the problem of Big Data

Some years after the boom of the Hadoop ecosystem, many companies have built a picture in their minds in which Spark is the fastest way to run Python (or other languages’) code across the nodes of a big data cluster. And indeed, Spark has done a great job for many companies.

However, if you (like me) are a Python developer interested mainly in data analysis or data science, it will be hard to find a use case where you really “must” use Spark. The truth is that many engineers are using big data clusters when it is in fact possible, and easier, to do the same things with scalable tools native to their own ecosystem. That is exactly what Dask is: a Python-native library for parallel computing.

Both Dask and Spark can scale from a single node to a thousand-node cluster. But while Spark’s internal model is higher level, providing good high-level optimizations on uniformly applied computations, Dask’s internal model is lower level, providing the flexibility needed for more complex algorithms.

Dask is lighter than Spark and is a component of the larger Python ecosystem. It couples with and enhances other libraries like NumPy, Pandas, and Scikit-Learn and is applied to both BI and scientific problems. Dask’s DataFrame reuses the Pandas API and memory model. It implements neither SQL nor a query optimizer. It is able to do random access, efficient time series operations, and other Pandas-style indexed operations. It also fully supports the NumPy model for scalable multi-dimensional arrays.
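
As a rough illustration of that NumPy support (the shape and chunk size below are arbitrary, not from the article), a Dask array looks and behaves like a NumPy array but is split into chunks that are processed in parallel:

```python
import dask.array as da

# A 100,000 x 1,000 array of random numbers, split into ten row-wise chunks.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))

# Operations build a lazy graph; compute() runs it chunk by chunk in parallel.
col_means = x.mean(axis=0)
print(col_means.compute()[:5])
```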

But perhaps where Dask shines the most is when the problem is CPU bound: Dask can scale scikit-learn to a cluster of machines for a CPU-bound workload.
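
One common way to do that, sketched below under the assumption that dask.distributed and scikit-learn are installed (the dataset and parameter grid are invented), is to let scikit-learn’s joblib parallelism hand its work to a Dask cluster:

```python
from dask.distributed import Client  # importing distributed also registers the joblib backend
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # a local "cluster" of worker processes on this machine

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, n_jobs=-1)

# Every model fit in the grid search is dispatched to the Dask workers.
with parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
```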

OK now that we know all the advantages Dask can provide, let’s get hands-on!

Translating Pandas to Dask, a quick example:

Let’s have a look at a very typical problem we have in Machine Learning and Data Science these days: files with images.

I have selected two .csv files of different sizes containing images, to compare Pandas and Dask performance. Both files contain pixel values and spatial coordinates, and every row holds the information for one image.

The file called “small.csv” has 25,000 rows and weighs 242 MB. The one called “big.csv” has 300,000 rows and 1.6 GB.

You can download them here.

Let’s read the file and plot an image using Pandas:
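
The original snippet is embedded as a gist in the source article and is not reproduced here; a minimal sketch of it could look like this (the file name matches the article, but the column layout, assumed to be alternating x/y coordinates, is a guess):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read the whole CSV into memory at once -- this is the step that struggles
# once the file no longer fits comfortably in RAM.
df = pd.read_csv("small.csv")

# Take the first row and plot it as a scatter of its spatial coordinates.
row = df.iloc[0].to_numpy()
plt.scatter(row[0::2], row[1::2], s=1)
plt.show()
```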

When we run the script above, Pandas reads the file and creates a DataFrame, and then Matplotlib plots the image for the first row of the file:

Plot for the first row of the file (row[0]). Source: Author

Now let’s run the same code, but substituting the previous file (small.csv) with the second, bigger one (big.csv). Click run and see what happens.

On my laptop, it runs for a while and then crashes. Maybe if you have a very powerful computer it didn’t crash, but it certainly took longer. This is natural behavior, since the file is bigger.

OK, time to try Dask: you just need to introduce these tiny changes in your code:
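
The original gist is again not reproduced here, but the change really is tiny: swap pandas.read_csv for dask.dataframe.read_csv and pull only the rows you need (the column layout is the same assumption as before):

```python
import dask.dataframe as dd
import matplotlib.pyplot as plt

# read_csv is now lazy: Dask only scans the file's structure and builds a task
# graph, splitting big.csv into partitions instead of loading it all at once.
df = dd.read_csv("big.csv")

# head(1) materializes just the first row from the first partition.
row = df.head(1).to_numpy()[0]
plt.scatter(row[0::2], row[1::2], s=1)
plt.show()
```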

Now I run my script and it doesn’t crash anymore, and it plots this nice spiral:

Plot for the first row of the file (row[0]). Source: Author

In fact, it runs quickly. You can compare the time performance of both scripts just by adding %%timeit at the top of the cell if you are using a Jupyter Notebook.
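
Outside a notebook, you can do a rough comparison with time.perf_counter(); note that dd.read_csv is lazy, so to time real work you have to force a computation, here (as an arbitrary choice) the row count:

```python
import time

import dask.dataframe as dd
import pandas as pd

start = time.perf_counter()
n_rows_pandas = len(pd.read_csv("big.csv"))
print(f"pandas: {time.perf_counter() - start:.1f} s")

start = time.perf_counter()
n_rows_dask = dd.read_csv("big.csv").shape[0].compute()
print(f"dask:   {time.perf_counter() - start:.1f} s")
```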

OK, so what actually happened under the hood is that Dask partitioned the data into several chunks and ran our code in parallel on those chunks. I know, it’s cool.
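
You can see those chunks directly; in the sketch below the blocksize value is arbitrary, and each partition is just an ordinary Pandas DataFrame that the scheduler works on in parallel:

```python
import dask.dataframe as dd

df = dd.read_csv("big.csv", blocksize="64MB")  # split the file into ~64 MB chunks

print(df.npartitions)                          # how many chunks Dask created
print(type(df.get_partition(0).compute()))     # <class 'pandas.core.frame.DataFrame'>
```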

So the key takeaway is this:

if you need to perform a quick analysis on big files, it is not necessary to use full-blown big data tools or rent a bigger server. Dask gives us a quick and simple implementation of our analysis for free.

If you have a glance at the Dask documentation, you will find that most common Pandas functions are available in Dask, although the Pandas API is huge and obviously many things are not covered yet. It also provides support for NumPy code.
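
For example, familiar calls like describe or groupby work almost unchanged; the only real difference is the explicit .compute() that triggers execution (the column name below is hypothetical):

```python
import dask.dataframe as dd

df = dd.read_csv("big.csv")

# Same API as Pandas, just lazy until compute() is called.
summary = df.describe().compute()
# grouped = df.groupby("some_column").mean().compute()  # hypothetical column name
```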

One last thing to keep in mind is that Dask is only useful for files larger than your computer’s memory.

If you were thinking of using Dask to speed up your code on a file that Pandas can already handle, well, let’s say that’s a very bad idea: the extra overhead will only make your code slower.

OK, so that was it for today. I hope you enjoyed Dask and feel ready to give it a try and supercharge your Pandas scripts. Happy coding!

