
Spark-like distributed datasets in under 100 lines of Unison

Source: https://www.unison-lang.org/articles/distributed-datasets/


This article is part of a series on compositional distributed systems.


Hi there! This is the first in a series of articles on compositional distributed systems, showing various neat things you can do with Unison's distributed programming support. This article presents an API for computations on distributed datasets. It's a bit like Spark, though we'll make different design choices in a few key areas. Our library will be absolutely tiny: the core data type is 3 lines of code, and operations like `Seq.map` are just a handful of lines each.

You can try writing programs with this API today, using the `Remote.pure.run` local interpreter. It's not very fast, but that's okay for development and testing. Programs written using `Remote` can run unchanged on actual distributed infrastructure such as Unison Cloud.

Spark is a huge project (over a million lines of code) with lots of functionality, but we're interested in just the core idea of a distributed immutable dataset that can be operated on in parallel.

Here's a quick preview of what we'll get to in this article:
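Something like the following, where `Seq.fromList` and `Seq.filter` are illustrative helpers and the signatures are a sketch (only `Seq.map` and `Seq.reduce` are described in detail in this article):

```
-- A sketch of the kind of pipeline this article builds up to.
-- `Seq.fromList` and `Seq.filter` are assumed helpers; `Seq.map`
-- and `Seq.reduce` are the operations discussed below.
distributedSum : Seq k Nat ->{Remote} Nat
distributedSum input =
  input
    |> Seq.map (x -> x * 10)
    |> Seq.filter (x -> Nat.mod x 2 == 0)
    |> Seq.reduce 0 (+)
```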

This code will operate in parallel on a distributed sequence of any size, spread across any number of machines. The functions are "moved to the data", so there's little network traffic during execution, and the distributed sequence `Seq` is lazy, so nothing actually happens until the `Seq.reduce`. Also, the data structure fuses all operations so they happen in one pass over the data.

Any stage of the computation can be cached via functions like `Seq.memo` or `memoReduce`. These functions cache subtrees of the computation as well, not just the root, so repeated runs of related jobs only have to compute on the diff of what's new. Imagine batch jobs that finish in seconds rather than hours.
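As a rough sketch, caching a stage might look like this (the exact signature of `Seq.memo` isn't shown here, and `cache` and `expensiveStep` are stand-ins):

```
-- Hypothetical usage sketch: `cache` and `expensiveStep` are
-- stand-ins, and the real signature of Seq.memo may differ.
cachedJob : Seq k Nat ->{Remote} Nat
cachedJob input =
  input
    |> Seq.map expensiveStep
    |> Seq.memo cache      -- cache this stage (and its subtrees)
    |> Seq.reduce 0 (+)
```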

Developing distributed programs doesn't require building containers or deploying jar files. Due to Unison's design, dependency conflicts are impossible and there's no serialization code to write. We can focus on the core data structures, algorithms, and business logic.

To test this code, we can use a simple local interpreter such as `Remote.pure.run`:
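For example, assuming `Remote.pure.run` accepts a delayed `{Remote}` computation (and reusing the sketched `Seq.fromList` and `distributedSum` from above):

```
-- Run the whole pipeline locally, with no cluster involved.
result = Remote.pure.run do
  nums = Seq.fromList [1, 2, 3, 4, 5, 6]
  distributedSum nums
```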

... or perhaps a more interesting "chaos monkey" local interpreter that injects failures, delays, and simulated network partitions. It's also possible to write interpreters that determine the network usage of your program by local simulation ("Oh no! This sort is shuffling lots of data over the network!"), letting you diagnose and fix performance problems without having to deploy or run jobs in production.

We get all this from a library that is tiny (the core data type is just 3 lines) and extensible. For instance, `Seq.map` is a few lines of straightforward code, and any user is empowered to add new operations. We'll explain this code and how it works next in the tutorial:
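Here's a sketch consistent with that description (the constructor names are assumptions, but the shape is the point: a lazy binary tree whose subtrees sit behind remote `Value` references):

```
-- The core data type: a distributed sequence is either empty, a
-- single element, or two subsequences stored as remote Values.
type Seq k a
  = Empty
  | One a
  | Two (Value (Seq k a)) (Value (Seq k a))

-- Mapping looks like an ordinary in-memory map; Value.map moves the
-- recursive call inside each remote subtree without forcing it.
Seq.map : (a -> b) -> Seq k a -> Seq k b
Seq.map f = cases
  Empty   -> Empty
  One a   -> One (f a)
  Two l r -> Two (Value.map (Seq.map f) l) (Value.map (Seq.map f) r)
```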

Operations on the distributed data structure end up looking similar to an in-memory version, just with an extra `Value.map` to move the computation inside a remote, distributed `Value`. Any immutable data structure can be turned into a distributed version by wrapping references in remote values.
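For instance, an ordinary binary tree can be distributed by the same trick (again a sketch, following the same pattern as `Seq`):

```
-- In memory:   Branch (Tree a) a (Tree a)
-- Distributed: wrap each subtree reference in a remote Value.
type Tree k a
  = Leaf
  | Branch (Value (Tree k a)) a (Value (Tree k a))
```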

Continue on to the tutorial for a detailed explanation.

