
Classic pipeline example

source link: https://relational-pipes.globalcode.info/v_0/classic-example.xhtml

Assume that we have a text file containing a list of animals and their properties:

large white cat
medium black cat
big yellow dog
small yellow cat
small white dog
medium green turtle

We can pass this file through a pipeline:

cat animals.txt | grep dog | cut -d " " -f 2 | tr a-z A-Z

The individual steps of the pipeline are separated by the | (pipe) symbol. In the first step, we just read the file and print it on STDOUT. In the second step, we filter out everything but the dogs and get:

big yellow dog
small white dog

In the third step, we select the second field (fields are separated by spaces) and get the colours of our dogs:

yellow
white

In the fourth step, we translate the values to uppercase and get:

YELLOW
WHITE

So we have a list of the colours of our dogs, printed in capital letters. If we had several dogs of the same colour, we could avoid duplicates simply by adding | sort -u to the pipeline (after the cut step).
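
The deduplicating variant can be sketched like this; the second yellow dog is an assumption added here so that there is actually a duplicate colour to remove:

```shell
# Recreate a small sample file; the extra yellow dog is an assumption.
printf '%s\n' \
  'big yellow dog' \
  'small yellow dog' \
  'small white dog' > animals.txt

# The original pipeline with `sort -u` inserted after the cut step:
cat animals.txt | grep dog | cut -d ' ' -f 2 | sort -u | tr a-z A-Z
# WHITE
# YELLOW
```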

The great parts

The authors of the cat, grep, cut or tr programs do not have to know anything about cats and dogs or about our business domain. They can focus on their own tasks: reading files, filtering by regular expressions, extracting substrings and converting text. And they do these tasks well, without being distracted by any animals.

And we do not have to know anything about low-level programming in C, or compile anything. We simply build a pipeline in a shell (e.g. GNU Bash) from existing programs and focus on our business logic. And we do it well, without being distracted by any low-level issues.

Each program used in the pipeline can be written in a different programming language, and they will still work together. Tools written in C, C++, D, Java, Lisp, Perl, Python, Rust or any other language can be combined, so the optimal language can be used for each task.

The pipeline is reusable, which is a big advantage compared to ad-hoc operations done in CLI or GUI tools. We can feed different data (with the same structure, of course) into the pipeline and get the desired result.
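
One way to make the pipeline reusable is to wrap it in a small script; dog-colours.sh is a hypothetical name, not from the article:

```shell
#!/bin/sh
# dog-colours.sh (hypothetical name): reads animal records on stdin
# and prints the colours of the dogs in uppercase.
grep dog | cut -d ' ' -f 2 | tr a-z A-Z
```

Then any file with the same structure can be fed into it, e.g. ./dog-colours.sh < animals.txt.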

Individual steps of the pipeline can be added, removed or exchanged. We can also debug the pipeline and check what each step produces (e.g. use tee to copy the intermediate outputs to a file, or execute only the first few steps of the pipeline).
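
For example, tee can tap the stream between two steps; step2.txt is a hypothetical file name for the intermediate output:

```shell
# Run the pipeline, but also save what the grep step produced:
cat animals.txt | grep dog | tee step2.txt | cut -d ' ' -f 2 | tr a-z A-Z

# Inspect the intermediate output afterwards:
cat step2.txt
```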

The pitfalls

This simple example looks quite flawless, but it is actually very brittle.

What if we have a very big cat, described by this line in our file?

dog-sized red cat

In the second step of the pipeline (grep), this record will be included and the final result will be:

RED
YELLOW
WHITE

This is an unexpected and unwanted result. We do not have a RED dog; this is just an accident. The same would happen if we had a monkey of a doggish colour.

This problem is caused by the fact that grep dog matches lines containing the word dog regardless of its position (first, second or third field). Sometimes we can avoid such problems with a somewhat more complicated regular expression and/or by using Perl, but then our pipeline would not be as simple and legible as before.
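
A sketch of such a workaround (an assumption, not from the article): anchor the pattern so that dog must be the last word on the line, or test the third field explicitly with awk:

```shell
printf '%s\n' \
  'dog-sized red cat' \
  'big yellow dog' \
  'small white dog' > animals.txt

# Anchored regular expression: "dog" must be the final field.
grep ' dog$' animals.txt | cut -d ' ' -f 2 | tr a-z A-Z
# YELLOW
# WHITE

# Or compare the third field directly:
awk '$3 == "dog"' animals.txt
```

Note that both variants still assume exactly three space-separated fields, so they do not fix the multi-word colour problem described below.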

What if we have a turtle that has a lighter colour than the other turtles?

small light green turtle

If we do grep turtle, it will work well in this case, but our pipeline will fail in the third step, where cut will select only light (instead of light green). The final result will be:

GREEN
LIGHT

This is definitively wrong, because the second turtle is not LIGHT, it is LIGHT GREEN. The problem is caused by the fact that we do not have well-defined separators between the fields. Sometimes we can avoid such problems with restrictions/presumptions, e.g. that a colour must not contain a space character (we could replace spaces with hyphens). Or we could use some other field delimiter, e.g. ;, | or ,. But then we still could not use that character in the field values. So we must invent some kind of escaping (where e.g. \; is not a separator but part of the field value) or add quotes/apostrophes (which still requires escaping, because what if we have e.g. a name field containing an apostrophe?). And parsing such input with classic tools and regular expressions is not easy and sometimes not even possible.
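
A short sketch of the delimiter workaround and its limit; the semicolon-separated format is an assumption, not from the article:

```shell
# With ';' as separator, a colour may itself contain spaces:
printf '%s\n' \
  'medium;green;turtle' \
  'small;light green;turtle' > animals.csv

cut -d ';' -f 2 animals.csv
# green
# light green

# But cut knows nothing about escaping, so a literal "\;" inside a
# value is still treated as a separator and splits the field:
printf 'odd\\;name;green;turtle\n' | cut -d ';' -f 2
# name
```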

There are also other problems, like character encoding, missing metadata (e.g. field names and types), joining multiple files (Is there always a new-line character at the end of a file? Or some nasty BOM at the beginning?) or passing several types of data in a single stream (besides the list of animals, we might also have e.g. a list of foods or a list of our staff, where each list has different fields).

