Argo vs Airflow vs Prefect: How Are They Different

We live at a time when ML and DL software are everywhere. New startups and established companies alike are adopting and integrating AI systems into new and existing workflows to become more productive and efficient. These systems reduce manual tasks and deliver smart, intelligent solutions. But however proficient they are at what they do, all AI systems consist of different modules that must be brought together to build an operational and effective product.

These systems can be broadly divided into five phases, keeping in mind that these phases contain various additional and repetitive tasks:

  1. Data collection
  2. Feature engineering
  3. Modeling (which includes training, validation, testing, and inference)
  4. Deployment
  5. Monitoring

Executing these phases individually can take a lot of time and continuous human effort. These phases must be synchronized and sequentially orchestrated in order to get the best out of them. This can be achieved by task orchestration tools that enable ML practitioners to effortlessly bring together and orchestrate different phases of an AI system.

Phases of AI systems
Phases of AI systems | Source

In this article, we will explore:

  1. What task orchestration tools are?
  2. Three different tools that can help ML practitioners orchestrate their workflows.
  3. A comparison of the three tools.
  4. Which tool to use and when?

Task orchestration tools: what they are and how they are useful

Orchestration tools enable the various tasks in MLOps to be organized and sequentially executed, and they can coordinate many different tasks at the same time. One of the key properties of these tools is the distribution of tasks. Most of them leverage what is known as a DAG, or Directed Acyclic Graph, which you will often come across in this article. A DAG is a graph representation of the tasks that need to be executed.

Explanation of DAG
Graphic explanation of DAG | Source 

A DAG enables the tasks in a pipeline to be distributed in parallel to various modules for processing, which improves efficiency (see the image above). It also enables tasks to be arranged sequentially so that they execute in the proper order and deliver timely results.
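To make the idea concrete, here is a minimal, tool-agnostic sketch in Python (the phase names are purely illustrative, not tied to any of the three tools): each task lists the tasks it depends on, and a topological sort of the graph yields an order in which the tasks can safely run.

# A DAG expressed as "task -> set of tasks it depends on".
from graphlib import TopologicalSorter  # Python 3.9+

dag = {
    "feature_engineering": {"data_collection"},
    "modeling": {"feature_engineering"},
    "deployment": {"modeling"},
    "monitoring": {"deployment"},
}

# static_order() returns the tasks so that every dependency comes first.
order = list(TopologicalSorter(dag).static_order())
print(order)
# ['data_collection', 'feature_engineering', 'modeling', 'deployment', 'monitoring']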

Another important property of these tools is adaptability to agile environments. This allows ML practitioners to incorporate various other tools for monitoring, deployment, analysis, preprocessing, testing, inference, et cetera. An orchestration tool that can orchestrate tasks coming from many different tools can be considered a good one. But this is not always the case: some tools are strictly contained within their own environments, which does not bode well for users trying to integrate third-party applications.

In this article, we will explore three tools – Argo, Airflow, and Prefect – that incorporate these two properties, among others.

TL;DR comparison table 

Here is a table inspired by Ian McGraw’s article, which provides an overview of what these tools offer for orchestration and how they differ from each other in these aspects.

| # | Features | Argo | Airflow | Prefect |
| --- | --- | --- | --- | --- |
| 1 | Fault-tolerant scheduling | | | |
| 2 | UI support | Yes | Yes | Yes |
| 3 | Workflow definition language | YAML | Python | Python |
| 4 | 3rd-party integration | Since Argo is container-based, it doesn't come with pre-installed 3rd-party systems | Supports various 3rd-party integrations | Supports various 3rd-party integrations |
| 5 | Workflows | Dynamic workflows | Static workflows | Dynamic workflows |
| 6 | Accessibility | Open-source | Open-source | Hybrid (open-source and subscription-based) |
| 7 | Parametrized workflows | Has an extensive parameter-passing syntax | Does not have a mechanism to pass parameters | Supports parameters as first-class objects |
| 8 | Kubernetes support | Yes | Yes | Yes |
| 9 | Scalability | Highly parallel | Horizontally scalable | Parallel when using Kubernetes |
| 10 | Community support | Large | Large | Medium |
| 11 | State storage | All states are stored within the Kubernetes workflow | Postgres DB | Postgres DB |
| 12 | Ease of deployment | Medium | Medium | Difficult |
| 13 | Event-driven workflows | | | |
| 14 | Scripts in DAG definition | Argo passes text scripts into containers | Airflow uses a Python-based DAG definition language | Prefect uses a functional, Python-based flow API |
| 15 | Use cases | CI/CD, data processing, infrastructure automation, machine learning, stream processing | ELT, ML workflows, ML automation | Automating data workflows (ELT), ML workflow orchestration, CI/CD |

Now let’s explore each of these tools in more detail under three primary categories: 

  1. Core concepts
  2. Features they offer
  3. Why use it?

Core concepts

All three tools are built on a set of concepts or principles around which they function. Argo, for instance, is built around two concepts, Workflows and Templates, which form the backbone of its system. Likewise, Airflow is built around the Webserver, Scheduler, Executor, and Database, while Prefect is built around Flows and Tasks. It is important to know what these concepts mean, what they offer, and how they benefit us.

Before going into the details, here is a brief summary of the concepts. 

Properties of the Concepts

Argo

It has two concepts: Workflows and Templates. Essentially, the Workflow is the YAML config file. It provides structure and robustness to the workflow, since DAGs are used to manage the workflows. Templates, on the other hand, are the functions that need to be executed.
Workflows are both static and dynamic, meaning that you can modify steps on the go.

Airflow

It has four concepts: Webserver, Scheduler, Executor, and Database. They divide the whole process into different segments, and these concepts act as the major components that automate it. This keeps the workflow efficient, since each component relies on the others, which makes it easy to find and report bugs and errors. Furthermore, monitoring is quite easy.
Though Airflow uses DAGs, its workflows are static rather than dynamic.

Prefect

It leverages two concepts: Flows and Tasks. Prefect uses DAGs that are defined as flow objects in Python, which provides the flexibility and robustness to define complex pipelines.
Tasks are like templates in Argo: they define a specific function that needs to be executed, again in Python.
Because Prefect uses Python as its main programming language, it is easy to work with.

Summary of the concepts

Now, let’s understand these concepts in detail. 

Argo

Argo uses two core concepts:

  1. Workflow
  2. Templates

Workflow

In Argo, the workflow happens to be the most integral component of the whole system. It has two important functions: 

  1. It defines the tasks that need to be executed.
  2. It stores the state of the tasks, which means that it serves as both a static and a dynamic object.

Workflow is defined in the workflow.spec configuration file. It is a YAML file that consists of a list of templates and entry points. The Workflow can be considered as a file that hosts different templates. These templates define the function that needs to be executed. 

As mentioned earlier, Argo leverages the Kubernetes engine for workflow synchronization, and the configuration file uses the same syntax as Kubernetes. The workflow YAML file has the following dictionaries or objects (a complete minimal example follows the list below):

  1. apiVersion: This is where you define the name of the doc or API.
  2. kind: It defines the type of Kubernetes object that needs to be created. For instance, if you want to deploy an app you can use Deployment; at other times you might use Service. In this case, we use Workflow.
  3. metadata: It enables us to define unique properties for that object, that could be a name, UUID, et cetera. 
  4. spec: It enables us to define specifications concerning the Workflow. These specifications would be entry points and templates. 
  5. templates: This is where we can define the tasks. The template can contain the docker image and various other scripts. 
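Putting these fields together, here is a minimal, hedged example of a complete Workflow manifest, modeled on the hello-world example from the Argo documentation (the names are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-    # metadata: a unique, generated name for this run
spec:
  entrypoint: whalesay          # spec: which template to start from
  templates:
  - name: whalesay              # templates: the task to execute
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["hello world"]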

Templates 

In Argo, there are two major types of templates, which are further sub-classified into six types. The two major types are definitions and invocators.

Definition

Definition templates, as the name suggests, define the task to run, typically in a Docker container. Definitions are divided into four categories:

  1. Container: It enables users to schedule a workflow step in a container. Since the application is containerized on Kubernetes, the container is defined in the YAML file exactly as it would be in a Kubernetes spec. It is also one of the most used templates.


#source: https://argoproj.github.io/argo-workflows/workflow-concepts/
- name: whalesay
  container:
    image: docker/whalesay
    command: [cowsay]
    args: ["hello world"]
  2. Script: If you want a wrapper around a container, then the script template is perfect. It is similar in structure to the container template but adds a source field, which allows you to define a script in place. You can define any variables or commands based on your requirements. The script is saved into a file and executed for you, and its output can later be referenced as an Argo variable.


#source: https://argoproj.github.io/argo-workflows/workflow-concepts/
- name: gen-random-int
  script:
    image: python:alpine3.6
    command: [python]
    source: |
      import random
      i = random.randint(1, 100)
      print(i)
  3. Resource: It allows you to perform operations like get, create, apply, delete, et cetera directly on the K8s cluster.


#source: https://argoproj.github.io/argo-workflows/workflow-concepts/
- name: k8s-owner-reference
  resource:
    action: create
    manifest: |
      apiVersion: v1
      kind: ConfigMap
      metadata:
        generateName: owned-eg-
      data:
        some: value
  4. Suspend: It introduces a time dimension to the workflow. It can suspend the execution of the workflow for a defined duration or until the workflow is resumed manually.


#source: https://argoproj.github.io/argo-workflows/workflow-concepts/
- name: delay
  suspend:
    duration: "20s"
Invocators

Once the templates are defined, they can be invoked or called on demand by other templates called invocators. These are more like controller templates that control the execution of the defined templates.

There are two types of invocator templates:

  1. Steps: It allows you to define the tasks as a sequence of steps; the ‘steps’ template is available in any workflow YAML file.
  2. Directed acyclic graph: Argo enables its users to manage steps with multiple dependencies in their workflow, which allows different parts of the workflow to execute in parallel in their respective containers. These workflows are managed using a directed acyclic graph, or DAG (a minimal DAG template is sketched after this list). For instance, if you are working on image segmentation and generation for medical purposes, then you can create a pipeline that:
    • Processes the images.
    • Distributes the images (or dataset) to the respective DL models for image segmentation and generation pipeline.
    • Continuously predicts segmentation masks and updates the dataset storage with new images after proper inspection. 
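As a sketch of what such a DAG invocator looks like, here is a hedged example modeled on the diamond-shaped DAG from the Argo documentation; it assumes an echo template is defined elsewhere in the same Workflow. B and C run in parallel once A finishes, and D waits for both.

- name: diamond
  dag:
    tasks:
    - name: A
      template: echo
    - name: B
      dependencies: [A]      # B starts after A
      template: echo
    - name: C
      dependencies: [A]      # C runs in parallel with B
      template: echo
    - name: D
      dependencies: [B, C]   # D waits for both B and C
      template: echo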

Airflow

Feature Pipeline- Airflow
Feature Pipeline | Source

Apache Airflow consists of four main components:

  1. Webserver
  2. Scheduler
  3. Executor
  4. Database
Main components of Apache Airflow
Four main components of Apache Airflow | Source

Webserver

It provides the user with a UI for inspecting, triggering, and debugging all DAGs and tasks, and essentially serves as the entry point to Airflow. The Webserver leverages Python Flask to manage all requests made by the user. It also renders state metadata from the database and displays it in the UI.

Scheduler

It monitors and manages all the tasks and DAGs. It examines the state of the tasks by querying the database to decide the order in which tasks need to be executed. The scheduler's aim is to resolve dependencies and, once they are taken care of, submit the task instances to the executor.

Executor

It runs the task instances which are ready to run. It executes all the tasks as scheduled by the scheduler. There are four types of executors:

  1. Sequential Executor
  2. Local Executor
  3. Celery Executor
  4. Kubernetes Executor

Metadata Database

It stores the state of the tasks and DAGs, which the scheduler uses to properly schedule task instances. It is worth noting that Airflow uses SQLAlchemy and Object Relational Mapping (ORM) to store this information.
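To see how the four components fit together, here is a minimal, illustrative DAG definition (assuming Airflow 2.4+ and its TaskFlow API; the DAG and task names are made up). The Scheduler parses this file, the Executor runs the resulting task instances, the metadata database records their state, and the Webserver displays it.

from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        # stand-in for a data collection step
        return [1, 2, 3]

    @task
    def transform(values):
        # stand-in for a feature engineering step
        return [v * 2 for v in values]

    transform(extract())

# Instantiating the DAG so the Scheduler can discover it
example_pipeline()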

Prefect

Prefect uses two core concepts: 

  1. Flows
  2. Tasks

Flows

In Prefect, flows are Python objects that can be interacted with; here, the DAG is defined as a flow object. See the image below.

DAG defined as flow objects
DAG defined as flow objects | Source

Flow can be imported and can be used as a decorator, @flow, for any given function. Flows take an existing function and transform it into a Prefect flow function, with the following advantages:

  • The function can be monitored and governed as it is now reported to the API.
  • The activity of the function can be tracked and displayed in the UI.
  • Inputs given to the function can be validated.
  • Various workflow features like retries, distributed execution et cetera can be added to the function.
  • Timeouts can be enforced to prevent unintentionally long-running workflows.

Here is a code block depicting the implementation of a flow object.



#Source: https://github.com/PrefectHQ/prefect
from typing import List

from prefect import flow

@flow(name="GitHub Stars")
def github_stars(repos: List[str]):
    for repo in repos:
        get_stars(repo)

In the code above, the function has been transformed into a flow named “GitHub Stars”. This function is now governed by Prefect's orchestration rules.

Note that all workflows must be defined within a flow function and, likewise, all tasks must be called within a flow. Keep in mind that when a flow is executed, it is known as a flow run.

Tasks

Tasks can be defined as specific pieces of work that need to be executed, for instance, the addition of two numbers. In other words, tasks take an input, perform an operation, and yield an output. Like a flow, a task can be imported and used as a decorator, @task, on a function. Once applied, it wraps the function within the Prefect workflow, with advantages similar to those of a flow. For instance, it can automatically log information about task runs, such as runtime, tags, and final state.

The code below demonstrates how a task is defined: 



#Source: https://github.com/PrefectHQ/prefect
import httpx

from prefect import task

@task(retries=3)
def get_stars(repo: str):
    url = f"https://api.github.com/repos/{repo}"
    count = httpx.get(url).json()["stargazers_count"]
    print(f"{repo} has {count} stars!")

# run the flow!
github_stars(["PrefectHQ/Prefect"])

To sum up, the flow looks for any tasks defined within its body and, once found, creates a computational graph in the same order. It then creates dependencies between tasks whenever the output of one task is used as the input of another.
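Here is a small, hedged sketch of that behaviour (assuming Prefect 2.x; the function names are made up): because the output of extract is passed into total, Prefect records a dependency between the two task runs.

from prefect import flow, task

@task
def extract():
    return [1, 2, 3]

@task
def total(values):
    # depends on extract because it consumes its output
    return sum(values)

@flow
def pipeline():
    values = extract()
    return total(values)

if __name__ == "__main__":
    print(pipeline())  # 6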

Features

All three provide more or less the same features, but some are implemented better than others, and a lot also comes down to how adaptable they are for the user. Just like in the previous section, let's begin with a summary of the features.

| Features | Argo | Airflow | Prefect |
| --- | --- | --- | --- |
| User interface | Has a complete view of the workflow; you can define workflows straight from the UI. | Workflows are very well maintained, with a number of different views. | Similar to Airflow. |
| Deployment style | Supports only Kubernetes-supported environments such as AWS and other S3-compatible services. | Supports Kubernetes-supported environments as well as other third-party environments. | Same as Airflow. |
| Scalability | Parallel | Horizontal | Parallel |
| Accessibility | Open-source | Open-source | Open-source and subscription-based |
| Flexibility | Rigid | Rigid and complicated | Flexible |

Comparison of the features

Let’s start this section by exploring the User Interface. 

User Interface

Argo

For ease of use, Argo Workflows provides a web-based UI to define workflows and templates. The UI serves various purposes, such as:

  • Artifact visualization 
  • Using generated charts to compare Machine Learning pipelines
  • Visualizing results 
  • Debugging
  • It can also be used to define workflows
Argo user interface
Argo UI | Source

Airflow

The Airflow UI provides a clean and efficient design that enables users to interact with the Airflow server, allowing them to monitor and troubleshoot the entire pipeline. It also allows editing the state of tasks in the database and manipulating the behaviour of DAGs and tasks.

Airflow user interface
Airflow UI | Source

The Airflow UI also provides various views for its users, they include:

  • DAGs View
  • Datasets View
  • Grid View
  • Graph View
  • Calendar View
  • Variable View
  • Gantt View
  • Task Duration
  • Code View

Prefect

Prefect, like Airflow, provides an overview that helps you visualize all your workflows, tasks, and DAGs. It provides two ways to access the UI:

  1. Prefect Cloud: It is hosted on the cloud, which enables you to configure your personal accounts and workspaces. 
  2. Prefect Orion UI: It is hosted locally, and it is also open-sourced. You cannot configure it the way you can with Prefect cloud. 
Prefect user interface
Prefect UI | Source

Some additional features of Prefect UI:

  • Displaying run summaries
  • Displaying details of deployed flows
  • Scheduled flows
  • Warning notifications for late and failed runs
  • Detailed information about tasks and workflows
  • Task dependency visualization and the Radar flow view
  • Log details

Deployment Style

Argo

Argo is a native Kubernetes workflow engine, which means it:

  1. Runs on containers.
  2. Runs on Kubernetes-supported pods.
  3. Is easy to deploy and scale.

On the downside:

  • Implementation is hard since it uses a configuration language (YAML).

Airflow

  1. Supports Kubernetes as well as other third-party integrations.
  2. It runs on containers as well.
  3. Implementation is easy.

The downside of Airflow is:

  • It is not parallel-scalable.
  • Deployment needs extra effort, depending on the cloud facility you choose.

Prefect

Lastly, Prefect is a combination of both Argo and Airflow:

  1. It can run on containers and Kubernetes pods.
  2. It is highly parallel and efficient.
  3. It supports fault-tolerant scheduling.
  4. It is easy to deploy.
  5. It also supports third-party integrations.

When it comes to the downside:

  • It does not support open-source deployment with Kubernetes. 
  • Deployment is difficult. 

Scalability

When it comes to scalability, Argo and Prefect are highly parallel, which makes them efficient; Prefect in particular can leverage various third-party integrations, making it the best of the three in this regard.

Airflow, on the other hand, is horizontally scalable, i.e., the number of active workers is equal to the maximum task parallelism.

Accessibility

All three are open-sourced, but Prefect also comes with a subscription-based service. 

Flexibility

Argo and Airflow aren't as flexible as Prefect. The former is Kubernetes-native and confined to that environment, making it rigid, while the latter is complicated, requiring well-defined and structured templates, which makes it not very well suited to an agile environment.

Prefect, on the other hand, enables you to create dynamic dataflows in native Python, without requiring you to define a DAG. Any Python function can be transformed into a Prefect flow or task. This ensures flexibility.
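As a hedged illustration (assuming Prefect 2.x; the names are made up), the number of task runs below is decided at runtime by ordinary Python control flow rather than by a pre-declared DAG:

from prefect import flow, task

@task
def process(item: str) -> str:
    return item.upper()

@flow
def dynamic_pipeline(items: list[str]):
    # The "DAG" emerges from this loop at runtime.
    return [process(i) for i in items]

if __name__ == "__main__":
    print(dynamic_pipeline(["a", "b", "c"]))  # ['A', 'B', 'C']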

Why use these tools?

So far, I’ve compared the basic concepts and features that these tools possess. Now let me give reasons as to why you can use any of these tools in your project.  

Argo

Here are some of the reasons why you should use Argo:

  1. It is a Kubernetes-native workflow tool that enables you to run each step in its own Kubernetes pod.
  2. It is easy to scale because steps can be executed in parallel.
  3. Workflow templates offer reusability.
  4. Similarly, artifact integrations are also reusable.
  5. The DAG is dynamic for each run of the workflow.
  6. Low-latency scheduler.
  7. Event-driven workflows.

Airflow

Reasons for you to use Airflow:

  1. It enables users to connect with various technologies.
  2. It offers rich scheduling and easy-to-define pipelines.
  3. Pythonic integration is another reason to use Airflow.
  4. You can create custom components as per your requirements.
  5. It allows rollback to previous versions, as workflows are stored.
  6. It has a well-defined UI.
  7. Multiple users can write a workflow for a given project, i.e. it is shareable.

Prefect

Prefect is a well-planned orchestration tool for MLOps. It is Python-native and requires you to put effort into the engineering side of things. One area where Prefect shines is data processing pipelines: it can be used to fetch data, apply the necessary transformations, and monitor and orchestrate the required tasks.

When it comes to tasks related to machine learning, it can be used to automate the entire data flow. 

Some other reasons to use Prefect are:

  1. Provides excellent security, as it keeps your data and code private.
  2. Enhanced UI and notifications that come directly to your email or Slack.
  3. It can be used with Kubernetes and Docker.
  4. Efficient parallel processing of tasks.
  5. Dynamic workflows.
  6. Allows many third-party integrations.
  7. Prefect uses a GraphQL API, enabling it to trigger workflows on demand.

How to decide?

Choosing the right tool for your project depends on what you want and what you already have, but here are some criteria that can help you decide which tool will be appropriate for you.

Argo

You can use Argo:

  • If you want to set up a workflow based on Kubernetes.
  • If you want to define your workflow as DAGs.
  • If your dataset is huge and model training requires highly parallel and distributed training. 
  • If your task is complex.
  • If you are well-versed in YAML files. Even if you are not, learning YAML is not difficult.
  • If you want to use a Kubernetes-enabled cloud platform like GCP or AWS.

Airflow

  • If you want to incorporate a lot of other 3rd party technology like Jenkins, Airbyte, Amazon, Cassandra, Docker, et cetera. Check the list of supported third-party extensions.
  • If you want to use Python to define the workflow.
  • If you want to define your workflow as DAGs.
  • If your workflow is static.
  • If you want a mature tool because Airflow is quite old. 
  • If you want to run tasks on schedule.

Prefect

  • If you want to incorporate a lot of other 3rd party technology.
  • If you want to use Python to define the workflow.
  • If your workflow is dynamic.
  • If you want to run tasks on schedule.
  • If you want something light and modern.

I found a thread on Reddit concerning the use of Airflow and Prefect. Maybe this can give you some additional information as to which tool to use.

“…The pros of Airflow are that it’s an established and popular project. This means it’s much easier to find someone who has done a random blog that answers your question. Another pro is that it’s much easier to hire someone with Airflow experience than Prefect experience. The cons are that Airflow’s age is showing, in that it wasn’t really designed for the kind of dynamic workflows that exist within modern data environments. If your company is going to be pushing the limits in terms of computation or complexity, I’d highly suggest looking at Prefect. Additionally, unless you go through Astronomer, if you can’t find an answer to a question you have about Airflow, you have to go through their fairly inactive slack chat.

The pros of Prefect are that it’s much more modern in its assumptions about what you’re doing and what it needs to do. It has an extensive API that allows you to programmatically control executions or otherwise interact with the scheduler, which I believe Airflow has only recently implemented out of beta in their 2.0 release. Prior to this, it was recommended not to use the API in production, which often leads to hacky workarounds. In addition, Prefect allows for a much more dynamic execution model with some of its concepts by determining the DAG that gets executed at runtime and then handing off the computation/optimization to other systems (namely Dask) to actually execute the tasks. I believe this is a much smarter approach, as I’ve seen workflows get more and more dynamic over the years.

If my company had neither Airflow nor Prefect in place already, I’d opt for Prefect. I believe it allows for much better modularization of code (which can then be tested more aggressively / thoroughly), which I already think is worth its weight in gold for data-driven companies that rely on having well-curated data in place to make automated product decisions. You can achieve something similar with Airflow, but you really need to go out of your way to make something like that happen, whereas in Prefect it kind of naturally comes out.” 

Here is a useful chart illustrating the popularity of different orchestration tools based on GitHub stars.

Chart illustrating the popularity of different orchestration tools
The popularity of different orchestration tools based on GitHub stars | Source

Conclusion

In this article, we discussed and compared the three popular tools for task orchestration, namely Argo, Airflow, and Prefect. My main aim was to help you understand these tools on the basis of three important factors i.e. Core concepts, Features offered, and why you should use them. The article also compared the three tools on some of the important features they offer, which could help you make the decision of choosing the most appropriate tool for your project.

I hope this article was informative and gave you a better understanding of these tools. 

Thanks!!! 


Nilesh Barla

I am the founder of a recent startup, perceptronai.net, which aims to provide solutions in medical and material science through our deep learning algorithms. I also read and think a lot, and sometimes I put those thoughts into a painting or a piece of music. And when I need to catch a breath, I go for a run.
