
Introduction to Apache Airflow and its Components

source link: https://blog.knoldus.com/intro-apache-airflow-components/
Reading Time: 3 minutes
Apache Airflow

What is Apache Airflow?

Apache Airflow is a free, open-source platform for managing complex workflows and data processing pipelines. It automates and monitors workflows as scheduled jobs, and it lets us configure and schedule our pipelines according to our needs while keeping the overall process simple and streamlined.

Why do we need Apache Airflow?

Let us assume a use case where we want to trigger a data pipeline every day at a given time. The pipeline might include the following steps: downloading the data, processing it, and finally storing it.

To carry out these tasks, the pipeline might rely on external APIs and databases, and we have to ensure that they are available whenever the pipeline runs so that it can succeed.

But what happens if the database is unreachable or the API we use to fetch the data is down? The pipeline will fail, and the problem multiplies when there are not one but hundreds of data pipelines running simultaneously. This is exactly what Apache Airflow addresses: with Airflow we can manage our data pipelines and execute our tasks reliably, monitoring them and retrying them automatically.
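As a rough illustration of such a pipeline (the DAG id, task ids, callables, and schedule below are placeholders invented for this sketch, written in Airflow 2.x style; they are not taken from the article), the daily download-process-store workflow with automatic retries could look roughly like this:

# A minimal sketch of the daily download -> process -> store pipeline (Airflow 2.x style).
# The DAG id, task ids, and callables are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def download_data():
    ...  # e.g. call the external API here

def process_data():
    ...  # e.g. transform the downloaded data

def store_data():
    ...  # e.g. write the result to the database

with DAG(
    dag_id="daily_data_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # trigger the pipeline every day
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},  # automatic retries
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_data", python_callable=download_data)
    process = PythonOperator(task_id="process_data", python_callable=process_data)
    store = PythonOperator(task_id="store_data", python_callable=store_data)

    download >> process >> store  # run the three steps in order

If any step fails because an external system is unavailable, Airflow retries it according to the retry settings and surfaces the failure in the UI.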

Core components of Airflow

At its core, Airflow is a queueing system built around a metadata database. The scheduler uses the state of queued tasks stored in the database to decide the order in which further tasks are added to the queue. There are four main components of Apache Airflow, plus the workers that actually run the tasks:

Web Server

The web server is in charge of providing the user interface. It also allows us to track job status and read logs from remote file storage.

Scheduler

The scheduler handles scheduling the jobs: it decides which tasks to execute, and when and where to execute them. It also decides the execution priority.

Metastore

The metastore is a database that holds all the metadata related to Airflow and our data. It governs how the other components interact with each other and stores information about the state of each task.

Executor

The executor is a process tightly coupled to the scheduler; it determines the worker process that will actually execute each task.

Worker

The worker is the process in which the tasks are executed.

Basic Apache Airflow concepts

Operators

A single task in a workflow is described by an operator. Operators are typically (but not always) atomic, which means they can stand alone and do not require resources from other operators.

DAG (Directed acyclic graph)

A DAG is a collection of small tasks that join together to perform a bigger job. It describes how to run a workflow: all the tasks we want to run, organized in a way that defines their relationships and dependencies.

A typical DAG might have four tasks A, B, C, and D; the DAG defines the order in which each task will execute and what its dependencies are.
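As a sketch of such a DAG (the DAG id is made up for this example, and EmptyOperator requires Airflow 2.3 or later), four tasks where B and C depend on A and D depends on both could be declared like this:

# Sketch of a DAG with four tasks A, B, C, and D (EmptyOperator requires Airflow 2.3+).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="example_abcd", start_date=datetime(2022, 1, 1), schedule_interval=None):
    a = EmptyOperator(task_id="A")
    b = EmptyOperator(task_id="B")
    c = EmptyOperator(task_id="C")
    d = EmptyOperator(task_id="D")

    # A runs first, B and C run in parallel after A, and D runs once both finish.
    a >> [b, c] >> d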

Task

A task is the basic unit of execution. Each task may have upstream or downstream dependencies defined, and a key part of working with tasks is declaring how they relate to each other.
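For instance (the DAG id and task ids are illustrative), an upstream/downstream relationship can be declared either with the bitshift syntax or with the equivalent set_upstream/set_downstream methods:

# Minimal sketch of declaring an upstream/downstream dependency (ids are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="dependency_demo", start_date=datetime(2022, 1, 1), schedule_interval=None):
    extract = EmptyOperator(task_id="extract")
    load = EmptyOperator(task_id="load")

    extract >> load               # "extract" is upstream of "load"
    # load.set_upstream(extract)  # equivalent method form of the same dependency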

Operators

An operator is a template for a predefined task that we can declare inside a DAG.
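For example (the DAG id, task ids, and commands below are illustrative), built-in operators such as BashOperator and PythonOperator can be declared inside a DAG like this:

# Sketch of two commonly used operators declared inside a DAG (ids and commands are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def greet():
    print("hello from a PythonOperator task")

with DAG(dag_id="operator_examples", start_date=datetime(2022, 1, 1), schedule_interval=None):
    say_date = BashOperator(task_id="say_date", bash_command="date")
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)

    say_date >> say_hello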

Sensors

Sensors are special operators that wait for an event or condition to occur before letting downstream tasks run. They can operate in different modes: poke (the default), reschedule, and the smart sensor service.
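As a sketch (the DAG id and file path are placeholders), a FileSensor can wait for a file to appear before the rest of the pipeline continues; mode="reschedule" frees the worker slot between checks:

# Sketch of a sensor that waits for a file before downstream work runs (path and ids are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="sensor_example", start_date=datetime(2022, 1, 1), schedule_interval=None):
    wait_for_file = FileSensor(
        task_id="wait_for_input_file",
        filepath="/tmp/input.csv",
        poke_interval=60,      # check for the file every 60 seconds
        mode="reschedule",     # free the worker slot between checks ("poke" is the default mode)
    )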

Benefits of using Apache Airflow

Dynamic

Airflow pipelines are defined as code, which makes them dynamic: pipelines can be generated programmatically.
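For instance (the DAG id, task ids, and source names are invented for this sketch), similar tasks can be generated in a plain Python loop:

# Because pipelines are plain Python, similar tasks can be generated in a loop
# (DAG id, task ids, and source names are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="dynamic_example", start_date=datetime(2022, 1, 1), schedule_interval=None):
    for source in ["orders", "customers", "payments"]:
        BashOperator(
            task_id=f"export_{source}",
            bash_command=f"echo exporting {source}",
        )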

Extensible

Another advantage of working with Airflow is that it is simple to define our own operators and executors, which lets the library adapt to the level of abstraction required by a specific environment.

Scalable

Apache Airflow is highly scalable and we can execute as many tasks as we want in parallel.

User-Interface

With the help of the UI, we can monitor our data pipelines and retry our tasks when needed.

