Airflow – The Data Orchestrator

Airflow has been our daily driver for production data pipelines for the past two years. As we envisioned moving away from a vendor-locked Big Data ecosystem to a purely Apache one, Airflow was the obvious choice for scheduling ETL pipelines. Since the project's inception in October 2014 by Maxime Beauchemin at Airbnb, it has garnered a massive community following & support. The project joined the Apache Software Foundation's Incubator program in March 2016 and was announced as a top-level Apache project in January 2019. It is actively maintained, with plenty of new features & bug fixes in each release.

What is Airflow?

Airflow is a workflow orchestration platform that lets users develop, schedule & monitor workflows. It is built in Python and is a highly pluggable open-source platform that integrates with a wide range of technologies. A workflow itself is defined as Python code that can source data and connect to various data engines for transformation & processing; a minimal example follows the list of benefits below. Over our period of usage we have come across the following benefits:

  • Python makes it easy to use, even for someone at a beginner level in Python coding.
  • Highly extensible due to a pluggable framework that allows integration with other technologies.
  • Strong open-source community support; just look at the documentation & release notes.
  • Extensive & informative user interface.
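
To make the "workflow as Python code" idea concrete, here is a minimal sketch of a DAG with a single task. It assumes an Airflow 2.x installation; the DAG id, schedule and callable are illustrative only.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def say_hello():
        # Placeholder task logic; a real task would pull or push data here.
        print("Hello from Airflow!")


    with DAG(
        dag_id="hello_airflow",           # illustrative DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",       # run once a day
        catchup=False,                    # do not backfill past runs
    ) as dag:
        PythonOperator(task_id="say_hello", python_callable=say_hello)

Dropping a file like this into the dags/ folder is enough for the scheduler to pick it up and for the workflow to appear in the web UI.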

Where is Airflow used?

Airflow is best suited for workflows/tasks that need to be scheduled and automated and can be defined in Python code; a sketch of one such scheduled workflow follows the list below.

  • Extract – Transform – Load (ETL) data operations
  • Business operations – generating business reports & emailing them to stakeholders at specific times
  • Infrastructure Automation – creating, destroying & modifying infrastructure, data backups
  • Data Science & MLOps – gathering data, preprocessing data, building data sets
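
As a sketch of the "business reports at a specific time" scenario above, the following DAG (assuming Airflow 2.x and a configured SMTP connection) generates a report every morning and emails it; the recipient address and the report logic are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.email import EmailOperator
    from airflow.operators.python import PythonOperator


    def build_report():
        # Placeholder: query the source systems and build the report body.
        return "Daily business report is ready."


    with DAG(
        dag_id="daily_business_report",
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 6 * * *",  # every day at 06:00
        catchup=False,
    ) as dag:
        generate = PythonOperator(task_id="generate_report", python_callable=build_report)

        notify = EmailOperator(
            task_id="email_stakeholders",
            to="stakeholders@example.com",  # placeholder recipient
            subject="Daily business report",
            # The report text returned by the previous task is pulled from XCom.
            html_content="{{ ti.xcom_pull(task_ids='generate_report') }}",
        )

        generate >> notify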

Airflow can be set up as a single standalone instance or as a multi-node Airflow cluster with worker nodes for processing.

Before Airflow?

Before Apache Airflow exploded as the go-to choice for workflow orchestration, there were various other tools, each with its own strengths and weaknesses:

Apache Oozie: Primarily designed for Hadoop workflows, it was well suited for batch processing and data movement within the Hadoop ecosystem. However, it was inflexible and less user-friendly for more generalized data orchestration tasks.

Luigi: Built by the developers at Spotify, it was a simple tool for Python-based workflows. Workflows and tasks were encompassed in a single Docker image, which limited their visibility.

Cron: A classic Linux scheduling tool that can be used to trigger scripts and commands at specific intervals. Not strictly an orchestration tool, but it could be leveraged to spawn a script containing a basic workflow.

Jenkins: A great choice for CI/CD orchestration, but it can also serve data-centric pipelines. Data extraction & loading can be accomplished by writing Groovy code, but managing large-scale workflows can quickly become complex.

Benefits of Using Airflow

Programmatic Workflow Definition: Workflows are created as Python code, which can be version controlled via Git. This further enables developers to share and collaborate on workflow design & flow.

Scalability and Flexibility: Airflow handles large workflows with ease. A workflow can be as complex as fetching data from a streaming queue & batch processing it at regular intervals before writing it to an HDFS store or GCS. This ability to handle varied workloads is enabled via Airflow operators, executors & queues, as sketched below.
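
A rough sketch of how operators & queues come together, assuming Airflow 2.x with the CeleryExecutor; the DAG name, commands and the "heavy" queue name are assumptions. Tasks are ordinary operators, and the queue argument routes a heavier task to a dedicated pool of worker nodes.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="queue_routing_example",   # illustrative DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        # Lightweight task, runs on the default worker queue.
        pull_batch = BashOperator(
            task_id="pull_batch",
            bash_command="echo 'pulling a batch from the streaming queue'",
        )

        # Heavier processing task, routed to a dedicated Celery queue
        # served by bigger worker nodes ("heavy" is an assumed queue name).
        process_batch = BashOperator(
            task_id="process_batch",
            bash_command="echo 'processing the batch before writing to HDFS/GCS'",
            queue="heavy",
        )

        pull_batch >> process_batch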

Open-Source and Customizable: Being an open-source project benefits Airflow users through zero licensing costs, and it gives them the ability to contribute to or modify Airflow's source code to suit their requirements.

Extensive Community and Support: Airflow has an active community for troubleshooting and feature requests, along with extensive documentation and tutorials.

Use cases of Airflow

Data Pipelines: Creating & managing data pipelines is the use case Airflow is best suited for – building tasks for ETL (Extract, Transform, Load) processes and workflows for data warehousing and business intelligence reporting. A minimal ETL sketch follows below.
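
Here is a minimal ETL sketch using the TaskFlow API introduced in Airflow 2.0; the sample rows and the aggregation are placeholders standing in for real extract, transform & load logic.

    from datetime import datetime

    from airflow.decorators import dag, task


    @dag(schedule_interval="@daily", start_date=datetime(2023, 1, 1), catchup=False)
    def simple_etl():
        @task
        def extract():
            # Placeholder source; a real pipeline would query a database or API.
            return [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": 75.5}]

        @task
        def transform(rows):
            # Placeholder transformation: aggregate the order amounts.
            return sum(row["amount"] for row in rows)

        @task
        def load(total):
            # Placeholder sink; a real pipeline would write to a warehouse table.
            print(f"Total order amount: {total}")

        load(transform(extract()))


    simple_etl()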

Machine Learning & AI Workflows: With support for and integration with popular ML frameworks, Airflow is widely used for model training, deployment, and monitoring.

DevOps and Automation: This has largely been the forte of Jenkins, but Airflow can also be leveraged for automating deployment scripts & Continuous Integration/Continuous Deployment (CI/CD) pipelines.

Other Scenarios: Report generation and scheduled report delivery, automation of file management and transfers.

In follow-up articles we will delve further into the architecture and product offerings of Airflow.
