
Deployment of Apache Airflow Using Docker 


Apache Airflow has gained immense popularity as an open-source platform that allows you to programmatically author, schedule, and monitor workflows. However, configuring and managing Airflow manually can be complex. Docker simplifies this process by offering containerized environments for seamless setup and scalability. In this blog post, I’ll walk you through how to set up Airflow using Docker, along with customising the `airflow.cfg` configuration file.

Why Use Docker for Airflow?

Docker simplifies the deployment and scaling of Airflow by encapsulating it within isolated containers. With Docker Compose, you can easily spin up multiple components like the scheduler, web server, and worker with minimal effort.

When to Use Docker for Apache Airflow

Docker is an excellent choice for Apache Airflow in situations where consistency, scalability, and automation are priorities. Here’s when Docker-based Airflow shines:

  1. Consistency Across Environments:
    • Docker ensures that your Airflow setup behaves the same in development, testing, and production, minimising environment-specific issues.
  2. Rapid Setup and Tear-down:
    • Docker allows for quick spin-up of complete Airflow environments (web server, scheduler, database) using tools like Docker Compose, making it easy to prototype or test new features.
  3. Scalable Components in Isolation:
    • Running Airflow services in containers lets you scale and isolate the web server, scheduler, and workers individually.
  4. Microservice-Friendly:
    • In microservice architectures, Docker makes it easy to containerize Airflow and integrate it seamlessly into a containerized ecosystem.
  5. Version Control and Portability:
    • Docker images allow for easier version management, testing new Airflow versions, and rolling out updates with minimal risk, while also providing portability across different infrastructures.
  6. CI/CD Integration:
    • Docker is well-suited for CI/CD pipelines, automating the testing and deployment process in isolated, reproducible environments.

When Not to Use Docker for Apache Airflow

Despite its strengths, Docker may not be the best option in some cases. Here’s when you might consider a native Airflow installation instead:

  1. Heavy Resource Usage:
    • For resource-intensive environments, Docker’s abstraction may add performance overhead, which is undesirable for tasks requiring optimal speed and memory usage.
  2. Custom Resource Management:
    • If you need very fine-tuned control over hardware resources (e.g., CPU, memory), a native installation might offer better optimization.
  3. Persistent Data Challenges:
    • Managing persistent data (e.g., databases, logs) within Docker can be more complex, especially in highly customised or high-availability environments.
  4. Networking Complexity:
    • Docker’s networking layer can complicate setups that require advanced networking configurations (e.g., VPNs, multi-system environments).
  5. Native Infrastructure Compatibility:
    • If your existing infrastructure is not containerized (e.g., traditional VMs or on-prem setups), integrating Docker might add unnecessary complexity.
  6. Security Considerations:
    • Docker shares the host’s kernel, so in environments where high-level isolation or strict security is needed, a native setup might be a better choice.

Key Benefits: Docker vs. a Native airflow.cfg Setup

  1. Isolation
    1. Using Docker: Containers run independently, fully isolating Airflow from other system processes, preventing conflicts with other local installations or services.
    2. Using airflow.cfg: A direct installation might lead to conflicts with other software versions (Python, libraries), requiring careful environment management.
  2. Scalability
    1. Using Docker: Scaling is simplified. You can easily increase the number of workers or instances by running more container replicas, and each component (web server, scheduler, workers) can be scaled independently.
    2. Using airflow.cfg: Scaling requires manual setup and configuration, making it more complex to expand to a multi-node setup, especially when distributed across machines.
  3. Reproducibility
    1. Using Docker: Docker images ensure that Airflow behaves consistently across different environments. Whether you’re in development, testing, or production, the setup remains identical across machines and platforms.
    2. Using airflow.cfg: Maintaining consistent environments across different machines requires manual intervention (e.g., setting up virtual environments or managing dependencies).
  4. Simplicity
    1. Using Docker: Quick and easy setup. With pre-configured Docker images, there’s no need to worry about installing dependencies, configuring databases, or dealing with system-level packages. You can spin up a full Airflow environment with minimal effort.
    2. Using airflow.cfg: Requires more hands-on configuration, such as manually setting up databases, installing required libraries, and configuring services, which can become cumbersome.

Step-by-Step Guide to Implementing Airflow with Docker

Step 1: Installing Docker and Docker Compose

Before setting up Airflow, make sure you have Docker and Docker Compose installed on your system.

Once installed, you can verify your installations using the following commands:

docker --version

docker-compose --version

Step 2: Set Up the Directory Structure

Create a directory where all the Airflow-related files will reside:

mkdir airflow_docker
cd airflow_docker

Inside this directory, create the following folders:

dags/: Store your DAG files.

logs/: Store logs generated by Airflow.

plugins/: Store custom plugins.

config/: Store environment configuration (e.g., your .env file).

mkdir -p ./dags ./logs ./plugins ./config

Step 3: Creating docker-compose.yaml File

To orchestrate multiple Airflow services (web server, scheduler, worker, etc.), create a docker-compose.yaml file:

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.9.1/docker-compose.yaml'

This file sets up the core services for Airflow: a PostgreSQL metadata database, the web server, scheduler, worker, triggerer, and the airflow-init bootstrap service (plus Redis, which the official file uses as the Celery broker).
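
For orientation, here is a heavily trimmed sketch of the file’s shape; the actual 2.9.1 file from the Airflow docs is much longer and should be used as downloaded:

x-airflow-common: &airflow-common
  image: apache/airflow:2.9.1
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    # ...many more settings...
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins

services:
  postgres:            # metadata database
    image: postgres:13
  airflow-webserver:
    <<: *airflow-common
    ports:
      - "8080:8080"
  airflow-scheduler:
    <<: *airflow-common
  airflow-worker:
    <<: *airflow-common
  airflow-triggerer:
    <<: *airflow-common
  airflow-init:        # one-off bootstrap: DB migration and admin user creation
    <<: *airflow-common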

Step 4: Run the Docker Daemon

For Linux – 

  1. Start the Docker daemon:
sudo systemctl start docker
  2. Enable Docker to start automatically at boot (optional):
sudo systemctl enable docker
  3. Check the status of the Docker daemon:
sudo systemctl status docker
  4. Stop the Docker daemon (if needed):
sudo systemctl stop docker

For macOS and Windows – to start or stop the Docker daemon, simply launch or quit Docker Desktop via the user interface.

Step 5: Initialise the Airflow Database

Run the following commands to initialise the database and get Airflow ready:

docker-compose up airflow-init

This will create the necessary tables in the PostgreSQL database and set up your environment (including the default admin user).

Step 6: Running Airflow

Once the initialization is complete, you can spin up the Airflow services using:

docker-compose up -d

This will start all the containers, and you can access the Airflow UI at http://localhost:8080.
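
Optionally, you can verify that all containers are up and healthy with:

docker-compose ps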

Step 7: Accessing the Airflow UI

Once the containers are running, open http://localhost:8080 in your browser to reach the Airflow login page.

For the login, you can find the credentials in the docker-compose file under _AIRFLOW_WWW_USER_USERNAME and _AIRFLOW_WWW_USER_PASSWORD (set in the airflow-init service). By default, both are set to airflow.
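
In the 2.9.1 compose file, that section looks roughly like this:

airflow-init:
  <<: *airflow-common
  environment:
    <<: *airflow-common-env
    _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
    _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}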

Once you log in, you will see a number of example DAGs that ship with Airflow.

Step 8: Testing Your Setup

Create a basic DAG to test your Airflow setup. Add the following Python script in the dags/ folder:

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # replaces the deprecated DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

# A minimal DAG with a single no-op task, scheduled to run once a day.
with DAG('test_dag', default_args=default_args, schedule='@daily', catchup=False) as dag:
    task = EmptyOperator(task_id='empty_task')

Once you place this file in the dags/ folder, navigate to the Airflow UI at `http://localhost:8080`. You should see the DAG appear.

Adding Environment Variables in Airflow with Docker

Step 1: Creating an .env file

Earlier we created a config/ folder; add a .env file there and put all your environment variables in it.
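
For example, config/.env might contain (illustrative values):

AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW_UID=50000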

Step 2: Adding variables in docker compose

Inside your docker-compose file you will find an x-airflow-common block; the environment variables defined there are shared and passed to all the services that need them.

In this example, we have added AIRFLOW__CORE__LOAD_EXAMPLES as an environment variable and read its value from the .env file, as shown below.
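
A trimmed sketch of the relevant part of x-airflow-common:

x-airflow-common: &airflow-common
  environment: &airflow-common-env
    # Read from config/.env; falls back to 'true' if the variable is not set.
    AIRFLOW__CORE__LOAD_EXAMPLES: ${AIRFLOW__CORE__LOAD_EXAMPLES:-true}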

Step 3: Passing env variables to different services

If you have created a new service in your docker-compose file, you just need to add

<<: *airflow-common-env

under its environment: key. This ensures that all the common variables are passed to that service, as shown in the sketch below.
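
A minimal illustration, assuming a hypothetical extra service named my-custom-service:

services:
  my-custom-service:
    <<: *airflow-common
    environment:
      <<: *airflow-common-env
      # Service-specific variables (hypothetical example) go after the merge key.
      MY_EXTRA_SETTING: 'some-value'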

Step 4: Running code with environment variables

To apply these variables, pass the --env-file flag with the path to your .env file to the docker-compose commands:

docker-compose --env-file=config/.env up -d
docker-compose --env-file=config/.env down -v      

Conclusion

By using Docker, you can quickly set up and configure Apache Airflow. Docker makes it easier to manage your environment, while `airflow.cfg` allows fine-tuning of how Airflow behaves. This method also offers flexibility and scalability for both small and large workflows.

Feel free to expand your Airflow environment by adding more services or workers, depending on your needs, all within the simplicity of Docker.

Happy Workflow Automation!
