  • Igli

Deploy Airflow and Metabase in Kubernetes using Infrastructure-as-Code

A step-by-step guide to deploying Airflow and Metabase in GCP with Terraform and Helm providers.



With cloud platforms now in widespread use, many companies rely on tools to automate their routine tasks. Terraform is one of them. It manages cloud resources as code that, once written, can be reused across multiple environments and regions, following the DRY (Don't Repeat Yourself) principle. Versioning this configuration helps us keep track of the resources' state and the changes applied to them.


The aim of this post is to provide a practical example of managing the entire infrastructure lifecycle with Terraform on GCP (Google Cloud Platform), deploying Airflow and Metabase in Kubernetes.


These technologies support the development of better, more flexible, and collaborative solutions that promote the use of best practice software engineering methods and avoid code collisions.

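In this blog post we are going to show how to:

  • Enable the required GCP services

  • Set up a VPC network and a GKE cluster

  • Create a private Cloud SQL Postgres instance

  • Install the Airflow and Metabase Helm charts with the Terraform Helm Provider

  • Deploy multiple environments with Terraform workspaces and GitHub Actions
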

To simplify creation of the required resources for the infrastructure we will be using the official Terraform modules for Google Cloud, primarily terraform-google-kubernetes-engine and terraform-google-sql-db.


For Airflow and Metabase deployment, we will be using Terraform Helm Provider so that the entire infrastructure and application deployment is as streamlined as possible.


Prerequisites

  • At least one GCP project. The project cannot be created with Terraform unless the GCP account belongs to an organization.

  • A bucket for the Terraform backend. Remote storage is required to store the Terraform state.

  • A service account for Terraform to authenticate to GCP.


You can find the resources and code used in this post in our GitHub repository.


Infrastructure Setup


1. GCP Services


Note that the Cloud Resource Manager API must be enabled before the rest of the services.
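
As a rough illustration of this ordering, here is a minimal sketch that uses the plain `google_project_service` resource, which may differ from what the repository actually uses. The variable name `project_id` and the list of services are assumptions made for the example.

```hcl
variable "project_id" {
  description = "ID of the target GCP project (hypothetical variable name)"
  type        = string
}

# Enable the Cloud Resource Manager API first.
resource "google_project_service" "resource_manager" {
  project            = var.project_id
  service            = "cloudresourcemanager.googleapis.com"
  disable_on_destroy = false
}

# Enable the remaining services. Reading the project from the resource above
# creates an implicit dependency, so these are enabled only afterwards.
resource "google_project_service" "services" {
  # Illustrative list; adjust it to the services your setup needs.
  for_each = toset([
    "compute.googleapis.com",
    "container.googleapis.com",
    "sqladmin.googleapis.com",
    "servicenetworking.googleapis.com",
    "secretmanager.googleapis.com",
  ])

  project            = google_project_service.resource_manager.project
  service            = each.value
  disable_on_destroy = false
}
```

Setting `disable_on_destroy = false` leaves the APIs enabled when the Terraform resources are destroyed, which avoids breaking anything else in the project that relies on them.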


2. VPC Network


Here we set up a VPC with a single subnet and two secondary CIDR IP ranges for GKE pods and GKE services.
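
A network like this could be sketched with the `terraform-google-modules/network/google` module as shown below. The network name, subnet name, range names, and CIDR blocks are placeholders chosen for the example, and `project_id` and `region` are assumed variable names.

```hcl
variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "europe-west1" # placeholder region
}

module "network" {
  source = "terraform-google-modules/network/google"
  # Pin a module version in real use.

  project_id   = var.project_id
  network_name = "data-platform-vpc" # hypothetical name

  # A single subnet for the GKE nodes.
  subnets = [
    {
      subnet_name   = "gke-subnet"
      subnet_ip     = "10.0.0.0/20"
      subnet_region = var.region
    }
  ]

  # Two secondary ranges on that subnet: one for pods, one for services.
  secondary_ranges = {
    "gke-subnet" = [
      {
        range_name    = "gke-pods"
        ip_cidr_range = "10.4.0.0/14"
      },
      {
        range_name    = "gke-services"
        ip_cidr_range = "10.8.0.0/20"
      },
    ]
  }
}
```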


3. GKE Cluster



Certain values in the gke module are references to outputs of the network module. Besides retrieving data, these references create an implicit dependency between the modules, so Terraform provisions the network before the cluster.


The same technique is used in step 1 to make sure the Resource Manager API is enabled before the other services.
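
The sketch below shows what such references might look like with the `terraform-google-modules/kubernetes-engine/google` module. It assumes a `module "network"` block (an instance of the Google network module) exists in the same configuration, and that the secondary ranges are named `gke-pods` and `gke-services`. The cluster name and variable names are also assumptions.

```hcl
variable "project_id" {
  type = string
}

variable "region" {
  type = string
}

module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google"
  # Pin a module version in real use.

  project_id = var.project_id
  name       = "data-platform-gke" # hypothetical cluster name
  region     = var.region

  # These references to the network module's outputs are what create the
  # implicit dependency: the VPC and subnet are created before the cluster.
  network    = module.network.network_name
  subnetwork = module.network.subnets_names[0]

  # Names of the secondary ranges defined on the subnet (assumed names).
  ip_range_pods     = "gke-pods"
  ip_range_services = "gke-services"
}
```

No explicit `depends_on` is needed here, because Terraform derives the ordering from the references themselves.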


4. Cloud SQL Postgres Instance



In this step we set up a VPC peering connection for a private Postgres instance.


There are two additional databases, airflow-db and metabase-db, for Airflow and Metabase respectively.

There is also an additional user, which we use to connect to this instance.
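
To make the pieces concrete, here is a minimal sketch written with plain Google provider resources instead of the terraform-google-sql-db module used in this post. The database names come from the text above. The resource names, instance name, user name, tier, Postgres version, and the `network_id`, `region`, and `db_user_password` variables are assumptions.

```hcl
variable "network_id" {
  description = "ID or self link of the VPC to peer with (hypothetical variable)"
  type        = string
}

variable "region" {
  type = string
}

variable "db_user_password" {
  type      = string
  sensitive = true
}

# Reserve an internal range and peer the VPC with Google's service network.
resource "google_compute_global_address" "private_ip_range" {
  name          = "sql-private-ip-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = var.network_id
}

resource "google_service_networking_connection" "private_vpc_connection" {
  network                 = var.network_id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_ip_range.name]
}

# Postgres instance with a private IP only.
resource "google_sql_database_instance" "postgres" {
  name             = "data-platform-postgres"
  database_version = "POSTGRES_14"
  region           = var.region

  # The peering connection must exist before the instance is created.
  depends_on = [google_service_networking_connection.private_vpc_connection]

  settings {
    tier = "db-custom-1-3840"

    ip_configuration {
      ipv4_enabled    = false
      private_network = var.network_id
    }
  }
}

# One database each for Airflow and Metabase.
resource "google_sql_database" "apps" {
  for_each = toset(["airflow-db", "metabase-db"])

  name     = each.value
  instance = google_sql_database_instance.postgres.name
}

# The additional user that the applications connect with.
resource "google_sql_user" "app_user" {
  name     = "app-user" # hypothetical user name
  instance = google_sql_database_instance.postgres.name
  password = var.db_user_password
}
```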


To follow security best practices, we generate the database user credentials and store them in GCP Secret Manager. However, these credentials are still stored in plain text in the Terraform state, and for the time being there is no way to avoid that.
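
One way to generate a credential and store it in Secret Manager is sketched below, using the `random_password` resource from the hashicorp/random provider. The resource names and the secret ID are hypothetical.

```hcl
# Generate a random password for the database user.
resource "random_password" "db_user" {
  length  = 24
  special = false
}

# Create the secret container in Secret Manager.
resource "google_secret_manager_secret" "db_user_password" {
  secret_id = "postgres-user-password" # hypothetical secret ID

  replication {
    # Google provider 5.0 and later; older versions use `automatic = true`.
    auto {}
  }
}

# Store the generated password as a version of that secret.
resource "google_secret_manager_secret_version" "db_user_password" {
  secret      = google_secret_manager_secret.db_user_password.id
  secret_data = random_password.db_user.result
}
```

The value of `random_password.db_user.result` is marked sensitive in plan output, but it is still written to the state file, which is why access to the state has to be restricted.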


Tips: Make sure to:

  1. Store your state in a backend (Google Cloud Storage in this case)

  2. Strictly control who has access to the Terraform backend (ideally only the Terraform service account)
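
For reference, a Google Cloud Storage backend is declared roughly like this. The bucket name and prefix are placeholders; the bucket is the one listed in the prerequisites and must exist before `terraform init` is run.

```hcl
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket" # placeholder bucket name
    prefix = "airflow-metabase"          # placeholder path inside the bucket
  }
}
```

Access to the state can then be limited with IAM on that bucket, for example by granting object access only to the Terraform service account.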


5. Installing Helm Charts with Terraform Helm Provider


We will use Terraform Helm Provider to connect to the newly created GKE cluster and install Airflow and Metabase Helm charts.
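
A minimal sketch of this wiring follows. It assumes the cluster is created by a `module "gke"` block (the terraform-google-kubernetes-engine module) in the same configuration and uses the official Apache Airflow chart. The release name and namespace are placeholders.

```hcl
# Access token of the identity Terraform runs as.
data "google_client_config" "default" {}

provider "helm" {
  # Helm provider 2.x block syntax; in 3.x `kubernetes` is an attribute,
  # written as `kubernetes = { ... }`.
  kubernetes {
    host                   = "https://${module.gke.endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(module.gke.ca_certificate)
  }
}

resource "helm_release" "airflow" {
  name             = "airflow"
  repository       = "https://airflow.apache.org"
  chart            = "airflow"
  namespace        = "airflow"
  create_namespace = true
}
```

A Metabase release would follow the same pattern with a second `helm_release` resource that points at the Metabase chart of your choice.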


Set up git-sync for Airflow DAGs


We need git-sync to import DAGs from a GitHub repository.


To enable git-sync for Airflow:

  1. Generate an RSA SSH key

  2. Add the public key as a deploy key to the Airflow DAGs repository

  3. Set the base64-encoded private key as an environment variable named TF_VAR_airflow_gitSshKey

  4. Additional information such as repo, branch, and subPath can be set in the tfvars file

Here we take advantage of Terraform’s native support for reading input variables from environment variables prefixed with TF_VAR_ to pass in our private SSH key.
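
The steps above could translate into Terraform roughly as follows, assuming the official Apache Airflow chart and its `dags.gitSync` and `extraSecrets` values. The variable `airflow_gitSshKey` matches the environment variable named above. The other variable names, the secret name `airflow-ssh-secret`, and the local value name are assumptions.

```hcl
# Populated from the TF_VAR_airflow_gitSshKey environment variable.
variable "airflow_gitSshKey" {
  description = "Base64-encoded private SSH key for the DAGs repository"
  type        = string
  sensitive   = true
}

# Set these in the tfvars file (hypothetical variable names).
variable "airflow_dags_repo" {
  type = string # e.g. an SSH URL of the form git@github.com:<org>/<repo>.git
}

variable "airflow_dags_branch" {
  type    = string
  default = "main"
}

variable "airflow_dags_subpath" {
  type    = string
  default = ""
}

locals {
  # Chart values that enable git-sync and create the secret holding the key.
  airflow_git_sync_values = yamlencode({
    dags = {
      gitSync = {
        enabled      = true
        repo         = var.airflow_dags_repo
        branch       = var.airflow_dags_branch
        subPath      = var.airflow_dags_subpath
        sshKeySecret = "airflow-ssh-secret"
      }
    }
    extraSecrets = {
      "airflow-ssh-secret" = {
        data = "gitSshKey: '${var.airflow_gitSshKey}'\n"
      }
    }
  })
}

# Pass the result to the Airflow release, for example:
#   values = [local.airflow_git_sync_values]
```

Because the variable is marked `sensitive`, Terraform redacts it in plan and apply output, although the key still ends up in the state.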


At this point, to provision the infrastructure and deploy Airflow and Metabase all that needs to be done is run:


terraform init
terraform plan
terraform apply

Airflow and Metabase can be accessed at the endpoints listed under:

Kubernetes Engine > Services & Ingress


Deploying Multiple Environments with Terraform


Terraform workspaces are the successor to Terraform environments. Workspaces allow you to separate your state and infrastructure without changing anything in your code. Each workspace should be considered as a separate environment and have its own variables.
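
As a small illustration, the name of the selected workspace is available in the configuration as `terraform.workspace` and can drive per-environment settings. The workspace names `dev` and `prod`, the machine types, and the local value names below are assumptions.

```hcl
# Workspaces are created and selected from the CLI, for example:
#   terraform workspace new dev
#   terraform workspace select dev

locals {
  environment = terraform.workspace

  # Per-environment node machine types (illustrative values).
  node_machine_types = {
    dev  = "e2-medium"
    prod = "e2-standard-4"
  }

  # Fall back to the smallest type for any other workspace.
  node_machine_type = lookup(local.node_machine_types, local.environment, "e2-medium")

  # Suffix resource names with the environment to keep them unique.
  cluster_name = "data-platform-${local.environment}"
}
```

Larger sets of per-environment values are usually easier to keep in one tfvars file per workspace and pass in with `-var-file`.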


CI/CD with GitHub Actions


In this post, we use GitHub Actions for CI/CD. However, there are many other options, such as Jenkins, Atlantis, CircleCI, and Terraform Cloud. We prefer GitHub Actions because it is simpler and a lot of the automation at Data Max is based on it.

For information on how to set up the workflows, see hashicorp/setup-terraform.


To provision infrastructure in multiple environments, different workflows can be triggered on different branches by:

  1. Configuring another Terraform workspace

  2. Passing the appropriate values and credentials


Congratulations, you’ve successfully deployed a GKE cluster using Terraform and installed Airflow and Metabase using Terraform Helm Provider!


Conclusions


The solution described above is just one of many approaches to this problem. Its most notable advantages are fewer errors and the ability to reproduce the same infrastructure with little to no human intervention.



For any questions, feel free to reach out to us at hello@data-max.io. We would love to hear from you.