Getting Started with GaiaFlow
This guide will help you set up your environment, install dependencies, and generate your first project using the GaiaFlow template.
The following section provides an overview of the services provided by the Gaiaflow framework. If you are a first-time user, we recommend reading or skimming through it to get an idea of what Gaiaflow currently offers.
If you would like to start right away, skip ahead to the Getting Started section below.
Overview
Gaiaflow provides a standardized project structure for ML initiatives at BC, integrating essential MLOps tools:
- Apache Airflow: For orchestrating ML pipelines and workflows
- MLflow: For experiment tracking and model registry
- JupyterLab: For interactive development and experimentation
- MinIO: For local, S3-compatible object storage of ML artifacts
- Minikube: For running a local, lightweight Kubernetes cluster
MLOps Components
Before you get started, let's explore the tools used in this standardized MLOps framework.
0. Cookiecutter
Purpose: Project scaffolding and template generation
- Provides a standardized way to create ML projects with predefined structures.
- Ensures consistency across different ML projects within BC
1. Apache Airflow
Purpose: Workflow orchestration
- Manages and schedules data pipelines.
- Automates end-to-end ML workflows, including data ingestion, training, deployment and re-training.
- Provides a user-friendly web interface for tracking the status of task execution.
Key Concepts in Airflow
DAG (Directed Acyclic Graph)
A DAG is a collection of tasks organized in a structure that reflects their execution order. DAGs do not allow for loops, ensuring deterministic scheduling and execution.
Task
A Task represents a single unit of work within a DAG. Each task is an instance of an Operator. Gaiaflow provides `task_factory` to make defining tasks in a DAG easier.
Operator
Operators define the type of work to be done. They are templates that encapsulate logic.
Common Operators:
- `PythonOperator`: Executes a Python function.
- `BashOperator`: Executes bash commands.
- `KubernetesPodOperator`: Executes code inside a Kubernetes pod.
- `DummyOperator`: No operation; used for DAG design.
To make Airflow easier to use, we provide a wrapper, `task_factory`, that lets you create these tasks without worrying about which operator to use. Read more in the Task Factory section below.
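To make the DAG, task, and operator concepts concrete, here is a minimal sketch of a DAG built with a plain `PythonOperator` (assuming a recent Airflow 2.x; the DAG id and function are made up for illustration). In a Gaiaflow project you would normally define such tasks through `task_factory` instead.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello(name: str) -> str:
    # The return value is automatically pushed to XCom by the PythonOperator.
    return f"Hello, {name}!"


with DAG(
    dag_id="example_hello_dag",       # unique DAG identifier
    start_date=datetime(2024, 1, 1),  # first logical date the scheduler considers
    schedule="@daily",                # how often the scheduler triggers a run
    catchup=False,                    # do not backfill past runs
) as dag:
    hello = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
        op_kwargs={"name": "Gaiaflow"},
    )
```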
Scheduler
The scheduler is responsible for triggering DAG runs based on a schedule. It evaluates DAGs, resolves dependencies, and queues tasks for execution.
XCom (Cross-Communication)
A lightweight mechanism for passing small data between tasks. Data is stored in Airflow’s metadata DB and fetched using Jinja templates or Python.
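For illustration, here is a hedged sketch of both ways to fetch an XCom value (standard Airflow 2.x API; the upstream task id `produce` is made up):

```python
# Pulling an XCom value from Python, inside a task that receives the Airflow context.
def consume(**context):
    # Fetch the value returned by the upstream "produce" task (its return_value XCom).
    value = context["ti"].xcom_pull(task_ids="produce", key="return_value")
    print(value)

# The same value can be pulled in templated operator fields via Jinja, e.g.:
#   "{{ ti.xcom_pull(task_ids='produce') }}"
```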
Airflow UI
- DAGs (Directed Acyclic Graphs): A workflow representation in Airflow. You can enable, disable, and trigger DAGs from the UI.
- Graph View: Visual representation of task dependencies.
- Tree View: Displays DAG execution history over time.
- Task Instance: A single execution of a task in a DAG.
- Logs: Each task's execution details and errors.
- Code View: Shows the Python code of a DAG.
- Trigger DAG: Manually start a DAG run.
- Pause DAG: Stops automatic DAG execution.
Common Actions
- Enable a DAG: Toggle the On/Off button.
- Manually trigger a DAG: Click Trigger DAG ▶️.
- View logs: Click on a task instance and select Logs.
- Restart a failed task: Click Clear to rerun a specific task.
2. MLflow
Purpose: Experiment tracking and model management
- Tracks and records machine learning experiments, including hyperparameters, performance metrics, and model artifacts.
- Facilitates model versioning and reproducibility.
- Supports multiple deployment targets, including cloud platforms, Kubernetes, and on-premises environments.
Core Components
Tracking
Allows logging of metrics, parameters, artifacts, and models for every experiment.
Models
MLflow models are saved in a standard format that supports deployment to various serving platforms.
Model Registry
Central hub for managing ML models where one can register and version models.
MLflow UI
- Experiments: Group of runs tracking different versions of ML models.
- Runs: A single execution of an ML experiment with logged parameters, metrics, and artifacts.
- Parameters: Hyperparameters or inputs logged during training.
- Metrics: Performance indicators like accuracy or loss.
- Artifacts: Files such as models, logs, or plots.
- Model Registry: Centralized storage for trained models with versioning.
Common Actions
- View experiment runs: Go to Experiments > Select an experiment
- Compare runs: Select multiple runs and click Compare.
- View parameters and metrics: Click on a run to see details.
- Register a model: Under Artifacts, select a model and click Register Model.
For a quick MLflow tutorial, see more here.
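As a minimal sketch of what a tracked run looks like with the standard MLflow Python API (the tracking URI assumes the local MLflow service on port 5000 from the MlopsManager setup below; the experiment name, parameters, and metric values are placeholders):

```python
import mlflow

# Placeholder URI: with the local Gaiaflow stack, MLflow is typically served on port 5000.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("demo-experiment")

with mlflow.start_run(run_name="baseline"):
    # Log hyperparameters and metrics for this run.
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.93)
    # Log an arbitrary file as an artifact (stored in the configured artifact store,
    # e.g. MinIO, behind the scenes).
    with open("notes.txt", "w") as f:
        f.write("example artifact")
    mlflow.log_artifact("notes.txt")
```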
3. JupyterLab
Purpose: Interactive development environment
- Provides an intuitive and interactive web-based interface for exploratory data analysis, visualization, and model development.
4. MinIO
Purpose: Object storage for ML artifacts
- Acts as a cloud-native storage solution for datasets and models.
- Provides an S3-compatible API for seamless integration with ML tools.
- Suitable for local development iterations using a subset of the data
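Because MinIO speaks the S3 API, the usual S3 clients work against it. A hedged sketch using boto3 (the endpoint assumes MinIO's default S3 API port 9000 from the MlopsManager table below; the credentials and bucket name are placeholders, and the bucket is assumed to already exist):

```python
import boto3

# Placeholder endpoint and credentials; use the values configured for your local MinIO.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # MinIO S3 API
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

# Upload a local file into an existing bucket, then count the bucket's objects.
s3.upload_file("model.pkl", "ml-artifacts", "models/model.pkl")
print(s3.list_objects_v2(Bucket="ml-artifacts").get("KeyCount", 0))
```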
5. Minikube
Purpose: Local Kubernetes cluster for development & testing
- Allows you to run a single-node Kubernetes cluster locally.
- Simulates a production-like environment to test Airflow DAGs end-to-end.
- Great for validating KubernetesExecutor and Dockerized task behavior before deploying to a real cluster.
- Mimics production deployment without the cost or risk of real cloud infrastructure.
6. Task Factory
Purpose
Since this template is mainly aimed at Python packages, and to make adopting Airflow easier, we provide `task_factory`, a unified task generator that abstracts away the complexity of choosing the right Airflow operator for the deployment environment (`dev`, `prod_local`, or `prod`). Its goal is to:
- Simplify task definitions
- Enable environment switching (`dev`, `prod`, `prod_local`)
- Unify local Python-based execution and Kubernetes-based execution
- Support XCom pulling, secret injection, and environment variables
Users therefore use `task_factory` to create tasks in an Airflow DAG.
NOTE: This does not restrict you from using a different operator of your choice, but you would need to make sure that it works in both `dev` and `prod` mode.
Task factory parameters
Parameter | Type | Description |
---|---|---|
`task_id` | `str` | Unique task identifier in all modes |
`func_path` | `str` | Module path to the function (e.g., `my.module:func`) in all modes |
`func_kwargs` | `dict` | Function arguments in all modes |
`image` | `str` | Docker image to use in `prod`/`prod_local` mode |
`env` | `str` | One of `dev`, `prod`, `prod_local` |
`xcom_push` | `bool` | Whether to push the result to XCom in all modes |
`xcom_pull_tasks` | `dict` | Tasks and keys to pull from XCom in all modes |
`secrets` | `list` | Kubernetes secret names in `prod`/`prod_local` mode |
`env_vars` | `dict` | Extra environment variables in `prod`/`prod_local` mode |
`retries` | `int` | Number of times the task is retried; default is 2. In all modes |
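As a hedged sketch of how these parameters fit together in a DAG (the import path of `task_factory`, the module paths, and the return type are assumptions based on the generated project layout; it is assumed here that `task_factory` returns an operator instance so tasks can be chained with `>>`):

```python
from datetime import datetime

from airflow import DAG

# Illustrative import; the actual location of task_factory depends on the generated project.
from dags.task_factory import task_factory

with DAG(dag_id="training_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    preprocess = task_factory(
        task_id="preprocess_data",
        func_path="my_package.preprocess:run",   # placeholder module path
        func_kwargs={"input_path": "s3://raw-data/sample.parquet"},
        env="dev",          # PythonOperator under the hood; image/secrets/env_vars ignored
        xcom_push=True,     # make the return value available to downstream tasks
    )

    train = task_factory(
        task_id="train_model",
        func_path="my_package.train:run",        # placeholder module path
        env="dev",
        xcom_pull_tasks={
            "preprocessed_path": {"task": "preprocess_data", "key": "return_value"},
        },
    )

    preprocess >> train
```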
Behavior by Environment:
`dev` mode:
- Local development mode.
- Uses `PythonOperator`.
- Fastest for iteration.
- Only `task_id`, `func_path`, `func_kwargs`, and `env` are used; `image`, `secrets`, and `env_vars` are ignored.

`prod_local` mode:
- Simulates production on your machine using Minikube + `KubernetesPodOperator`.
- Requires:
  - `image`: Docker image (built locally from your project).
  - `secrets`: For injecting secure credentials (e.g., to connect to MinIO or databases).
  - `env_vars`: Environment-specific config (e.g., MLflow and S3 URIs).
- You must run `MinikubeManager` beforehand.

`prod` mode:
- Uses `KubernetesPodOperator`.
- Production environment in an AWS Kubernetes cluster.
- Same requirements as `prod_local`, but deployed to remote infrastructure.
Use cases for environments:
Environment | Use Case |
---|---|
`dev` | Local testing with `PythonOperator` (no Docker or Kubernetes needed) |
`prod_local` | Testing production-like behavior on your machine via Minikube with `KubernetesPodOperator` |
`prod` | Full production environment on Kubernetes with `KubernetesPodOperator` |
Supported Operators:
Operator | Used In | Purpose |
---|---|---|
`PythonOperator` | `dev` | Runs functions directly in the scheduler |
`KubernetesPodOperator` | `prod_local`, `prod` | Runs tasks in isolated Docker containers via Kubernetes/Minikube |
`secrets`
- Used only in `prod_local` and `prod` modes.
- Reference to Kubernetes secrets.
- Passed to `KubernetesPodOperator` to inject secure credentials (e.g., MinIO access keys).
- Defined in your Kubernetes environment (Minikube or production cluster).
- Injected as environment variables using `env_from`.
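Inside your task's code, the injected secrets simply show up as environment variables. A hedged sketch (the variable names are placeholders and depend on the keys stored in the Kubernetes secret):

```python
import os

# Placeholder names; they correspond to the keys defined in the referenced Kubernetes secret.
minio_access_key = os.environ["AWS_ACCESS_KEY_ID"]
minio_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
```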
`env_vars`
- Used only in `prod_local` and `prod` modes.
- Dictionary of environment variables needed at runtime.
- Used to connect to services like:
  - MLflow Tracking Server (`MLFLOW_TRACKING_URI`)
  - MinIO or S3-compatible storage (`MLFLOW_S3_ENDPOINT_URL`)
- In `prod_local`, these point to your local Minikube (e.g., `192.168.49.1`).
- In `prod`, they point to production service URLs.
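Putting `image`, `secrets`, and `env_vars` together, a hedged sketch of a `prod_local` task definition could look like this (the image name, secret name, ports, and URLs are placeholders; the exact `task_factory` signature is defined by the generated project):

```python
# Illustrative import, as in the earlier sketch; placeholder values throughout.
from dags.task_factory import task_factory

# (Defined inside a `with DAG(...)` block, as in the earlier sketch.)
train = task_factory(
    task_id="train_model",
    func_path="my_package.train:run",
    env="prod_local",                      # runs via KubernetesPodOperator on Minikube
    image="my-project/my-package:0.1.0",   # image built locally into the Minikube context
    secrets=["minio-creds"],               # Kubernetes secret injected via env_from
    env_vars={
        "MLFLOW_TRACKING_URI": "http://192.168.49.1:5000",    # Minikube host address
        "MLFLOW_S3_ENDPOINT_URL": "http://192.168.49.1:9000",
    },
)
```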
XCom Support (For all modes)
You can pass results between tasks using `xcom_pull_tasks`, regardless of mode. It pulls data from the output (`return_value`) of a previous task using Airflow's XCom (Cross-Communication) mechanism.
NOTE:
- In `dev`, values are pulled directly from Airflow's XCom system.
- In `prod`/`prod_local`, values are injected into the pod as environment variables using templated Jinja strings and parsed at runtime by your application.
Structure:
xcom_pull_tasks={
"argument_name": {
"task": "<task_group.task_id>" | "<task_id>", # Fully qualified task name. include task group if any
"key": "return_value", # Optional, defaults to return_value
},
}
- The keys in the dictionary become arguments passed to your function.
- The value is a reference to another task's output (via XCom).
- It works in all modes (dev, prod_local, prod), because task_factory handles the internals.
Example:
xcom_pull_tasks={
"preprocessed_path": {
"task": "Trainer.preprocess_data",
"key": "return_value"
}
}
This pulls the return value of `preprocess_data` from the `Trainer` task group and passes it as `preprocessed_path` to the function that is invoked in the task.
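On the receiving side, the keys of `xcom_pull_tasks` simply become keyword arguments of your function. A hedged sketch (module, function, and paths are placeholders):

```python
# my_package/train.py (placeholder module)
def run(preprocessed_path: str) -> str:
    """Train a model using the path produced by the upstream preprocess_data task."""
    print(f"Training with data at {preprocessed_path}")
    return "s3://ml-artifacts/models/latest"   # the return value is pushed to XCom again
```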
Architecture
The Gaiaflow environment provides a local-first MLOps setup, enabling rapid development and production-like testing of Airflow DAGs, along with scaffolding for your Python package.
After you develop your Python package, you can create a reproducible and scalable workflow using Airflow, which is provided via the Gaiaflow framework.
To make it easier to create DAGs, we provide `task_factory`, an abstraction over the Airflow operators that lets you create DAGs in a more user-friendly way and switch between environments like `dev` and `prod_local` to test your DAGs seamlessly. You can read more about `task_factory` in the Task Factory section above.
The following managers are provided to help set up the Gaiaflow framework on your machine.
Service | Description |
---|---|
`MlopsManager` | Spins up MLOps services (Airflow, MLflow, etc.) for local development. |
`MinikubeManager` | Spins up a local Kubernetes cluster for testing DAGs in a production-like environment before deploying. |
The Gaiaflow framework can run in three modes once your Python package is ready to be deployed as a workflow:
Local development mode, when `task_factory`'s `ENVIRONMENT` is set to `dev`
In this mode, you can access the Airflow, MLflow and MinIO services via `localhost`.
Any changes you make to the DAGs that use your package, or to the package itself, are promptly reflected in the Airflow UI, allowing a faster iterative development cycle.
For a reproducible and shareable ML project, we also provide MLflow, where you can log your experiments and models.
To provide the same API as S3 while avoiding egress costs during development and testing, we provide MinIO, which acts as local object storage.
Local production-like environment for testing, when `task_factory`'s `ENVIRONMENT` is set to `prod_local`
The same services are available as in `dev` mode. But to test your workflow in a production-like environment, we use Minikube, a local lightweight cluster that runs the tasks of your DAGs in a Kubernetes environment (which is how DAGs run in production).
Pushing DAGs to production, when `task_factory`'s `ENVIRONMENT` is set to `prod`
In this mode, the local services are no longer used; instead, a production Airflow, MLflow and S3 are used once you create a release of your project.
As shown in the diagram below, creating a release triggers a CI workflow that allows the CDR (Centralized DAG Repository) to pull your DAGs. The Airflow instance running on AWS is git-synced with this repository, which makes your DAGs visible in the UI.
To allow your package to run in any environment, we create Docker images that are pushed to ECR and then used by the Airflow tasks.
MlopsManager
Manages local MLOps components for day-to-day development and experimentation.
Service | Description | Default Port(s) |
---|---|---|
Airflow | DAG orchestration & scheduling platform | 8080 (Web UI) |
MLflow | ML lifecycle management (tracking, models) | 5000 (UI & API) |
MinIO | Object storage (S3-compatible) | 9000 (S3 API), 9001 (UI) |
When to Use It
Use `MlopsManager` when:
- You are testing Jupyter notebook code with MLflow/MinIO.
- You're iterating on pipeline code after creating your software package.
- You want to test Airflow DAGs, Jupyter notebooks, or other services locally.
- You need fast feedback loops before deploying to Minikube or production.
- You are using `task_factory` in `dev` mode.
CLI Reference
python gaiaflow_manager.py [OPTIONS]
Flag / Option | Description | Example |
---|---|---|
`--start` | Start selected services | `--start --service airflow` |
`--stop` | Stop selected services | `--stop --service mlflow` |
`--restart` | Restart selected services | `--restart --service airflow` |
`--service` | Choose service: `airflow`, `mlflow`, `jupyter` | `--service jupyter` |
`-c`, `--cache` | Use Docker cache during image builds | `--start -c` |
`-j`, `--jupyter-port` | Port to expose Jupyter Notebook (default: 8895) | `--start -j 8896` |
`-v`, `--delete-volume` | Delete Docker volumes on shutdown | `--stop -v` |
`-b`, `--docker-build` | Force rebuild Docker image | `--start -b` |
MinikubeManager
Used to create a lightweight local Kubernetes cluster that simulates production, for validating DAGs using KubernetesExecutor or custom pods.
Component | Description |
---|---|
Minikube | Local, single-node Kubernetes cluster |
When to Use It
Use `MinikubeManager` when:
- You want to test Docker images and Airflow tasks in a local Kubernetes environment.
- You're preparing for deployment to production Airflow on AWS.
- You're using `task_factory` in `prod_local` mode.
CLI Reference
python minikube_manager.py [OPTIONS]
Flag / Option | Description | Example |
---|---|---|
`--start` | Start the Minikube cluster | `--start` |
`-s`, `--stop` | Stop the Minikube cluster | `-s` |
`-r`, `--restart` | Restart the Minikube cluster | `-r` |
`--build-only` | Only build the Docker image inside the Minikube context | `--build-only` |
`--create-config-only` | Generate inline config for use in Docker Compose | `--create-config-only` |
`--create-secrets` | (might be removed) Create Kubernetes secrets for use in pods | `--create-secrets` |
Managing Secrets (TODO: talk with Tejas)
You can now create any secrets dynamically:
create_secrets(secret_name="my-creds", secret_data={"API_KEY": "1234", "ENV": "dev"})
This creates Kubernetes secrets inside the Minikube cluster.
Best Practice: Use environment variables or secure vaults in production. Avoid hardcoding secrets.
Image Naming Best Practices (TODO: talk with Tejas)
We recommend using:
<project-name>/<package-name>:<version>
Where `version` is fetched from your Python package's `__version__`.
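For example, a hedged sketch of building such a tag from the package's `__version__` (the package and project names are placeholders):

```python
# Placeholder package and project names; the version is read from the package itself.
import my_package

image_tag = f"my-project/my-package:{my_package.__version__}"
print(image_tag)   # e.g. my-project/my-package:0.1.0
```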
Getting Started
Please make sure that you install the following from the links provided, as they have been tried and tested.
If you face any issues, please check out the Troubleshooting section below.
Prerequisites
- Docker and Docker Compose
- Mamba (please make sure you install Python 3.12, as this repository has been tested with that version)
- Minikube on Linux
- Minikube on Windows
Docker and Docker compose plugin Installation
For Linux users: please follow the steps mentioned in this link
For Windows users: please follow the steps mentioned in this link
This should install both Docker and the Docker Compose plugin. You can verify the installation with these commands:
docker --version
docker compose version
which should print something like:
Docker version 27.5.1, build 9f9e405
Docker Compose version v2.32.4
Once the prerequisites are done, you can go ahead with project creation:
- Create a separate environment for cookiecutter:
  mamba create -n cc cookiecutter ruamel.yaml
  mamba activate cc
- Generate the project from the template:
  cookiecutter https://github.com/bcdev/gaiaflow
When prompted for input, enter the details requested. If you don't provide any input for a given choice, the first choice from the list is taken as the default.
Once the project is created, please follow along here.
Troubleshooting
- If you are on Windows, please use the Miniforge Prompt command line.
- If you face an issue like `Docker Daemon not started`, start it using:
  sudo systemctl start docker
  and try the docker commands again in a new terminal.
- If you face an issue like `Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock`, do the following:
  sudo chmod 666 /var/run/docker.sock
  and try the docker commands again in a new terminal.
- If you face an issue like `Cannot connect to the Docker daemon at unix:///home//.docker/desktop/docker.sock. Is the docker daemon running?`, it is likely because you have two Docker contexts running.
  To view the Docker contexts:
  docker context ls
  docker context show
  To use the default context, do this:
  docker context use default
  Check the following file:
  cat ~/.docker/config.json
  {
  "auths": {},
  "credsStore": "desktop"
  }
- If you face permission issues on some files, like `Permission Denied`, as a workaround please use the following and let us know so that we can update this repo:
  sudo chmod 666 <your-filename>
If you face any other problems not mentioned above, please reach out to us.