Introduction to MLflow APIs for Experiment Tracking
MLflow is an open-source platform designed to manage the end-to-end machine learning lifecycle. In this section, we introduce several key MLflow APIs that you can use to set up experiment tracking, log runs and models, and load logged models for inference. For more details, refer to the official MLflow documentation (https://mlflow.org/docs/latest/).
This notebook is a starting point for using MLflow to log your experiments and artifacts.
Before starting this notebook, make sure the required services are running: the MLflow tracking server and the MinIO object store that holds the data and artifacts.
Setting Up the Tracking Environment
Before you begin logging experiments and models, you need to tell MLflow which tracking server to use. MLflow uses a tracking URI to know where to record experiment data such as metrics, parameters, and artifacts.
======================================================
mlflow.set_tracking_uri(uri)
What it does: Configures the tracking server’s URI (this can be a local directory, a remote HTTP endpoint, or a database URI).
When to use: Call this at the very start of your script to ensure all logs go to the correct destination.
======================================================
mlflow.get_tracking_uri()
What it does: Retrieves the currently configured tracking URI, which is useful for debugging or verifying your setup.
======================================================
Managing Experiments
Experiments group related runs, making it easier to organize and compare different model training sessions.
======================================================
mlflow.create_experiment(experiment_name)
What it does: Explicitly creates a new experiment and returns its unique ID. It raises an error if an experiment with that name already exists.
When to use: When you want to programmatically create an experiment without necessarily switching to it immediately.
======================================================
mlflow.set_experiment(experiment_name)
What it does: Sets the active experiment for subsequent runs. You can identify the experiment by name or by ID (`experiment_id=...`). If an experiment with the given name doesn't already exist, it is created automatically.
When to use: At the beginning of your experiment code to ensure all subsequent runs are recorded under the correct experiment.
======================================================
Logging Runs and Enabling Auto-Logging
Once your tracking environment and experiment are set, you can start logging runs.
======================================================
mlflow.start_run()
What it does: Starts a new MLflow run and returns an ActiveRun object. It can be used as a context manager so that the run is automatically ended when the block finishes.
When to use: Wrap your training (or any experimental) code in a `with mlflow.start_run():` block to log parameters, metrics, and artifacts to that run.
Note: You can start a run inside another active run by passing `nested=True` to the inner `start_run`. This creates a parent-child relationship that groups related runs, which is useful for hyperparameter tuning (one child run per combination under a single parent).
======================================================
mlflow.log_param(key, value)
Log a single parameter as a key-value pair. Parameters are typically hyperparameters or configuration values. Example:
with mlflow.start_run():
    # s3_path: location of your dataset (e.g. an S3 object key), defined elsewhere
    mlflow.log_param(key="data_source", value=s3_path)
======================================================
mlflow.log_params(params_dict)
Log multiple parameters at once by passing a dictionary. Example:
params = {"learning_rate": 0.01, "num_layers": 3, "batch_size": 32}
with mlflow.start_run():
    mlflow.log_params(params)
======================================================
mlflow.autolog()
What it does: Automatically logs parameters, metrics, and models from supported machine learning libraries such as scikit-learn, TensorFlow/Keras, and PyTorch.
When to use: Call it at the beginning of your training code, before any training starts, to minimize manual logging.
======================================================
mlflow.<framework>.log_model()
What it does: Logs a trained model as an artifact. Replace <framework> with the specific module (e.g., sklearn, tensorflow, or pytorch).
When to use: After you’ve trained your model, call this function to save the model so that it can later be served or compared.
======================================================
Loading Models
======================================================
mlflow.pyfunc.load_model()
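What it does: Loads a logged model from a URI (for example `runs:/<run_id>/<artifact_path>`) and wraps it in MLflow's generic "pyfunc" interface, so inference is a single `predict()` call regardless of the library that trained it.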
When to use: When you need to load a logged model for making predictions.
======================================================
Custom Python Model
======================================================
What it is: Most of the time, a model needs some preprocessing before prediction and/or postprocessing after it. A model logged with a built-in flavor contains only the model itself. To bundle pre- and postprocessing steps with the model, you can write a custom Python model (a subclass of `mlflow.pyfunc.PythonModel`) and log the whole pipeline. To learn more, see: https://mlflow.org/docs/latest/traditional-ml/creating-custom-pyfunc/index.html
======================================================
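Below, we apply these concepts to an example ML project: MNIST data is preprocessed and stored in MinIO, then a small CNN is trained with a hyperparameter search tracked in MLflow.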
import os
import itertools
from datetime import datetime
import boto3
import mlflow
import numpy as np
from dotenv import load_dotenv
import tensorflow.keras as keras
load_dotenv()
True
s3 = boto3.client(
    "s3",
    endpoint_url=os.getenv("J_MLFLOW_S3_ENDPOINT_URL"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY")
)
def get_or_create_experiment(experiment_name):
    """
    Retrieve the ID of an existing MLflow experiment or create a new one if it doesn't exist.

    This function checks if an experiment with the given name exists within MLflow.
    If it does, the function returns its ID. If not, it creates a new experiment
    with the provided name and returns its ID.
    Taken from mlflow.org

    Parameters:
    - experiment_name (str): Name of the MLflow experiment.

    Returns:
    - str: ID of the existing or newly created MLflow experiment.
    """
    if experiment := mlflow.get_experiment_by_name(experiment_name):
        return experiment.experiment_id
    else:
        return mlflow.create_experiment(experiment_name)
def get_latest_data_path(
    s3_client,  # a boto3 S3 client
    bucket_name: str,
    base_folder: str = 'preprocessing'
) -> tuple[str, str]:
    """
    Find the latest timestamp folder and NPZ file in the specified bucket/folder.
    Returns a tuple of (object_key, filename).
    """
    # List the "sub-folders" (common prefixes) directly under base_folder/
    response = s3_client.list_objects_v2(
        Bucket=bucket_name,
        Prefix=f"{base_folder}/",
        Delimiter='/'
    )
    timestamps = []
    for prefix in response.get('CommonPrefixes', []):
        folder_name = prefix['Prefix'].strip('/')
        timestamp = folder_name.replace(f"{base_folder}/", '')
        try:
            # Skip folders whose names are not timestamps written by preprocess_and_store()
            datetime.strptime(timestamp, "%Y%m%d-%H%M%S")
        except ValueError:
            continue
        timestamps.append(timestamp)
    if not timestamps:
        raise ValueError("No timestamp folders found")
    # The fixed-width %Y%m%d-%H%M%S format sorts chronologically as plain strings
    latest_timestamp = sorted(timestamps)[-1]
    latest_folder = f"{base_folder}/{latest_timestamp}"
    response = s3_client.list_objects_v2(
        Bucket=bucket_name,
        Prefix=f"{latest_folder}/"
    )
    npz_files = [
        obj['Key'] for obj in response.get('Contents', [])
        if obj['Key'].endswith('.npz')
    ]
    if not npz_files:
        raise ValueError(f"No NPZ files found in {latest_folder}")
    latest_file = npz_files[0]
    return latest_file, latest_file.split('/')[-1]
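`get_latest_data_path` picks the newest folder with a plain string sort. That is only safe because the `%Y%m%d-%H%M%S` format used by `preprocess_and_store` is fixed-width and ordered from the most to the least significant field. A quick check with hypothetical folder names:

```python
from datetime import datetime

# Hypothetical folder names in the "%Y%m%d-%H%M%S" format written by preprocess_and_store()
folders = ["20250211-153556", "20241231-235959", "20250211-090000"]

# Lexicographic order of zero-padded timestamps matches chronological order.
latest_by_string = sorted(folders)[-1]
latest_by_time = max(folders, key=lambda s: datetime.strptime(s, "%Y%m%d-%H%M%S"))

print(latest_by_string, latest_by_time)
```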
def preprocess_and_store():
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    # Load MNIST, scale pixels to [0, 1] and add a channel axis: (N, 28, 28) -> (N, 28, 28, 1)
    (X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
    X_train = X_train.astype('float32') / 255.0
    X_test = X_test.astype('float32') / 255.0
    X_train = np.expand_dims(X_train, axis=-1)
    X_test = np.expand_dims(X_test, axis=-1)
    local_path = f"/tmp/mnist_processed_{timestamp}.npz"
    np.savez_compressed(local_path,
                        X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test)
    bucket_name = "mnist-data"
    object_path = f"preprocessing/{timestamp}/mnist_processed.npz"
    try:
        s3.head_bucket(Bucket=bucket_name)
    except s3.exceptions.ClientError:
        # head_bucket raises ClientError (e.g. a 404) when the bucket does not exist
        print(f"Bucket: {bucket_name} does not exist, creating one now!")
        s3.create_bucket(Bucket=bucket_name)
    s3.upload_file(local_path, bucket_name, object_path)
    os.remove(local_path)
    print(f"Preprocessed data stored to MinIO: {object_path}")
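`preprocess_and_store` and `train_mnist` communicate through a compressed `.npz` archive. A small round trip with stand-in arrays (hypothetical shapes and values) shows that the keyword names passed to `np.savez_compressed` become the keys used to read the arrays back, as in `data['X_train']`.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in arrays shaped like the MNIST tensors (N, 28, 28, 1)
X_train = np.zeros((4, 28, 28, 1), dtype="float32")
y_train = np.array([0, 1, 2, 3])

path = os.path.join(tempfile.mkdtemp(), "mnist_processed.npz")
np.savez_compressed(path, X_train=X_train, y_train=y_train)

# Keyword names become the archive keys; the context manager closes the file.
with np.load(path) as data:
    keys = sorted(data.files)
    restored = data["X_train"]
```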
def train_mnist():
    bucket_name = "mnist-data"
    base_folder = "preprocessing"
    # Download the most recent preprocessed dataset from MinIO
    s3_path, filename = get_latest_data_path(s3, bucket_name=bucket_name,
                                             base_folder=base_folder)
    local_path = "/tmp"
    local_file = f"{local_path}/{filename}"
    s3.download_file(bucket_name, s3_path, local_file)
    data = np.load(local_file)
    X_train, y_train = data['X_train'], data['y_train']
    X_test, y_test = data['X_test'], data['y_test']
    # One-hot encode labels for categorical_crossentropy
    y_train = keras.utils.to_categorical(y_train, 10)
    y_test = keras.utils.to_categorical(y_test, 10)

    mlflow.set_tracking_uri(os.getenv("J_MLFLOW_TRACKING_URI"))
    experiment_id = get_or_create_experiment("MNIST_Hyperparameter_Search_autolog")
    mlflow.set_experiment(experiment_id=experiment_id)

    best_accuracy = 0
    best_model = None
    best_params = {}
    HYPERPARAM_GRID = {
        'epochs': [1, 2]
    }
    # Expand the grid into a list of dicts, one per combination
    keys, values = zip(*HYPERPARAM_GRID.items())
    param_combinations = [dict(zip(keys, v)) for v in
                          itertools.product(*values)]

    # Autolog params, metrics and a model for every child run
    mlflow.autolog()
    with mlflow.start_run(run_name="mnist-hyperparameter-tuning-parent"):
        for params in param_combinations:
            # One nested child run per hyperparameter combination
            with mlflow.start_run(nested=True):
                model = keras.Sequential([
                    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
                    keras.layers.MaxPooling2D((2, 2)),
                    keras.layers.Flatten(),
                    keras.layers.Dense(128, activation='relu'),
                    keras.layers.Dense(10, activation='softmax')
                ])
                optimizer = keras.optimizers.Adam(learning_rate=0.001)
                model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
                history = model.fit(
                    X_train,
                    y_train,
                    epochs=params['epochs'],
                    validation_data=(X_test, y_test),
                )
                val_acc = history.history['val_accuracy'][-1]
                if val_acc > best_accuracy:
                    best_accuracy = val_acc
                    best_model = model
                    best_params = params

        # After the search, log the best model once, to the parent run
        if best_model is not None:
            artifact_path = "mnist_model_autolog"
            mlflow.tensorflow.log_model(best_model, artifact_path)
            model_uri = mlflow.get_artifact_uri(artifact_path)
            print("Model stored at ", model_uri)
            print("Best Params ", best_params)
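Inside `train_mnist`, `HYPERPARAM_GRID` is expanded with `itertools.product`, and each resulting dictionary gets its own nested child run. With a single key the grid is trivial; the sketch below uses a hypothetical second key, `learning_rate`, to show that the number of child runs is the product of the list lengths. Using it in training would also require passing `params['learning_rate']` to the optimizer, which the notebook does not do.

```python
import itertools

# Hypothetical larger grid; train_mnist() only searches over 'epochs'.
HYPERPARAM_GRID = {
    "epochs": [1, 2],
    "learning_rate": [1e-3, 1e-4],
}

# Same expansion as in train_mnist(): one dict per combination.
keys, values = zip(*HYPERPARAM_GRID.items())
param_combinations = [dict(zip(keys, v)) for v in itertools.product(*values)]

for params in param_combinations:
    print(params)
```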
preprocess_and_store()
Preprocessed data stored to MinIO: preprocessing/20250211-153556/mnist_processed.npz
# MLflow reads MLFLOW_S3_ENDPOINT_URL to upload artifacts to MinIO instead of AWS S3
os.environ["MLFLOW_S3_ENDPOINT_URL"] = os.getenv("J_MLFLOW_S3_ENDPOINT_URL")
os.getenv("MLFLOW_S3_ENDPOINT_URL")
'http://localhost:9000'
train_mnist()
2025/02/11 15:36:07 INFO mlflow.bedrock: Enabled auto-tracing for Bedrock. Note that MLflow can only trace boto3 service clients that are created after this call. If you have already created one, please recreate the client by calling `boto3.client`.
2025/02/11 15:36:07 INFO mlflow.tracking.fluent: Autologging successfully enabled for boto3.
2025/02/11 15:36:07 INFO mlflow.tracking.fluent: Autologging successfully enabled for keras.
2025/02/11 15:36:08 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2025/02/11 15:36:08 INFO mlflow.tracking.fluent: Autologging successfully enabled for tensorflow.
/home/yogesh/miniforge3/envs/mlops/lib/python3.12/site-packages/keras/src/layers/convolutional/base_conv.py:107: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.9116 - loss: 0.2991
WARNING:absl:You are saving your model as an HDF5 file via `model.save()` or `keras.saving.save_model(model)`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')` or `keras.saving.save_model(model, 'my_model.keras')`.
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 15s 8ms/step - accuracy: 0.9116 - loss: 0.2990 - val_accuracy: 0.9809 - val_loss: 0.0588
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
2025/02/11 15:36:30 WARNING mlflow.tensorflow: You are saving a TensorFlow Core model or Keras model without a signature. Inference with mlflow.pyfunc.spark_udf() will not work unless the model's pyfunc representation accepts pandas DataFrames as inference inputs.
🏃 View run beautiful-shrike-382 at: http://localhost:5000/#/experiments/1/runs/be9d32fc4943459b97cfc47bd616cb7b
🧪 View experiment at: http://localhost:5000/#/experiments/1
2025/02/11 15:36:34 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.
Model stored at s3://mlflow/1/2b0154bbb08246f4810b67f4992cfbf0/artifacts/mnist_model_autolog
Epoch 1/2
1872/1875 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.9130 - loss: 0.2876
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9131 - loss: 0.2873 - val_accuracy: 0.9783 - val_loss: 0.0638
Epoch 2/2
1874/1875 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.9844 - loss: 0.0521
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 14s 7ms/step - accuracy: 0.9844 - loss: 0.0521 - val_accuracy: 0.9863 - val_loss: 0.0439
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 59ms/step
🏃 View run upset-zebra-240 at: http://localhost:5000/#/experiments/1/runs/baa90d4a50204344b9e821b3c46328b4
🧪 View experiment at: http://localhost:5000/#/experiments/1
Model stored at s3://mlflow/1/2b0154bbb08246f4810b67f4992cfbf0/artifacts/mnist_model_autolog
🏃 View run mnist-hyperparameter-tuning-parent at: http://localhost:5000/#/experiments/1/runs/2b0154bbb08246f4810b67f4992cfbf0
🧪 View experiment at: http://localhost:5000/#/experiments/1