Azure Batch

The Azure Batch service provides a way to handle embarrassingly parallel machine learning workloads with minimal effort.

What’s embarrassingly parallel?

Parallel computing, a paradigm in which multiple tasks run simultaneously, may contain what is known as an embarrassingly parallel workload or problem (also called perfectly parallel, delightfully parallel or pleasingly parallel). An embarrassingly parallel task is a trivial case: little or no manipulation is needed to separate the problem into a number of parallel tasks. This is typically the case when there is little or no dependency between those parallel tasks, and no need for communication or shared results between them.
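To make this concrete, here is a minimal local sketch (hypothetical, not part of the Azure example below): each unit of work is independent, so it can simply be mapped across worker processes.

# embarrassingly_parallel.py -- minimal local sketch (hypothetical).
from multiprocessing import Pool

def score_scenario(scenario_id):
    # Stand-in for one independent unit of work, e.g. training and
    # scoring one model; no task talks to any other task.
    return scenario_id, scenario_id ** 2

if __name__ == "__main__":
    # Fan the independent tasks out across four worker processes.
    with Pool(processes=4) as pool:
        results = pool.map(score_scenario, range(10))
    print(results)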

What’s Azure Batch?

Use Azure Batch to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently in Azure. Azure Batch creates and manages a pool of compute nodes (virtual machines), installs the applications you want to run, and schedules jobs to run on the nodes. There’s no cluster or job scheduler software to install, manage, or scale. Instead, you use Batch APIs and tools, command-line scripts, or the Azure portal to configure, manage, and monitor your jobs.

See the Azure documentation for more details.

How Azure Batch works.

Set up Azure Batch

Follow the steps in the quickstart: Run a Batch job - Python

Basically, you need to create:

  • A Batch account
  • A linked Azure Storage account
Create a new Batch account with a linked Azure Storage account.
Batch account overview.
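With the two accounts in place, the Batch service can be driven from Python. Below is a minimal sketch modeled on the quickstart; the config attribute names (_BATCH_ACCOUNT_NAME, _BATCH_ACCOUNT_KEY, _BATCH_ACCOUNT_URL) are assumptions, not confirmed repo contents.

# Sketch: authenticate a Batch client with shared-key credentials.
# The config attribute names are assumptions modeled on the quickstart.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

import config

credentials = SharedKeyCredentials(config._BATCH_ACCOUNT_NAME,
                                   config._BATCH_ACCOUNT_KEY)
batch_client = BatchServiceClient(credentials,
                                  batch_url=config._BATCH_ACCOUNT_URL)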

A machine learning example

The code can be cloned with: git clone git@github.com:abinitio888/azure_batch.git

Codebase skeleton.

  • The App folder contains the machine learning code.
  • The InputFiles folder contains the input data for each scenario.
  • batch_sklearn.py uses the Batch API to control the Batch service.
  • config.py contains the secrets and settings for Batch (a sketch follows below).
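config.py itself is not shown in this post; a plausible sketch, with the account attribute names assumed rather than taken from the repo, looks like:

# config.py -- sketch only; the secret values and most attribute names
# below are assumptions (only _APP_PATH and _APP_MAIN_SCRIPT appear in
# the snippets of this post).
_BATCH_ACCOUNT_NAME = "mybatchaccount"
_BATCH_ACCOUNT_KEY = "<batch-account-key>"
_BATCH_ACCOUNT_URL = "https://mybatchaccount.westeurope.batch.azure.com"
_STORAGE_ACCOUNT_NAME = "mystorageaccount"
_STORAGE_ACCOUNT_KEY = "<storage-account-key>"
_APP_PATH = "App"
_APP_MAIN_SCRIPT = "App/main.py"

Keep these values out of version control in a real setup.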

Main application. Under the App folder, the machine learning solution is structured as follows.

# main.py

import argparse

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="App main entry")
    parser.add_argument("input_path", type=str, help="Job input path")
    parser.add_argument("output_path", type=str, help="Job output path")
    args = parser.parse_args()

    # Load one scenario's data; the first column is the index and the
    # last column is the target.
    df = pd.read_csv(args.input_path, index_col=0)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]

    # Score a random forest with 5-fold cross-validation.
    clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                                 min_samples_split=2, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    mean_score = scores.mean()

    # Write the result to the task's output file.
    with open(args.output_path, "w") as fd:
        fd.write(f"The model score: {mean_score}")
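Before handing the script to Batch, it can be smoke-tested locally, e.g. python3.7 App/main.py InputFiles/input_0.csv output.txt (hypothetical file names): each task receives exactly one input path and one output path.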

Install the package dependencies. The package dependencies can be installed in batch_sklearn.py via requirements.txt.

# batch_sklearn.py

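# Build the shell command each task runs on its compute node:
# upgrade pip, install the dependencies from requirements.txt,
# then run the main script on this task's input and output paths.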
command = "/bin/bash -c \""\
    "python3.7 -m pip install --upgrade pip && "\
    f"python3.7 -m pip install -r {config._APP_PATH}/requirements.txt && "\
    f"python3.7 {config._APP_MAIN_SCRIPT} {input_file_path} {output_file_path}"\
    "\""

Run the example

Kick-off script. batch_sklearn.py walks through the steps needed to drive the Azure Batch job.

Workflow.

Inputs to Batch. Both the data (InputFiles) and the code (App) are uploaded to the linked Azure Storage account.

The data and scripts are uploaded to Storage.
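A hedged sketch of that upload step, using the legacy azure-storage-blob (v2.x) BlockBlobService API that the Python quickstart of this era relies on; the "input" container name is an assumption:

# Sketch: upload a local file to the linked Storage account and wrap it
# as a ResourceFile that Batch can download onto a node.
import datetime

import azure.batch.models as batchmodels
from azure.storage.blob import BlockBlobService, BlobPermissions

import config

blob_service = BlockBlobService(account_name=config._STORAGE_ACCOUNT_NAME,
                                account_key=config._STORAGE_ACCOUNT_KEY)
blob_service.create_container("input", fail_on_exist=False)

def upload_as_resource_file(local_path, blob_name):
    # Upload the blob, then grant Batch read access via a short-lived SAS.
    blob_service.create_blob_from_path("input", blob_name, local_path)
    sas_token = blob_service.generate_blob_shared_access_signature(
        "input", blob_name,
        permission=BlobPermissions.READ,
        expiry=datetime.datetime.utcnow() + datetime.timedelta(hours=2))
    sas_url = blob_service.make_blob_url("input", blob_name,
                                         sas_token=sas_token)
    return batchmodels.ResourceFile(http_url=sas_url, file_path=blob_name)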

Pool creation. Once the pool is created, you can check the status of the pool and the node information.

Pool dashboard.
Nodes information.
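A hedged sketch of the pool-creation call, following the Python quickstart; the pool ID, VM size, node count and image details are assumptions, and batch_client is reused from the authentication sketch above:

# Sketch: create a pool of Ubuntu nodes. All concrete values are assumptions.
import azure.batch.models as batchmodels

new_pool = batchmodels.PoolAddParameter(
    id="sklearn-pool",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="Canonical",
            offer="UbuntuServer",
            sku="18.04-LTS",
            version="latest"),
        node_agent_sku_id="batch.node.ubuntu 18.04"))
batch_client.pool.add(new_pool)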

Job creation. Once the job is created, you can check the status of the job and the tasks.

Job dashboard.
Overview of the tasks.
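A hedged sketch of the job creation and task submission, again following the quickstart; the IDs are assumptions and `tasks` is the list sketched earlier:

# Sketch: create the job on the pool, then submit all tasks in one call.
import azure.batch.models as batchmodels

job = batchmodels.JobAddParameter(
    id="sklearn-job",
    pool_info=batchmodels.PoolInformation(pool_id="sklearn-pool"))
batch_client.job.add(job)
batch_client.task.add_collection("sklearn-job", tasks)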

For each task, the outputs contain stderr.txt and stdout.txt.

Details of a task.
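These files can also be fetched programmatically; a hedged sketch using the SDK's file operations, with job and task IDs matching the sketches above:

# Sketch: stream a finished task's stdout.txt back from the node.
stream = batch_client.file.get_from_task("sklearn-job", "Task0",
                                         "stdout.txt")
print(b"".join(stream).decode("utf-8"))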

Outputs from Batch. Once the tasks complete, the outputs are written back to the linked Azure Storage account.

Outputs to Storage.
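One way to land the results in Storage automatically is to declare an output-file rule on each task; a hedged sketch, where the container SAS URL is an assumed, pre-generated value with write permission:

# Sketch: have Batch upload a task's result file to a Storage container
# when the task succeeds; attach via TaskAddParameter(..., output_files=[...]).
import azure.batch.models as batchmodels

output_container_sas_url = "<container-url-with-write-sas>"  # assumption

output_file = batchmodels.OutputFile(
    file_pattern="output_*.txt",
    destination=batchmodels.OutputFileDestination(
        container=batchmodels.OutputFileBlobContainerDestination(
            container_url=output_container_sas_url)),
    upload_options=batchmodels.OutputFileUploadOptions(
        upload_condition=batchmodels.OutputFileUploadCondition.task_success))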

Summary

Choosing Azure compute platforms for container-based applications

Choosing Azure compute platforms.

Azure Container Instances vs AKS vs Azure Batch

Why is Azure Batch recommended over Azure Container Instances and AKS for batch jobs?

One opinion: https://www.linkedin.com/pulse/machine-learning-containers-scale-azure-batch-kubernetes-josh-lane/

Azure Container Instances is a great solution for any scenario that can operate in isolated containers, including simple applications, task automation, and build jobs. For scenarios where you need full container orchestration, including service discovery across multiple containers, automatic scaling, and coordinated application upgrades, we recommend Azure Kubernetes Service (AKS).

  • Use Azure Container Instances to run serverless Docker containers in Azure with simplicity and speed.
  • Deploy to a container instance on-demand when you develop cloud-native apps and you want to switch seamlessly from local development to cloud deployment.