How to Benchmark Classical Machine Learning Workloads on Google Cloud

Machine Learning Still Matters

In an era of GPU supremacy, why do real-world business cases still depend so heavily on classical machine learning and CPU-based training? The answer is that the data most important to real-world business applications is still overwhelmingly tabular, structured, and relational—think fraud detection, insurance risk scoring, churn prediction, and operational telemetry. Empirical results (e.g., Grinsztajn et al., Why do tree-based models still outperform deep learning on typical tabular data?, NeurIPS 2022 Track on Datasets and Benchmarks) show that in these domains random forest, gradient boosting, and logistic regression outperform neural networks in both accuracy and reliability. They also offer explainability, which is critical in regulated industries like banking and healthcare.

GPUs often lose their edge here due to data-transfer latency (PCIe overhead) and the poor GPU scaling of some tree-based algorithms. Consequently, CPU-based training remains the most cost-effective choice for small-to-medium structured-data workloads on cloud platforms.

In this article, I’ll walk you through the steps for benchmarking traditional machine learning algorithms on Google Cloud Platform (GCP) CPU offerings, including the Intel® Xeon® 6 that was recently made generally available. (Full disclosure: I am affiliated with Intel as a Senior AI Software Solutions Engineer.)

By systematically comparing runtime, scalability, and cost across algorithms, we can make evidence-based decisions about which approaches deliver the best trade-off between accuracy, speed, and operational cost.

Machine Configuration on Google Cloud

Go to console.cloud.google.com, set up your billing, and head to “Compute Engine.” Then click on “Create instance” to configure your virtual machine (VM). The figure below shows the C4 VM series powered by Intel® Xeon® 6 (code-named Granite Rapids) and the 5th Gen Intel® Xeon® (code-named Emerald Rapids) CPUs.

Setting up a virtual machine on Google Cloud

Hyperthreading can introduce performance variability because two threads compete for the same core’s execution resources. For consistent benchmark results, setting “vCPUs to core ratio” to 1 eliminates that variable—more on this in the next section.

vCPUs to core ratio and visible core counts can be set under “Advanced configurations”

Before creating the VM, increase the boot disk size from the left-hand panel—200 GB will be more than enough to install the packages needed for this blog.

Increasing the boot disk size for the virtual machine on Google Cloud
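 
If you prefer to script the setup instead of clicking through the console, the equivalent gcloud command looks roughly like the sketch below. The instance name, machine type, and zone are placeholders to adjust for your project; --threads-per-core=1 corresponds to the "vCPUs to core ratio" of 1 discussed above.

gcloud compute instances create ml-bench-vm \
    --zone=us-central1-a \
    --machine-type=c4-standard-16 \
    --threads-per-core=1 \
    --boot-disk-size=200GB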

Non-uniform Memory Access (NUMA) Awareness

Memory access is non-uniform on multi-core, multi-socket CPUs. This means the latency and bandwidth of memory operations depend on which CPU core is accessing which region of memory. If you don’t control for NUMA, you’re benchmarking the scheduler, not the CPU, and the results can appear inconsistent. Memory affinity eliminates that problem by controlling which CPU cores access which memory regions. The Linux scheduler is aware of the NUMA topology of the platform and attempts to improve performance by scheduling threads on processors in the same node as the memory they use, rather than randomly assigning work across the system. However, without explicit affinity controls, you can’t guarantee consistent placement for reliable benchmarking.

Let’s do a hands-on NUMA experiment with XGBoost and a synthetic dataset that is large enough to stress memory.

First, provision a VM that spans multiple NUMA nodes, SSH into the instance, and install the dependencies.

sudo apt update && sudo apt install -y python3-venv numactl
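 
Before running the experiment, it’s worth confirming that the instance actually exposes more than one NUMA node; if it reports a single node, the binding experiments below won’t show a meaningful difference.

numactl --hardware     # lists NUMA nodes, their CPUs, and memory sizes
lscpu | grep -i numa   # quick summary of NUMA node count and CPU ranges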

Then create and activate a Python virtual environment and install scikit-learn, numpy, and xgboost.
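 
For example (the environment name bench-env is arbitrary):

python3 -m venv ~/bench-env
source ~/bench-env/bin/activate
pip install scikit-learn numpy xgboost

With the dependencies in place, save the script below as xgb_bench.py.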

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from time import time

# 10M samples, 100 features
X, y = make_classification(n_samples=10_000_000, n_features=100, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "max_depth": 8,
    "nthread": 0,  # use all available threads
}

start = time()
xgb.train(params, dtrain, num_boost_round=100)
print("Elapsed:", time() - start, "seconds")

Next, run this script in three modes (baseline / numa0 / interleave). Repeat each experiment at least five times and report the mean and standard deviation. (This calls for another simple script; a sketch follows the commands below.)

# Run without NUMA binding
python3 xgb_bench.py
# Run with NUMA binding to a single node
numactl --cpunodebind=0 --membind=0 python3 xgb_bench.py

When assigning tasks to specific physical cores, use the --physcpubind or -C option rather than --cpunodebind.

# Run with interleaved memory across nodes
numactl --interleave=all python3 xgb_bench.py
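 
Here is a minimal driver script for the repetition; it assumes xgb_bench.py prints the "Elapsed:" line shown above and that all three commands are run from the same directory.

import re
import statistics
import subprocess

# The three modes mirror the commands above.
MODES = {
    "baseline": ["python3", "xgb_bench.py"],
    "numa0": ["numactl", "--cpunodebind=0", "--membind=0", "python3", "xgb_bench.py"],
    "interleave": ["numactl", "--interleave=all", "python3", "xgb_bench.py"],
}
REPEATS = 5

for name, cmd in MODES.items():
    timings = []
    for _ in range(REPEATS):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        # xgb_bench.py prints "Elapsed: <seconds> seconds"
        timings.append(float(re.search(r"Elapsed: ([\d.]+)", out).group(1)))
    print(f"{name}: mean={statistics.mean(timings):.1f}s "
          f"stdev={statistics.stdev(timings):.1f}s")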

Which experiment had the smallest mean? What about the standard deviation? When interpreting these numbers, keep in mind that:

  • A lower standard deviation for numa0 indicates more stable memory locality.
  • A lower mean for numa0 versus baseline suggests that cross-node traffic was hurting you.
  • If interleave narrows the gap versus baseline, your workload is bandwidth sensitive and benefits from spreading pages across nodes, at a potential cost to latency.

If none of these apply to a benchmark, the workload may be compute-bound (e.g., shallow trees, small dataset), or the VM might expose a single NUMA node.

Choosing the Right Benchmarks

When benchmarking classical machine learning algorithms on CPUs, you can build your own testing framework, leverage existing benchmark suites, or combine the two in a hybrid approach.

Existing test suites such as scikit-learn_bench and Phoronix Test Suite (PTS) are helpful when you need standardized, reproducible results that others can validate and compare against. They work particularly well if you’re evaluating well-established algorithms like random forest, SVM, or XGBoost, where standard datasets provide meaningful insights.

Custom benchmarks excel at revealing implementation-specific performance characteristics. For instance, they can measure how different sparse matrix formats affect SVM training times, or how feature preprocessing pipelines impact overall throughput on your specific CPU architecture.

The datasets you use directly influence what your benchmark reveals. Feel free to consult the official scikit-learn benchmarks for inspiration. The table below lists a sample set of datasets you can use to build a custom test.

Dataset              Size       Task                         Source
Higgs                11M rows   Binary classification        UCI ML Repo
Airline Delay        Variable   Multi-class classification   BTS
California Housing   20K rows   Regression                   sklearn.datasets.fetch_california_housing
Synthetic            Variable   Scaling tests                sklearn.datasets.make_classification

Synthetic scaling datasets are especially useful for exposing differences in cache behavior, memory bandwidth, and I/O.
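 
As a starting point, a scaling sweep can be as simple as generating progressively larger synthetic datasets and timing your estimator of choice on each; the sizes below are arbitrary.

from sklearn.datasets import make_classification

# Arbitrary sizes chosen to sweep from cache-resident to memory-bound.
for n_samples in (100_000, 1_000_000, 10_000_000):
    X, y = make_classification(n_samples=n_samples, n_features=100, random_state=42)
    print(f"{n_samples:>10,} samples -> {X.nbytes / 1e9:.1f} GB of features")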

In the rest of this blog, we illustrate how you can run experiments using the open-source scikit-learn_bench, which currently supports the scikit-learn, cuML, and XGBoost frameworks.

Installations and Benchmarking

Once the GCP VM is initialized, you can SSH into the instance and execute the commands below in your terminal.

sudo apt update && sudo apt upgrade -y 
sudo apt install -y git wget numactl

To install Conda on a GCP VM, you’ll need to account for the CPU architecture. If you’re unsure about the architecture of your VM, you can run

uname -m

before proceeding to

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh

# Use the installer for Linux aarch64 if your VM is based on Arm architecture.
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh -O ~/miniconda.sh

Next, you need to execute the script and accept the terms of service (ToS).

bash ~/miniconda.sh
source ~/.bashrc

Finally, clone the latest scikit-learn_bench from GitHub, create a conda environment, and install the required Python libraries.

git clone https://github.com/IntelPython/scikit-learn_bench.git
cd scikit-learn_bench
conda env create -n sklearn_bench -f envs/conda-env-sklearn.yml
conda activate sklearn_bench

At this point, you should be able to run a benchmark using the sklbench module and a specific configuration:

python -m sklbench --config configs/xgboost_example.json

By default, sklbench benchmarks both the standard scikit-learn implementations and their optimized counterparts provided by sklearnex (Intel’s accelerated extension)—or other supported frameworks like cuML or XGBoost—and logs results along with hardware and software metadata into result.json. You can customize the output file with --result-file, and include --report to produce an Excel report (report.xlsx). For a list of all supported options, see the documentation.
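 
For example, to write results to a custom file and generate the Excel report in one run (the result file name here is arbitrary):

python -m sklbench --config configs/xgboost_example.json --result-file xgb_c4_results.json --report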

As discussed earlier, you can use numactl to pin a process and its child processes to specific CPU cores. Here’s how to run sklbench with numactl, binding it to selected cores:

cores="0-3"
export runid="$(date +%Y%m%d%H%M%S)"

numactl --physcpubind $cores python3 -m sklbench 
  --config configs/regular 
  --filters algorithm:library=sklearnex algorithm:device=cpu algorithm:estimator=RandomForestClassifier 
  --result-file $result-${runid}.json

Interpreting Results and Best Practices

The report generator allows you to combine the result files of multiple runs.

python -m sklbench.report --result-files result-<first-runid>.json result-<second-runid>.json

The real metric for cloud decision-making is the cost per task, namely,

Cost per task = runtime (in hours) × hourly VM price.

Real-world deployments rarely behave like single benchmark runs. To accurately model the cost per task, it’s useful to account for CPU boost behavior, cloud infrastructure variability, and memory topology, as they can all influence performance in ways that aren’t captured by a one-off measurement. To better reflect actual runtime characteristics, I recommend starting with warm-up iterations to stabilize CPU frequency scaling. Then run each experiment multiple times to account for system noise and transient effects. Reporting the mean and standard deviation helps surface consistent trends, while using medians can be more robust when variance is high, especially in cloud environments where noisy neighbors or resource contention can skew averages.

For reproducibility, it’s important to fix package versions and use consistent VM image snapshots. Including NUMA configuration in your results helps others understand memory locality effects, which can significantly impact performance. Tools like scikit-learn_bench automate many of these steps, making it easier to produce benchmarks that are both representative and repeatable.
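 
To make the cost calculation concrete, here is a minimal sketch; the runtimes and the hourly price are placeholders, so substitute your own measurements and the on-demand rate for your machine type and region.

import statistics

# Placeholder training runtimes (seconds) from repeated benchmark runs.
runtimes_s = [412.0, 405.3, 418.7, 409.9, 407.1]
hourly_price_usd = 0.85  # placeholder; look up your VM's actual on-demand rate

median_hours = statistics.median(runtimes_s) / 3600
print(f"Median runtime: {median_hours:.3f} h")
print(f"Cost per task:  ${median_hours * hourly_price_usd:.4f}")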

If you found this article valuable, please consider sharing it with your network. For more AI development how-to content, visit Intel® AI Development Resources.

Acknowledgments

The author thanks Neal Dixon, Miriam Gonzales, Chris Liebert, and Rachel Novak for providing feedback on an earlier draft of this work.
