
# Introduction
Machine learning has become an integral part of how many companies operate, and businesses that don’t utilize it risk being left behind. Given how critical models are to gaining a competitive advantage, it’s natural that many companies want to integrate them into their systems.
There are many ways to set up a machine learning pipeline system to help a business, and one option is to host it with a cloud provider. There are many advantages to developing and deploying machine learning models in the cloud, including scalability, cost-efficiency, and simplified processes compared to building the entire pipeline in-house.
The cloud provider selection is up to the business, but in this article, we will explore how to set up a machine learning pipeline on the Google Cloud Platform (GCP).
Let’s get started.
# Preparation
You must have a Google Account before proceeding, as we will be using the GCP. Once you’ve created an account, access the Google Cloud Console.
Once in the console, create a new project.
Then, before anything else, you need to set up your billing configuration. GCP requires you to register payment information before you can do most things on the platform, even with a free trial account. Don’t worry, though: the example we’ll use won’t consume much of your free credit.
Fill in all the billing information required to start the project. You may also need your tax information and a credit card, so have them ready.
With everything in place, let’s start building our machine learning pipeline with GCP.
# Machine Learning Pipeline with Google Cloud Platform
To build our machine learning pipeline, we will need an example dataset. We will use the Heart Attack Prediction dataset from Kaggle for this tutorial. Download the data and store it somewhere for now.
Next, we must set up data storage for our dataset, which the machine learning pipeline will use. To do that, search for ‘Cloud Storage’ in the console and create a bucket. The bucket name must be globally unique. For now, you don’t need to change any of the default settings; just click the create button.
Once the bucket is created, upload your CSV file to it. If you’ve done this correctly, you will see the dataset inside the bucket.
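If you’d rather script the upload than use the console, a minimal sketch with the google-cloud-storage Python client is shown below; the bucket name and file paths are placeholders for whatever you chose.

```python
# Sketch: upload the CSV programmatically instead of through the console.
# Replace the bucket name and file paths with your own values.
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("your-unique-bucket-name")
blob = bucket.blob("heart_attack.csv")         # object name inside the bucket
blob.upload_from_filename("heart_attack.csv")  # local path to the downloaded file
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```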
Next, we’ll create a new table that we can query using the BigQuery service. Search for ‘BigQuery’ and click ‘Add Data’. Choose ‘Google Cloud Storage’ and select the CSV file from the bucket we created earlier.
Fill out the required information, especially the destination project, the dataset (create a new one or select an existing one), and the table name. For the schema, select ‘Auto-detect’ and then create the table.
If you’ve created it successfully, you can query the table to see if you can access the dataset.
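If you prefer to script this step as well, the snippet below is a rough programmatic equivalent of the ‘Add Data’ flow, using the BigQuery client with schema auto-detection; the project, dataset, table, and bucket names are placeholders.

```python
# Sketch: load the CSV from Cloud Storage into a BigQuery table with
# schema auto-detection. Replace the placeholder names with your own.
from google.cloud import bigquery

bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)
uri = "gs://your-unique-bucket-name/heart_attack.csv"
load_job = bq_client.load_table_from_uri(
    uri, "your-project-id.your_dataset.heart_attack", job_config=job_config
)
load_job.result()  # wait for the load job to finish
```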
Next, search for Vertex AI and enable all the recommended APIs. Once that’s finished, select ‘Colab Enterprise’.
Select ‘Create Notebook’ to create the notebook we’ll use for our simple machine learning pipeline.
If you are familiar with Google Colab, the interface will look very similar. You can import a notebook from an external source if you want.
With the notebook ready, connect to a runtime. For now, the default machine type will suffice as we don’t need many resources.
Let’s start our machine learning pipeline development by querying data from our BigQuery table. First, we need to initialize the BigQuery client with the following code.
```python
from google.cloud import bigquery

client = bigquery.Client()
```
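In Colab Enterprise, the client usually picks up the project from the runtime automatically; if it doesn’t, you can pass the project ID explicitly (the value below is a placeholder).

```python
# Only needed if the default project is not inferred from the environment.
client = bigquery.Client(project="your-project-id")
```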
Then, let’s query our dataset in the BigQuery table using the following code. Change the project ID, dataset, and table name to match what you created previously.
```python
# TODO: Replace with your project ID, dataset, and table name
query = """
SELECT *
FROM `your-project-id.your_dataset.heart_attack`
LIMIT 1000
"""

query_job = client.query(query)
df = query_job.to_dataframe()
```
The data is now in a pandas DataFrame in our notebook. Let’s transform our target variable (‘Outcome’) into a numerical label.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df['Outcome'] = df['Outcome'].apply(lambda x: 1 if x == 'Heart Attack' else 0)
```
Next, let’s prepare our training and testing datasets.
```python
df = df.select_dtypes('number')

X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
⚠️ Note: `df = df.select_dtypes('number')` is used to simplify the example by dropping all non-numeric columns. In a real-world scenario, this is an aggressive step that can discard useful categorical features; normally you would consider feature engineering or encoding them instead.
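As an illustration, here is one way the categorical columns could be kept instead, by one-hot encoding them in place of the `select_dtypes` line. The exact columns depend on the dataset, so treat this as a sketch rather than a drop-in replacement.

```python
# Sketch only: one-hot encode the non-numeric columns instead of dropping them.
# This would replace the `df = df.select_dtypes('number')` line above and must
# run on the original DataFrame, before any columns are removed.
categorical_cols = df.select_dtypes(exclude='number').columns
df_encoded = pd.get_dummies(df, columns=list(categorical_cols), drop_first=True)

X = df_encoded.drop('Outcome', axis=1)
y = df_encoded['Outcome']
```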
Once the data is ready, let’s train a model and evaluate its performance.
```python
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Model Accuracy: {accuracy_score(y_test, y_pred)}")
```
The model accuracy is only around 0.5. This could certainly be improved, but for this example, we’ll proceed with this simple model.
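One possible improvement, sketched below without being tuned for this dataset, is to scale the features and use cross-validation to get a more reliable accuracy estimate.

```python
# Sketch of one possible improvement: scale the features and cross-validate.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"Cross-validated accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```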
Now, let’s use our model to make predictions and prepare the results.
```python
result_df = X_test.copy()
result_df['actual'] = y_test.values
result_df['predicted'] = y_pred
result_df.reset_index(inplace=True)
```
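If you also want to keep track of how confident the model was, you could optionally store the predicted probability of the positive class as an extra column; the column name below is just an illustrative choice.

```python
# Optional: store the predicted probability of the positive class as well.
# The column name 'predicted_proba' is an arbitrary choice for this example.
result_df['predicted_proba'] = model.predict_proba(X_test)[:, 1]
```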
Finally, we will save our model’s predictions to a new BigQuery table. Note that the following code will overwrite the destination table if it already exists, rather than appending to it.
```python
# TODO: Replace with your project ID and destination dataset/table
destination_table = "your-project-id.your_dataset.heart_attack_predictions"

job_config = bigquery.LoadJobConfig(write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)
load_job = client.load_table_from_dataframe(result_df, destination_table, job_config=job_config)
load_job.result()
```
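As a quick, optional sanity check, you can query the new table from the same notebook to confirm the predictions landed, using the same placeholder table name as above.

```python
# Optional sanity check: preview a few rows of the predictions table.
preview = client.query(
    "SELECT * FROM `your-project-id.your_dataset.heart_attack_predictions` LIMIT 5"
).to_dataframe()
print(preview)
```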
With that, you have created a simple machine learning pipeline inside a Vertex AI Notebook.
To streamline this process, you can schedule the notebook to run automatically. Go to your notebook’s actions and select ‘Schedule’.
Select the frequency you need for the notebook to run, for example, every Tuesday or on the first day of the month. This is a simple way to ensure the machine learning pipeline runs as required.
That’s it for setting up a simple machine learning pipeline on GCP. There are many other, more production-ready ways to set up a pipeline, such as using Kubeflow Pipelines (KFP) or the more integrated Vertex AI Pipelines service.
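For a rough sense of what that looks like, the sketch below defines a minimal Kubeflow Pipelines (KFP v2) pipeline. The component, pipeline, and file names are illustrative placeholders; a real pipeline would wire the query, training, and prediction-writing steps above into separate components and run them on Vertex AI Pipelines.

```python
# Illustrative KFP v2 sketch (names are placeholders, not a working pipeline
# for this dataset). Each notebook step would become its own component.
from kfp import dsl, compiler

@dsl.component(base_image="python:3.10")
def train_model(dataset_table: str) -> str:
    # Placeholder for the training logic from the notebook.
    return f"trained on {dataset_table}"

@dsl.pipeline(name="heart-attack-pipeline")
def heart_attack_pipeline():
    train_model(dataset_table="your-project-id.your_dataset.heart_attack")

compiler.Compiler().compile(heart_attack_pipeline, "heart_attack_pipeline.json")
```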
# Conclusion
Google Cloud Platform provides an easy way for users to set up a machine learning pipeline. In this article, we learned how to set up a pipeline using various cloud services like Cloud Storage, BigQuery, and Vertex AI. By creating the pipeline in notebook form and scheduling it to run automatically, we can create a simple, functional pipeline.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.