When we deal with classification algorithms in machine learning, such as Logistic Regression, K-Nearest Neighbors, or Support Vector Classifiers, we don’t use evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE).
Instead, we generate a confusion matrix, and based on the confusion matrix, a classification report.
In this blog, we aim to understand what a confusion matrix is, how to calculate Accuracy, Precision, Recall and F1-Score using it, and how to select the relevant metric based on the characteristics of the data.
To understand the confusion matrix and classification metrics, let’s use the Breast Cancer Wisconsin Dataset.
This dataset consists of 569 rows, and each row provides information on various features of a tumor along with its diagnosis, whether it is malignant (cancerous) or benign (non-cancerous).
Now let’s build a classification model that classifies the tumors based on their features.
We will train a Logistic Regression model on this dataset.
Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
column_names = [
"id", "diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean",
"compactness_mean", "concavity_mean", "concave_points_mean", "symmetry_mean", "fractal_dimension_mean",
"radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se",
"concave_points_se", "symmetry_se", "fractal_dimension_se", "radius_worst", "texture_worst",
"perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", "concavity_worst",
"concave_points_worst", "symmetry_worst", "fractal_dimension_worst"
]
df = pd.read_csv("C:/wdbc.data", header=None, names=column_names)
# Drop ID column
df = df.drop(columns=["id"])
# Encode target: M=1 (malignant), B=0 (benign)
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
# Split features and target
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Confusion Matrix and Classification Report
conf_matrix = confusion_matrix(y_test, y_pred, labels=[1, 0]) # 1 = Malignant, 0 = Benign
report = classification_report(y_test, y_pred, labels=[1, 0], target_names=["Malignant", "Benign"])
# Display results
print("Confusion Matrix:n", conf_matrix)
print("nClassification Report:n", report)
# Plot Confusion Matrix
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Purples", xticklabels=["Malignant", "Benign"], yticklabels=["Malignant", "Benign"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()
Here, after applying logistic regression to the data, we generated a confusion matrix and a classification report to evaluate the model’s performance.
First, let’s understand the confusion matrix.
From the above confusion matrix:
‘60’ represents the correctly predicted Malignant Tumors, which we refer to as “True Positives”.
‘4’ represents tumors predicted as Benign that are actually Malignant, which we refer to as “False Negatives”.
‘1’ represents tumors predicted as Malignant that are actually Benign, which we refer to as “False Positives”.
‘106’ represents the correctly predicted Benign Tumors, which we refer to as “True Negatives”.
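If you want to pull these four numbers out of the matrix programmatically, here is a minimal sketch; it assumes conf_matrix was built with labels=[1, 0] as in the code above, so the Malignant row and column come first.
# With labels=[1, 0], row 0 / column 0 correspond to Malignant (the positive class),
# so flattening the 2x2 matrix gives TP, FN, FP, TN in that order.
tp, fn, fp, tn = conf_matrix.ravel()
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)  # expected: 60 4 1 106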
Now let’s see what we can do with these values.
For that, we consider the classification report.

From the above classification report, we can say that:
For Malignant:
– Precision is 0.98, which means when the model predicts the tumor as Malignant, it is correct 98% of the time.
– Recall is 0.94, which means the model correctly identified 94% of all Malignant Tumors.
– F1-score is 0.96, which balances both the precision and recall.
For Benign:
– Precision is 0.96, which means when the model predicts the tumor as Benign, it is correct 96% of the time.
– Recall is 0.99, which means the model correctly identified 99% of all Benign Tumors.
– F1-score is 0.98.
From the report we can observe that the accuracy of the model is 97%.
We also have the Macro Average and the Weighted Average; let’s see how these are calculated.
Macro Average
The Macro Average calculates the average of each metric (precision, recall and f1-score) across both classes, giving equal weight to each class regardless of how many samples it contains.
We use the macro average when we want to know the model’s performance across all classes while ignoring class imbalance.
For this data:

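As a quick check, here is a minimal sketch (plain Python arithmetic, using the per-class scores reported above) that reproduces the macro averages:
# Per-class scores from the classification report above
precision_malignant, recall_malignant, f1_malignant = 0.98, 0.94, 0.96
precision_benign, recall_benign, f1_benign = 0.96, 0.99, 0.98

# Macro average: simple (unweighted) mean over the two classes
macro_precision = (precision_malignant + precision_benign) / 2  # 0.97
macro_recall = (recall_malignant + recall_benign) / 2           # ~0.97
macro_f1 = (f1_malignant + f1_benign) / 2                       # 0.97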
Weighted Average
Weighted Average also calculates the average of all metrics but gives more weight to the class with more samples.
In the above code, we used test_size=0.3, which means we set aside 30% of the data for testing, so the test set contains 171 of the 569 samples.
The confusion matrix and classification report are based on this test set.
Out of the 171 samples in the test set, 64 are Malignant tumors and 107 are Benign tumors.
Now let’s see how this weighted average is calculated for all metrics.

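Here is a similar sketch for the weighted averages, using the class counts from the test set (64 Malignant, 107 Benign) as weights:
# Support (number of test samples) per class
n_malignant, n_benign, n_total = 64, 107, 171

# Weighted average: mean over classes, weighted by class support
weighted_precision = (n_malignant * 0.98 + n_benign * 0.96) / n_total  # ~0.97
weighted_recall = (n_malignant * 0.94 + n_benign * 0.99) / n_total     # ~0.97
weighted_f1 = (n_malignant * 0.96 + n_benign * 0.98) / n_total         # ~0.97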
The weighted average gives us a more realistic performance measure when we have class-imbalanced datasets.
We now have an idea of every term in the classification report, as well as how the macro and weighted averages are calculated.
Now let’s see how the confusion matrix is used to generate the classification report.
The classification report contains different metrics like accuracy, precision, etc., and these metrics are calculated from the values in the confusion matrix.
From the confusion matrix, we have:
True Positives (TP) = 60
False Negatives (FN) = 4
False Positives (FP) = 1
True Negatives (TN) = 106
Now let’s calculate the classification metrics using these values.

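As a worked example, here is a minimal sketch (plain Python, using only these four values) that reproduces the Malignant-class scores and the overall accuracy from the report above; for the Benign class, the roles of the two classes are simply swapped (TN and TP, FN and FP trade places).
# Values taken from the confusion matrix above
tp, fn, fp, tn = 60, 4, 1, 106

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 166 / 171 ≈ 0.97
precision = tp / (tp + fp)                          # 60 / 61  ≈ 0.98
recall = tp / (tp + fn)                             # 60 / 64  ≈ 0.94
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.96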
This is how we calculate the classification metrics using a confusion matrix.
But why do we have four different classification metrics instead of a single metric like accuracy? It’s because different metrics reveal different strengths and weaknesses of the classifier, depending on the context of the data.
Now let’s come back to the Wisconsin Breast Cancer Dataset which we used here.
When we applied a logistic regression model to this data, we got an accuracy of 97%, which is high and may make us think the model performs well.
But let’s consider another metric, recall, which is 0.94 for this model: out of all the malignant tumors in the test set, the model identified 94% of them correctly.
Here the model missed 6% of malignant cases.
In real-world scenarios, especially healthcare applications like cancer detection, missing a positive case might delay diagnosis and treatment.
This shows that even with an accuracy of 97%, we need to look deeper, based on the context of the data, by considering different metrics.
So, what can we do now? Should we aim for a recall of 1.0, meaning all malignant tumors are identified correctly? If we push recall to 1.0, precision drops, because the model may classify more benign tumors as malignant.
When the model classifies more benign tumors as malignant, it causes unnecessary anxiety and may lead to additional tests or treatments.
Here, we should aim to maximize recall while keeping precision reasonably high.
We can do this by changing the decision threshold the classifier uses to classify the samples.
Most classifiers set the threshold to 0.5; if we change it to 0.3, we are saying that even if the model is only 30% confident that a tumor is malignant, it should classify it as malignant.
Now let’s use a custom threshold of 0.3.
Code:
# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Predict probabilities
y_probs = model.predict_proba(X_test)[:, 1]
# Apply custom threshold
threshold = 0.3
y_pred_custom = (y_probs >= threshold).astype(int)
# Classification Report
report = classification_report(y_test, y_pred_custom, labels=[1, 0], target_names=["Malignant", "Benign"])
print("Classification Report:\n", report)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_custom, labels=[1, 0])
# Plot Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(
conf_matrix,
annot=True,
fmt="d",
cmap="Purples",
xticklabels=["Malignant", "Benign"],
yticklabels=["Malignant", "Benign"]
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix (Threshold = 0.3)")
plt.tight_layout()
plt.show()
Here we applied a custom threshold of 0.3 and generated a confusion matrix and a classification report.

Classification Report:

Here, the accuracy increased to 98%, the recall for Malignant increased to 0.97, and the precision remained the same.
We discussed earlier that precision might drop if we try to maximize recall, but here it stayed the same; whether precision actually drops depends on the data (whether it is balanced or not), the preprocessing steps, and how the threshold is tuned.
For medical datasets like this, maximizing recall is often preferred over accuracy or precision.
For problems like spam detection or fraud detection, we prefer precision, and, just as above, we can improve it by tuning the threshold while balancing the tradeoff between precision and recall.
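As an illustration (not part of the breast cancer example above), here is a minimal sketch of how one might pick a threshold that keeps precision above a chosen target, using scikit-learn’s precision_recall_curve on predicted probabilities; the 0.95 target is a hypothetical choice.
from sklearn.metrics import precision_recall_curve

# y_probs holds the predicted probabilities for the positive class, as computed earlier
precisions, recalls, thresholds = precision_recall_curve(y_test, y_probs)

# Hypothetical target: keep precision at or above 0.95
target_precision = 0.95
# precisions/recalls have one more element than thresholds, so drop their last point
candidates = [(t, p, r) for t, p, r in zip(thresholds, precisions[:-1], recalls[:-1]) if p >= target_precision]
if candidates:
    # Among thresholds meeting the precision target, pick the one with the highest recall
    best_threshold, best_precision, best_recall = max(candidates, key=lambda c: c[2])
    print(f"Threshold: {best_threshold:.2f}, precision: {best_precision:.2f}, recall: {best_recall:.2f}")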
We use the f1-score when the data is imbalanced and we care about both precision and recall, i.e., when neither false positives nor false negatives can be ignored.
Dataset Source
Wisconsin Breast Cancer Dataset
Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993). Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license and is free to use for commercial or educational purposes as long as proper credit is given to the original source.
Here we discussed what a confusion matrix is and how it is used to calculate the different classification metrics like accuracy, precision, recall and f1-score.
We also explored when to prioritize which classification metric, using the Wisconsin cancer dataset as an example, where we preferred maximizing recall.
I hope you found this blog helpful in understanding the confusion matrix and classification metrics more clearly.
Thanks for reading.