https://github.com/syrax90/dynamic-solov2-tensorflow2 – Source code of the project described in the article.
Disclaimer
⚠️ First of all, note that this project is not production-ready code.
Why I Decided to Implement It from Scratch
This project targets people who don’t have high-performance hardware (a GPU in particular) but want to study computer vision, or who are at least on their way to becoming someone interested in this area. I tried to make the code as clear as possible: I used Google’s docstring style for all methods and classes, added comments inside the code to make the logic and calculations easier to follow, and applied the Single Responsibility Principle and other OOP principles to keep the code human-readable.
As the title of the article suggests, I decided to implement Dynamic SOLO from scratch to deeply understand all the intricacies of implementing such models, including the entire development cycle, to better understand the problems that can be encountered in computer vision tasks, and to gain valuable experience in creating computer vision models with TensorFlow. Looking ahead, I can say that I was not mistaken with this choice, since it brought me a lot of new skills and knowledge.
I would recommend implementing models from scratch to everyone who wants to understand how they work more deeply. Here is why:
- When you encounter a misunderstanding about something, you start to delve deeper into the specific problem. By exploring the problem, you find an answer to the question of why a particular approach was invented, and thus expand your knowledge in this area.
- When you understand the theory behind an approach or principle, you start to explore how to implement it using existing technical tools. In this way, you improve your technical skills for solving specific problems.
- When implementing something from scratch, you better understand the value of the effort, time, and resources that such tasks can consume. By comparing them with similar tasks, you can estimate costs more accurately and get a better idea of the value of similar work, including preparation, research, technical implementation, and even documentation.
TensorFlow was chosen as the framework simply because I use this framework for most of my machine learning tasks (nothing special here).
The project is an implementation of the Dynamic SOLO (SOLOv2) model using the TensorFlow 2 framework.
SOLO: A Simple Framework for Instance Segmentation,
Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, Lei Li
arXiv preprint (arXiv:2106.15947)
SOLO (Segmenting Objects by Locations) is a model designed for computer vision tasks, in particular for instance segmentation. It is a fully anchor-free framework that predicts masks without any bounding boxes. The paper presents several variants of the model: Vanilla SOLO, Decoupled SOLO, Dynamic SOLO, and Decoupled Dynamic SOLO. I actually implemented Vanilla SOLO first because it is the simplest of them all. But I’m not going to publish that code, because there is no big difference between Vanilla and Dynamic SOLO from an implementation point of view.
Model
Actually, the model can be very flexible within the principles described in the SOLO paper: from the number of FPN levels to the number of parameters in the layers. I decided to start with the simplest implementation. The basic idea of the model is to divide the entire image into cells, where one grid cell can represent at most one instance: a predicted class plus a segmentation mask.

Backbone
I chose ResNet50 as the backbone because it is a lightweight network that suits a starting point perfectly. I didn’t use pretrained parameters for ResNet50 because I was experimenting with more than just the original COCO dataset. However, you can use pretrained weights if you intend to train on the original COCO dataset: it saves time, speeds up the training process, and improves performance.
# Standard Keras import (assumed); weights='imagenet' loads pretrained ImageNet parameters
from tensorflow.keras.applications import ResNet50
backbone = ResNet50(weights='imagenet', include_top=False, input_shape=input_shape)
backbone.trainable = False  # freeze the backbone so only the neck and head are trained
Neck
FPN (Feature Pyramid Network) is used as the neck for extracting multi-scale features. Within the FPN, we use all outputs C2, C3, C4, C5 from the corresponding residual blocks of ResNet50 as described in the FPN paper (Feature Pyramid Networks for Object Detection by Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie). Each FPN level represents a specific scale and has its own grid as shown above.
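For illustration, here is a minimal sketch of how a top-down FPN neck can be assembled from the ResNet50 outputs C2–C5 with Keras layers (my own simplified version; the channel count and layer arrangement are assumptions, not the project’s exact code):

from tensorflow.keras import layers

def build_fpn(c2, c3, c4, c5, out_channels=256):
    # 1x1 convolutions project each backbone output to a common channel depth,
    # then each level is merged with the upsampled level above it (top-down pathway)
    p5 = layers.Conv2D(out_channels, 1)(c5)
    p4 = layers.Add()([layers.Conv2D(out_channels, 1)(c4), layers.UpSampling2D(2)(p5)])
    p3 = layers.Add()([layers.Conv2D(out_channels, 1)(c3), layers.UpSampling2D(2)(p4)])
    p2 = layers.Add()([layers.Conv2D(out_channels, 1)(c2), layers.UpSampling2D(2)(p3)])
    # 3x3 convolutions smooth the merged maps; one output (P2-P5) per pyramid level
    return [layers.Conv2D(out_channels, 3, padding='same')(p) for p in (p2, p3, p4, p5)]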
Note: You shouldn’t use all FPN levels if you work with a small custom dataset where all objects have roughly the same scale. Otherwise, you train extra parameters that are never used and therefore waste GPU resources. In that case, you’d have to adjust the dataset so that it returns targets for just 1 scale instead of all 4.
Head
The outputs of the FPN levels are used as inputs to the layers where the instance class and its mask are determined. The head contains two parallel branches for this purpose: the classification branch and the mask kernel branch.
Note: I excluded Mask Feature from the Head based on the Vanilla Head architecture. Mask Feature is described separately below.

- Classification branch (in the figure above it is designated as “Category”) – is responsible for predicting the class of each instance (grid cell) in an image. It consists of a sequence of Conv2D -> GroupNorm -> ReLU sets arranged in a row. I applied a sequence of 4 such sets.
- Mask branch (in the figure above it is designated as “Mask”) – here is a critical nuance: unlike in the Vanilla SOLO model, it does not generate masks directly. Instead, it predicts a mask kernel (referred to as “Mask kernel” in Section 3.2.3 Dynamic SOLO of the paper), which is later applied through dynamic convolution with the Mask feature described below. This design differentiates Dynamic SOLO from Vanilla SOLO by reducing the number of parameters and creating a more efficient, lightweight architecture. The Mask branch predicts a mask kernel for each instance (grid cell) using the same structure as the Classification branch: a sequence of Conv2D -> GroupNorm -> ReLU sets arranged in a row. I also implemented 4 such sets in the model.
Note: For small custom datasets, you can use even a single such set for both the mask and classification branches, to avoid training unnecessary parameters.
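To make the branch structure concrete, here is a hedged sketch of one such branch as a stack of Conv2D -> GroupNorm -> ReLU blocks (my own simplified version; it assumes tf.keras.layers.GroupNormalization, available in TF 2.11+, and the filter counts are placeholders, not the project’s exact values):

from tensorflow.keras import layers

def head_branch(x, num_blocks=4, filters=256, out_channels=80):
    # Shared pattern of the classification and mask kernel branches: Conv2D -> GroupNorm -> ReLU, repeated
    for _ in range(num_blocks):
        x = layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
        x = layers.GroupNormalization(groups=32)(x)
        x = layers.ReLU()(x)
    # Final prediction layer, e.g. per-cell class logits for the classification branch
    return layers.Conv2D(out_channels, 3, padding='same')(x)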
Mask Feature
The Mask feature branch is combined with the Mask kernel branch to determine the final predicted mask. This layer fuses multi-level FPN features into a unified mask feature map. The authors of the paper evaluated two approaches to implementing the Mask feature branch: a specific mask feature for each FPN level, or one unified mask feature for all FPN levels. Like the authors, I chose the latter. The Mask feature branch and Mask kernel branch are combined via a dynamic convolution operation.
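To give an idea of what the dynamic convolution looks like, here is a simplified sketch of applying one predicted kernel to the unified mask feature map for a single positive grid cell (the shapes and the 1x1 kernel size are illustrative assumptions, not the project’s exact code):

import tensorflow as tf

def apply_dynamic_kernel(mask_feature, kernel_weights):
    # mask_feature: (1, H, W, E) unified mask feature map produced by the Mask feature branch
    # kernel_weights: (E,) kernel predicted by the Mask kernel branch for one positive grid cell
    kernel = tf.reshape(kernel_weights, (1, 1, -1, 1))  # (kernel_h, kernel_w, in_channels, out_channels)
    return tf.nn.conv2d(mask_feature, kernel, strides=1, padding='SAME')  # (1, H, W, 1) mask logits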
Dataset
I chose to work with the COCO dataset format, training my model on both the original COCO dataset and a small custom dataset structured in the same format. I chose the COCO format because it has already been widely researched, which makes writing parsing code much easier. Moreover, LabelMe, the tool I chose to build my custom dataset with, can convert a dataset directly to COCO format. Furthermore, starting with a small custom dataset reduces training time and simplifies the development process. One more reason to create a dataset yourself is the opportunity to better understand the dataset creation process, participate in it directly, and gain new skills in interacting with tools like LabelMe. And a small annotation file can be explored faster and more easily than a large one if you want to dive deeper into the COCO format.
Here are some of the sub-tasks regarding datasets that I encountered while implementing the project (they are presented in the project):
- Data augmentation. Data augmentation of an image dataset is the process of expanding the dataset by applying various image transformation methods to generate new samples that differ from the original ones. Mastering augmentation techniques is essential, especially for small datasets. I applied methods such as horizontal flip, brightness adjustment, random scaling, and random cropping to give an idea of how to do this and to show how important it is that the mask of a modified image matches its new (augmented) image (see the flip sketch after this list).
- Converting to target. The SOLO model expects a specific data format for the target. It takes a normalized image as input, nothing special. But for the target, the model expects more complex data:
- We have to build a grid for each scale, divided into the number of cells defined for that scale. That means that if we have 4 FPN levels – P2, P3, P4, P5 – for different scales, then we will have 4 grids, each with its own number of cells.
- For each instance, we have to determine, by its location, the single cell it belongs to among all the grids (see the grid-cell sketch after this list).
- For each such cell, the category and mask of the corresponding instance are assigned. There is an additional problem of converting the COCO-format mask into a binary mask consisting of ones for the mask pixels and zeros for the rest.
- Combine all of the above into a list of tensors as the target. I understand that TensorFlow prefers a strict set of tensors over structures like a list, but I decided to use a list for the added flexibility you might need if you decide to change the number of scales.
- Dataset in memory or generated on the fly. There are two main options for providing a dataset: storing samples in memory or generating data on the fly. Even though in-memory allocation has a lot of advantages, and many of you would have no problem loading the entire training image directory of the COCO dataset into memory (19.3 GB only), I intentionally chose to generate the dataset dynamically using tf.data.Dataset.from_generator. Here’s why: I think it’s a good skill to learn what problems you might encounter when working with big data and how to solve them, because in real-world problems, datasets may not only contain more samples than COCO, but their resolution may also be much higher. Dynamically generated datasets are generally a bit more complex to implement, but they are more flexible. Of course, you can replace it with tf.data.Dataset.from_tensor_slices if you wish.
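A hedged sketch of how such a dynamically generated dataset can be wired up with tf.data.Dataset.from_generator (the annotations list, the load_and_convert helper, and the output shapes are illustrative assumptions, not the project’s exact signatures; in the project the target is a list of tensors, one per scale):

import tensorflow as tf

def sample_generator(annotations):
    # Yields one (image, target) pair at a time, loading images lazily from disk
    for ann in annotations:
        image, target = load_and_convert(ann)  # hypothetical helper that builds the SOLO target
        yield image, target

dataset = tf.data.Dataset.from_generator(
    lambda: sample_generator(annotations),
    output_signature=(
        tf.TensorSpec(shape=(None, None, 3), dtype=tf.float32),     # normalized image
        tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),  # simplified single-scale target
    ),
).prefetch(tf.data.AUTOTUNE)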
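As mentioned in the data augmentation item above, the key point is to apply exactly the same geometric transform to the image and its masks. A minimal sketch of a synchronized horizontal flip (a hypothetical helper, not the project’s exact code):

import tensorflow as tf

def random_horizontal_flip(image, masks, prob=0.5):
    # image: (H, W, 3); masks: (N, H, W) binary masks, one per instance
    def flip():
        return tf.image.flip_left_right(image), tf.reverse(masks, axis=[2])  # flip both along the width axis
    def identity():
        return image, masks
    return tf.cond(tf.random.uniform(()) < prob, flip, identity)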
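And for the grid assignment mentioned above, here is a tiny sketch of mapping an instance’s center to its cell for one FPN level (a hypothetical helper; the project’s actual logic also considers the instance’s center region rather than a single point):

import tensorflow as tf

def grid_cell_index(center_x, center_y, image_w, image_h, grid_size):
    # Returns (i, j) of the cell that owns an instance whose mask center is (center_x, center_y)
    j = tf.cast(center_x / image_w * grid_size, tf.int32)  # column index, left to right
    i = tf.cast(center_y / image_h * grid_size, tf.int32)  # row index, top to bottom
    return i, j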
Training Process
Loss Function
SOLO uses a loss function that is not natively implemented in TensorFlow, so I implemented it myself.
$$L = L_{cate} + \lambda L_{mask}$$
Where:
- \(L_{cate}\) is the conventional Focal Loss for semantic category classification.
- \(L_{mask}\) is the loss for mask prediction.
- \(\lambda\) is a coefficient that is set to 3 in the paper.
$$
L_{mask} = \frac{1}{N_{pos}} \sum_k \mathbb{1}_{\{p^*_{i,j} > 0\}} \, d_{mask}(m_k, m^*_k)
$$
Where:
- \(N_{pos}\) is the number of positive samples.
- \(d_{mask}\) is implemented as Dice Loss.
- \( i = \lfloor k/S \rfloor \), \( j = k \bmod S \) — indices for grid cells, indexing left to right and top to bottom.
- \(\mathbb{1}\) is the indicator function, being 1 if \(p^*_{i,j} > 0\) and 0 otherwise.
$$L_{Dice} = 1 - D(p, q)$$
Where D is the dice coefficient, which is defined as
$$
D(p, q) = \frac{2 \sum_{x,y} (p_{x,y} \cdot q_{x,y})}{\sum_{x,y} p^2_{x,y} + \sum_{x,y} q^2_{x,y}}
$$
Where \(p_{x,y}\), \(q_{x,y}\) are the pixel values at location \((x, y)\) of the predicted mask p and the ground truth mask q. All details of the loss function are described in Section 3.3.2 Loss Function of the original SOLO paper.
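To make the mask loss concrete, here is a minimal TensorFlow sketch of the Dice loss defined above (my own illustrative version, not necessarily identical to the project’s code):

import tensorflow as tf

def dice_loss(pred_mask, gt_mask, eps=1e-6):
    # pred_mask: sigmoid probability map, gt_mask: binary mask; both of shape (H, W) with values in [0, 1]
    numerator = 2.0 * tf.reduce_sum(pred_mask * gt_mask)
    denominator = tf.reduce_sum(tf.square(pred_mask)) + tf.reduce_sum(tf.square(gt_mask))
    dice = numerator / (denominator + eps)  # the Dice coefficient D(p, q)
    return 1.0 - dice                       # L_Dice = 1 - D(p, q)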
Resuming from Checkpoint
If you use a low-performance GPU, you might encounter situations where training the entire model in a single run is impractical. To avoid losing your trained weights and to be able to continue training later, this project provides a resume-from-checkpoint mechanism. It allows you to save your model every n epochs (where n is configurable) and resume training later. To enable this, set load_previous_model to True and specify model_path in config.py.
self.load_previous_model = True
self.model_path = './weights/coco_epoch00000001.keras'
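For context, here is a hedged sketch of how periodic saving and resuming might be wired up with standard Keras utilities (the project has its own mechanism configured in config.py; cfg and build_model here are hypothetical placeholders):

import tensorflow as tf

def build_or_restore_model(cfg):
    if cfg.load_previous_model:
        # A .keras file keeps the optimizer state, so training can continue where it stopped
        return tf.keras.models.load_model(cfg.model_path, compile=True)
    return build_model()  # hypothetical function that builds and compiles a fresh model

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='./weights/coco_epoch{epoch:08d}.keras',
    save_freq='epoch',  # saves every epoch; wrap in a custom callback to save only every n epochs
)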
Evaluation Process
To see how effectively your model is trained and how well it behaves on previously unseen images, an evaluation process is used. For the SOLO model, I would break down the process into the following steps:
- Loading a test dataset.
- Preparing the dataset to be compatible for the model’s input.
- Feeding the data into the model.
- Suppressing redundant lower-probability masks that represent the same instance.
- Visualization of the original test image with the final mask and predicted category for each instance.
The most unusual task I faced here was implementing Matrix NMS (non-maximum suppression), described in Section 3.3.4 Matrix NMS of the original SOLO paper. NMS eliminates redundant masks that represent the same instance with lower probability. To avoid predicting the same instance multiple times, we need to suppress these duplicate masks. The authors provided Python-like pseudo-code for Matrix NMS, and one of my tasks was to interpret this pseudo-code and implement it with TensorFlow. My implementation:
def matrix_nms(masks, scores, labels, pre_nms_k=500, post_nms_k=100, score_threshold=0.5, sigma=0.5):
"""
Perform class-wise Matrix NMS on instance masks.
Parameters:
masks (tf.Tensor): Tensor of shape (N, H, W) with each mask as a sigmoid probability map (0~1).
scores (tf.Tensor): Tensor of shape (N,) with confidence scores for each mask.
labels (tf.Tensor): Tensor of shape (N,) with class labels for each mask (ints).
pre_nms_k (int): Number of top-scoring masks to keep before applying NMS.
post_nms_k (int): Number of final masks to keep after NMS.
score_threshold (float): Score threshold to filter out masks after NMS (default 0.5).
sigma (float): Sigma value for Gaussian decay.
Returns:
tf.Tensor: Tensor of indices of masks kept after suppression.
"""
# Binarize masks at 0.5 threshold
seg_masks = tf.cast(masks >= 0.5, dtype=tf.float32) # shape: (N, H, W)
mask_sum = tf.reduce_sum(seg_masks, axis=[1, 2]) # shape: (N,)
# If desired, select top pre_nms_k by score to limit computation
num_masks = tf.shape(scores)[0]
if pre_nms_k is not None:
num_selected = tf.minimum(pre_nms_k, num_masks)
else:
num_selected = num_masks
topk_indices = tf.argsort(scores, direction='DESCENDING')[:num_selected]
seg_masks = tf.gather(seg_masks, topk_indices) # select masks by top scores
labels_sel = tf.gather(labels, topk_indices)
scores_sel = tf.gather(scores, topk_indices)
mask_sum_sel = tf.gather(mask_sum, topk_indices)
# Flatten masks for matrix operations
N = tf.shape(seg_masks)[0]
seg_masks_flat = tf.reshape(seg_masks, (N, -1)) # shape: (N, H*W)
# Compute intersection and IoU matrix (N x N)
intersection = tf.matmul(seg_masks_flat, seg_masks_flat, transpose_b=True) # pairwise intersect counts
# Expand mask areas to full matrices
mask_sum_matrix = tf.tile(mask_sum_sel[tf.newaxis, :], [N, 1]) # shape: (N, N)
union = mask_sum_matrix + tf.transpose(mask_sum_matrix) - intersection
iou = intersection / (union + 1e-6) # IoU matrix (avoid div-by-zero)
    # Zero out diagonal and lower triangle (keep only pairs i < j, where mask i has the higher score)
    iou = tf.linalg.band_part(iou, 0, -1) - tf.linalg.band_part(iou, 0, 0)
    # Class-wise NMS: only masks of the same class suppress each other
    labels_matrix = tf.tile(labels_sel[tf.newaxis, :], [N, 1])  # shape: (N, N)
    same_class = tf.cast(tf.equal(labels_matrix, tf.transpose(labels_matrix)), tf.float32)
    decay_iou = iou * same_class
    # For each mask, the highest IoU it has with any higher-scoring mask of the same class
    compensate_iou = tf.reduce_max(decay_iou, axis=0)  # shape: (N,)
    compensate_iou = tf.tile(compensate_iou[:, tf.newaxis], [1, N])  # shape: (N, N)
    # Gaussian decay of the scores (Matrix NMS)
    decay_matrix = tf.exp(-(tf.square(decay_iou) - tf.square(compensate_iou)) / sigma)
    decay_coefficient = tf.reduce_min(decay_matrix, axis=0)  # shape: (N,)
    new_scores = scores_sel * decay_coefficient
    keep_mask = new_scores >= score_threshold  # boolean mask of those above threshold
    new_scores = tf.where(keep_mask, new_scores, tf.zeros_like(new_scores))

    # Select top post_nms_k by the decayed scores
    if post_nms_k is not None:
        num_final = tf.minimum(post_nms_k, tf.shape(new_scores)[0])
    else:
        num_final = tf.shape(new_scores)[0]
    final_indices = tf.argsort(new_scores, direction='DESCENDING')[:num_final]
    final_indices = tf.boolean_mask(final_indices, tf.greater(tf.gather(new_scores, final_indices), 0))

    # Map back to original indices
    kept_indices = tf.gather(topk_indices, final_indices)
    return kept_indices
Below is an example of an image the model has never seen before, with its predicted masks and categories overlaid:

Advice for Implementation from Scratch
- Which data goes where? It is very important to make sure that we feed the right data to the model: the data must match what each layer expects, and each layer must process its input so that the output is suitable for the next layer, because we ultimately calculate the loss function based on this data. While implementing SOLO, I realized that some of these goals are not as simple as they seem at first glance. I described this in the Dataset chapter.
- Research the paper. There is no way around reading the paper your model is based on. I know it is obvious, but despite the many references to other previous works and papers, you need to understand the principles. When you start researching a paper, you may be faced with a lot of other papers that you need to read and understand first, and this can be quite a challenging task. But usually, even the most up-to-date paper is based on a set of principles that have been known for some time and are not new. This means you can find a lot of material on the Internet that explains these principles very clearly. You can also use LLMs for this purpose: they can summarize the information, give examples, and help you understand some of the referenced works and papers.
- Start with small steps. This is trivial advice, but when implementing a computer vision model with millions of parameters, you don’t want to waste time on pointless training, dataset preparation, evaluation, etc. while you are still in the development stage and not sure the model will work correctly. Moreover, if you have a low-performance GPU, the process takes even longer. So don’t start with huge datasets, many parameters, and long stacks of layers. You can even let the model overfit in the first stage of development, with a small dataset and a small number of parameters, just to be sure that the data is correctly matched to the targets of the model.
- Debug your code. Debugging allows you to make sure that the code behaves as expected and that the data has the expected values at each step. I understand that everyone who has developed at least one software product knows this and doesn’t need the advice. But I would like to highlight it anyway, because when building models, writing the loss function, and preparing dataset inputs and targets, we interact with math operations and tensors a lot, and that requires more attention than the routine programming code we face every day and know how it works without debugging.
Conclusion
This is a brief description of the project without deep technical details, intended to give a general picture and avoid reading fatigue. Obviously, a description of a project dedicated to a computer vision model cannot fit into one article. If I see interest in the project from readers, I may write a more detailed analysis with technical details.