CLIP Model Overview: Unlocking the Power of Multimodal AI

There is a lot of hype today about LLMs. Engineers often compare and praise recent revolutionary models like ChatGPT, Llama, Gemini, and Mistral, which indeed deserve the attention they get for their powerful capabilities. At the same time, many other impactful models that have brought a lot of success to the machine learning industry tend to go unmentioned.

In this article, I would like to talk about one of the most iconic models developed by OpenAI: CLIP. Released in 2021, CLIP can be used in various settings for NLP or computer vision projects and produces state-of-the-art results on different tasks. While many engineers think of CLIP as just an embedding model (which it is), its range of applications is much wider.

In this article, we will cover in detail the CLIP model, including its architecture and training process, performance, and applications.

Contrastive learning

Before discussing the CLIP architecture, let us understand the meaning behind contrastive learning, which plays an integral role in the CLIP design.

Contrastive learning is a self-supervised learning method whose objective consists of teaching an embedding model to produce embeddings such that similar samples are brought closer in the space and dissimilar ones are pushed further away.

Contrastive learning framework. The objective consists of bringing objects of the same class (1 and 2) closer to each other in the embedding space, while pushing them further away from object 3, which belongs to a different class.

Simply speaking, in contrastive learning the model works with pairs of objects. During a forward pass, the model does not know whether the two objects are actually similar: it simply predicts their similarity from the calculated embeddings. The loss function is then computed using the true relationship between the objects, and there are two cases:

  • The initial objects were similar. The loss drives a weight update that adjusts the embeddings so that their similarity is higher next time.
  • The initial objects were dissimilar. In this case, the model updates its weights so that the similarity between this pair of embeddings is lower next time (a minimal sketch of such a loss follows this list).
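
To make this concrete, here is a minimal sketch of a pairwise contrastive loss in PyTorch. It illustrates the general framework only; the cosine distance and the margin value are my own illustrative choices, not CLIP's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, is_similar, margin=0.5):
    """emb_a, emb_b: (batch, dim) embeddings; is_similar: (batch,) of 1s (similar) and 0s (dissimilar)."""
    # Cosine distance in [0, 2]: small for aligned embeddings, large for opposite ones.
    distance = 1 - F.cosine_similarity(emb_a, emb_b, dim=-1)
    # Similar pairs: pull the embeddings together by minimizing the distance.
    loss_similar = is_similar * distance.pow(2)
    # Dissimilar pairs: push the embeddings apart until the distance exceeds the margin.
    loss_dissimilar = (1 - is_similar) * F.relu(margin - distance).pow(2)
    return (loss_similar + loss_dissimilar).mean()

# Toy usage with random embeddings:
a, b = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(a, b, labels))
```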

Architecture & Training

The CLIP developers collected a huge dataset of 400M (image, text) pairs, where every image comes with a textual description.

The goal was to construct meaningful embedding representations such that the similarity between them measures how well a given text description matches an image. For that, the authors took two already existing model architectures:

  • Text embedding model
  • Image embedding model

The initial 400M pairs of images and texts were split into batches. Every image and text in each batch was passed through the corresponding image or text embedding model. As a result, for a batch of n pairs, n image embeddings and n text embeddings were produced.

After that, a pairwise cosine similarity matrix is constructed between the image and text embeddings.

Every element on the main diagonal of the pairwise matrix represents the similarity between an image and the text that were coupled together in the batch from the beginning. Since the text description corresponds well to the image, the similarities on the main diagonal should be maximized.

On the other hand, elements off the diagonal were not coupled together and come from different pairs. Therefore, their similarities should be minimized.

CLIP workflow diagram. Source: Learning Transferable Visual Models From Natural Language Supervision. Image adapted by the author.

The calculated similarities are then passed to a cross-entropy loss, computed symmetrically over the rows (image-to-text) and the columns (text-to-image) of the matrix, and the resulting loss is used to update the weights of both embedding models.
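
Putting the training step together, here is a sketch of this objective in PyTorch, closely following the pseudocode from the CLIP paper. The embeddings are assumed to come from the two encoders described above; in the real model the temperature is a learned parameter rather than a constant.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    """image_embeddings, text_embeddings: (n, dim) tensors for the n (image, text) pairs of a batch."""
    # Normalize so that dot products become cosine similarities.
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # (n, n) pairwise cosine similarity matrix, scaled by the temperature.
    logits = image_embeddings @ text_embeddings.t() / temperature

    # The i-th image is paired with the i-th text, so the targets are the diagonal indices.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: over rows (image -> text) and over columns (text -> image).
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```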

Details

The main parameters of CLIP are the embedding models used to encode texts and images:

  • Text is encoded with a Transformer-based model (the paper follows a GPT-2-style architecture rather than BERT).
  • Images are encoded either by a traditional convolutional network (ResNet) or by a Vision Transformer (ViT).

Both encoders were trained from scratch and, in the base configuration, produce embeddings of size 512. Given the large dataset size (400M pairs), ViT is usually preferred over ResNet.
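
As a quick illustration, the pretrained encoders can be loaded with OpenAI's open-source clip package (installable from the github.com/openai/CLIP repository). The snippet below is a minimal sketch assuming that package is available.

```python
import clip
import torch

# List the available backbones (ResNet and ViT variants).
print(clip.available_models())  # e.g. ['RN50', ..., 'ViT-B/32', 'ViT-B/16', 'ViT-L/14']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# The base ViT-B/32 checkpoint projects both modalities into a 512-dimensional space.
tokens = clip.tokenize(["a photo of a dog"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 512])
```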

Advantages

There are several strong sides of CLIP worth noting:

  • CLIP can be used for various tasks, not just for embedding generation (examples are in the following section).
  • Zero-shot CLIP performance is comparable to a simple supervised baseline: a linear classifier trained on top of ResNet-50 features.
  • Computational efficiency: embeddings for all images and texts in a batch can be computed independently and in parallel, and the pairwise similarities reduce to a single matrix multiplication.

Applications

Embeddings

The most obvious CLIP application is computing text and image embeddings. These embeddings can be used separately for text or image tasks, for example in similarity search pipelines or RAG systems.

Additionally, both texts and images can be used together if there is a need to associate an image with its corresponding text description.
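
A minimal sketch of such a cross-modal matching setup with the same clip package is shown below; the image path and the candidate captions are placeholders.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: one image and a few candidate captions.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
captions = ["a dog playing in the snow", "a bowl of fruit", "a city street at night"]
tokens = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(tokens)

# Normalize and rank the captions by cosine similarity to the image.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarities = (image_emb @ text_emb.t()).squeeze(0)

best = similarities.argmax().item()
print(f"best caption: {captions[best]!r} ({similarities[best].item():.3f})")
```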

Image classification

Apart from the generation of image and text embeddings, one of the strongest sides of CLIP is its ability to solve other tasks in a zero-shot learning style.

For example, let’s take an image classification task. If we are given an image of an animal and need to identify its class from a list of animal names, we can embed every class name. Then, by finding the text embedding most similar to the given image embedding, we can directly identify the animal class.

CLIP can estimate the similarity between an image and class labels in order to classify the image.

Speaking of this recognition method, the CLIP authors showed that it is better to embed every class name using a prompt template such as “a photo of a {class name}”. For other task types, the best prompt might differ.
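
Below is a sketch of this zero-shot classification recipe, again with the clip package; the class list, the prompt template, and the image path are placeholders chosen for illustration.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One prompt per candidate class, following the "a photo of a {class name}" template.
classes = ["cat", "dog", "horse", "elephant", "zebra"]
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("animal.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # The model returns scaled image-text similarity scores for both directions.
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

for name, prob in zip(classes, probs.tolist()):
    print(f"{name}: {prob:.3f}")
```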

OCR

OCR stands for optical character recognition and simply means recognizing text in images. OCR tasks are usually solved by specially trained supervised models. Nevertheless, CLIP’s impressive capabilities also allow it to identify text in images in a zero-shot way.

If there is a list of all possible texts that can appear in an image, then, in a similar manner as in the previous case, we can encode all of the options and choose the most similar one. However, the number of possible words or texts is usually much larger than the typical number of labels in an image classification task, so encoding all of them would be slow and inefficient. That is why CLIP is rarely used for OCR tasks with long text sequences.

In terms of OCR, CLIP works much better on short words, and even better on individual symbols. For example, it is easy to set up a digit recognition task with CLIP, since there are only 10 classes (each class represents a digit between 0 and 9).

One interesting observation is that zero-shot CLIP achieves only 88% accuracy on the famous MNIST handwritten digit recognition task, while other simple models easily reach 99% accuracy. It is necessary to keep in mind that, despite CLIP’s impressive zero-shot capabilities, there can still exist very specific image types that are poorly represented in its training data.

CLIP only achieves 88% accuracy on recognition of handwritten digits. Source: MNIST dataset | TensorFlow

Here are some important notes:

CLIP is not good for some abstract tasks, like counting objects in a photo, estimating how close two objects are to each other in the image, etc.

CLIP’s zero-shot performance on standard computer vision tasks is comparable to that of older supervised models (e.g., a ResNet trained on ImageNet). Nevertheless, the authors estimate that for zero-shot CLIP to reach overall state-of-the-art performance, its training compute would have to grow by roughly a factor of 1000, which is infeasible with current hardware.

Conclusion

In this article, we have studied the architectural principles of CLIP. Trained on 400M (image, text) pairs, CLIP reached state-of-the-art performance on many tasks. While CLIP struggles with some abstract downstream tasks, it still performs impressively on many standard computer vision tasks in a zero-shot setting.

Resources

All images unless otherwise noted are by the author.
