
How to connect Text and Images

by Rabeya Tus Sadia, Becoming Human: Artificial Intelligence Magazine (Medium)


Part 2: Understanding Zero-Shot Learning with the CLIP model

Photo by Lenin Estrada on Unsplash

It has been a little over a year since OpenAI first released the CLIP model, which established this way of connecting images with caption text. This enormous model was trained on 400 million (!) image-caption pairs collected from the internet.

CLIP's objective is to learn how to classify images without any explicit labels. In this post we'll build an understanding of how zero-shot learning works with CLIP, with hands-on examples at the end.

Intuition

Just like traditional supervised models, CLIP has two stages: the training stage (learning) and the inference stage (making predictions). I encourage you to read blog posts specifically about CLIP and how it’s trained/used or, better yet — the paper.

In short, in the training stage, CLIP learns about images by “reading” auxiliary text (i.e. sentences) corresponding to each image as in the example below.

Example of a candidate input to the CLIP architecture. Photo by The Lucky Neko on Unsplash

Even if you’ve never seen a cat, you should be able to read this text and figure out that the three things in the picture are “cats.” If you saw enough cat images with captions that said “cat,” you might get really good at figuring out if there are cats in a picture or not.
In the same way, by looking at 400 million pairs of images and texts about all kinds of objects, the model can figure out how certain phrases and words match up with certain patterns in the images. Once it has learned these associations, it can apply them to classification tasks it was never explicitly trained on. But hold on a minute.

You might be wondering: isn't this "auxiliary text" basically a label, so this isn't the "label-free learning" I promised at the beginning?
The captions act as a form of supervision, but they are not labels! With this auxiliary text, we can use unstructured, information-rich data without having to manually distill it into a single label (e.g., "These are my three cute cats…" into "cats").
Creating labels takes time and throws away information that could be useful. CLIP's approach gets around this bottleneck and gives the model as much information as possible.

Diving Deeper into the CLIP model with Zero-Shot Training

How exactly is the model able to learn from these auxiliary texts?

As suggested by the architecture’s name, CLIP uses a technique called contrastive learning in order to understand the relationship between image and text pairings.

Summary of the CLIP approach. Figure from the CLIP paper

In essence, CLIP aims to minimize the difference between the encoding of an image and the encoding of its corresponding text, while pushing apart the encodings of non-matching image-text pairs. In other words, the model should learn to make the encodings of an image and of its corresponding text as similar as possible.

Let’s break down this idea a bit more.

What are encodings? Encodings are just lower-dimensional representations of data (the green and purple boxes in the figure above). Ideally, an image's or text's encoding should capture the most important and distinguishing information about that image or text.
For example, all images of cats should have similar encodings, because they all contain cats, while images of dogs should have noticeably different encodings.
In this ideal world, where the encodings of similar objects are close together and the encodings of different objects are far apart, classifying images is easy. If we give the model an image whose encoding is similar to the other "cat" encodings it has seen, the model can say the image is of a cat.
So the key to classifying images well is learning to encode them well. In fact, this is the whole point of CLIP (and most of deep learning)! We start with bad encodings (random encodings for each image), and we want the model to learn the best encodings (i.e. cat images end up with similar encodings).
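To make this concrete, here is a minimal sketch of CLIP's contrastive objective in PyTorch. This is not the authors' implementation: the encoders are assumed to have already produced a batch of matching image and text features, and the temperature is fixed here even though the real model learns it.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    image_features, text_features: (batch, dim) tensors where row i of
    each tensor comes from the same image-caption pair.
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) matrix: logits[i, j] = similarity of image i and text j
    logits = image_features @ text_features.t() / temperature

    # Matching pairs sit on the diagonal, so the "correct class" for row i is i
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_images = F.cross_entropy(logits, targets)      # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_images + loss_texts) / 2

# Toy usage with random features standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```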

Hands-on example of CLIP (zero-shot image classifier)

To use the CLIP model as a zero-shot classifier, all you need to do is define a list of possible classes (or descriptions), and CLIP will predict which class a given image most likely belongs to, based on its prior knowledge. Think of it as asking the model "which of these captions best matches this image?"
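Here is what that question looks like in code: a minimal zero-shot prediction for a single image using the openai/clip package (installed in the setup step below). The image filename and captions are just placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path and candidate captions
image = preprocess(Image.open("my_flower.jpg")).unsqueeze(0).to(device)
captions = ["picture of a dandelion flower", "picture of a daisy flower"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # CLIP scores the image against every caption
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for caption, p in zip(captions, probs[0]):
    print(f"{caption}: {p:.2%}")
```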

In this post, we will walk through a demonstration of how to test CLIP's performance on your own image datasets, using a public flower classification dataset. The code is available as a Colab notebook.

First, download and install all the CLIP dependencies.

To try CLIP out on your own data, make a copy of the notebook in your drive and make sure that under Runtime, the GPU is selected (Google Colab will give you a free GPU for use). Then, we make a few installs along with cloning the CLIP Repo.
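The setup cell looks roughly like this, following the openai/CLIP README (the exact cell in the notebook may differ slightly):

```python
# Colab cell: install CLIP's dependencies and the CLIP package itself
!pip install ftfy regex tqdm
!pip install git+https://github.com/openai/CLIP.git

# Clone the repo as well if you want to browse the source
!git clone https://github.com/openai/CLIP.git
```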

Then download the classification dataset.

Here, the test set stores the images for each class we want to evaluate in its own folder, and the candidate captions are passed to CLIP via the _tokenization.txt file.
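Concretely, the class names come from the alphabetically sorted folder names, and _tokenization.txt holds one candidate caption per class in that same order. A small sketch, with an illustrative dataset path:

```python
import os

test_dir = "flowers/test"  # illustrative path; adjust to your download location

# Class names come from the folder names, sorted alphabetically
class_names = sorted(
    d for d in os.listdir(test_dir)
    if os.path.isdir(os.path.join(test_dir, d))
)

# One caption per line, in the same alphabetical order as class_names
with open("_tokenization.txt") as f:
    candidate_captions = [line.strip() for line in f if line.strip()]

print(class_names)         # e.g. ['daisy', 'dandelion']
print(candidate_captions)  # e.g. ['picture of a daisy flower', ...]
```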

In this code section, you can see some auto-generated captions for the images we want to classify. This is where you can apply your own prompt engineering: try different captions and use your intuition to find the phrasing that helps CLIP identify the images best, as in the snippet below.
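For example, one simple way to experiment with prompts is to generate the captions from a template rather than hand-writing _tokenization.txt. The template here is just one of the variants tried later in this post.

```python
import clip

# Class names as derived from the folder structure above
class_names = ["daisy", "dandelion"]

# Generate one caption per class from a template
template = "picture of a {} flower"
candidate_captions = [template.format(name) for name in class_names]

# CLIP expects tokenized text; clip.tokenize handles a list of strings
text_tokens = clip.tokenize(candidate_captions)
print(candidate_captions)
```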

The final step is to pass your test images through a prediction step.

CLIP takes an image and a list of possible class captions as inputs. You can define the class captions however you see fit in the _tokenization.txt file; just make sure they stay in the same order as the alphabetically sorted class_names (defined by the folder structure).

This is the main inference loop. We iterate over the images in our test folder, send each image to the network along with our tokenized captions, see which caption CLIP matches to each image, and finally check whether those predictions line up with the ground truth. A sketch of this loop is shown below.
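Here is roughly what that loop looks like with the openai/clip package; the dataset path and caption template are illustrative, and the notebook's actual code may differ.

```python
import os
from collections import defaultdict

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

test_dir = "flowers/test"  # illustrative path
class_names = sorted(
    d for d in os.listdir(test_dir) if os.path.isdir(os.path.join(test_dir, d))
)
captions = [f"picture of a {c} flower" for c in class_names]  # same order as class_names
text_tokens = clip.tokenize(captions).to(device)

correct, total = defaultdict(int), defaultdict(int)

with torch.no_grad():
    # Encode the candidate captions once and normalize them
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    for true_idx, class_name in enumerate(class_names):
        class_dir = os.path.join(test_dir, class_name)
        for fname in os.listdir(class_dir):
            image = preprocess(Image.open(os.path.join(class_dir, fname)))
            image_features = model.encode_image(image.unsqueeze(0).to(device))
            image_features /= image_features.norm(dim=-1, keepdim=True)

            # Pick the caption whose encoding is most similar to the image's
            similarity = (image_features @ text_features.T).squeeze(0)
            pred_idx = similarity.argmax().item()

            total[class_name] += 1
            correct[class_name] += int(pred_idx == true_idx)

for class_name in class_names:
    acc = correct[class_name] / max(total[class_name], 1)
    print(f"{class_name}: {acc:.1%} ({correct[class_name]}/{total[class_name]})")
```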

Then we compute some metrics. You can see that we got higher accuracy for dandelion than for daisy. When you use CLIP for your own classification task, it is worth experimenting with different class captions for your classification ontology; remember that CLIP was trained to match images with natural-language captions.

On the flowers dataset, we tried the following ontologies and saw these results:

- “dandelion” vs “daisy” → 46% accuracy (worse than guessing)
- “dandelion flower” vs “daisy flower” → 64% accuracy
- “picture of a dandelion flower” vs “picture of a daisy flower” → 97% accuracy

These results show the importance of providing the right class descriptions to CLIP and express the richness of the pretraining procedure, a feature that is altogether lost in traditional binary classification. OpenAI calls this process “prompt engineering”.

For more on CLIP research, consider reading the paper and checking out OpenAI’s blog post.

This is all for today.

Stay happy and happy learning!

References:

- https://sh-tsang.medium.com/review-dall-e-zero-shot-text-to-image-generation-f9de7a383374
- https://towardsdatascience.com/understanding-contrastive-learning-d5b19fd96607
- https://medium.com/mlearning-ai/having-fun-with-clip-features-part-i-29dff92bbbcd
- https://roboflow.com/


