{"id":502,"date":"2024-07-28T21:19:34","date_gmt":"2024-07-28T21:19:34","guid":{"rendered":"https:\/\/summergeometry.org\/sgi2024\/?p=502"},"modified":"2024-07-31T01:32:02","modified_gmt":"2024-07-31T01:32:02","slug":"a-deeper-understanding-openais-clip-model","status":"publish","type":"post","link":"https:\/\/summergeometry.org\/sgi2024\/a-deeper-understanding-openais-clip-model\/","title":{"rendered":"A Deeper Understanding of OpenAI\u2019s CLIP Model"},"content":{"rendered":"\n<h2 class=\"wp-block-heading has-normal-font-size\"><strong>Author:<\/strong> <a href=\"https:\/\/krischebo.github.io\/\" data-type=\"link\" data-id=\"https:\/\/krischebo.github.io\/\">Krishna Chebolu<\/a><br><strong>Teammates:<\/strong> <a href=\"https:\/\/github.com\/Betty987?tab=overview&amp;from=2023-12-01&amp;to=2023-12-31\" data-type=\"link\" data-id=\"https:\/\/github.com\/Betty987?tab=overview&amp;from=2023-12-01&amp;to=2023-12-31\">Bethlehem Tassew<\/a> and <a href=\"https:\/\/www.linkedin.com\/in\/kimberly-herrera-2p357\/\" data-type=\"link\" data-id=\"https:\/\/www.linkedin.com\/in\/kimberly-herrera-2p357\/\">Kimberly Herrera<\/a><br><strong>Mentor:<\/strong> Dr. <a href=\"https:\/\/ankita-shukla.github.io\/\" data-type=\"link\" data-id=\"https:\/\/ankita-shukla.github.io\/\">Ankita Shukla<\/a><\/h2>\n\n\n\n<h2 class=\"wp-block-heading has-larger-font-size\">Introduction<\/h2>\n\n\n\n<p>For the past two weeks, our project team, mentored by Dr. Ankita Shukla, set out to understand the inner workings of OpenAI\u2019s CLIP model. Specifically, we were interested in gaining a mathematical understanding of the geometric and topological properties of its feature spaces.\u00a0<\/p>\n\n\n\n<p><a href=\"https:\/\/openai.com\/index\/clip\/\" data-type=\"link\" data-id=\"https:\/\/openai.com\/index\/clip\/\">OpenAI&#8217;s CLIP<\/a> (Contrastive Language-Image Pre-Training) is a versatile and powerful model designed to jointly understand text and images. 
CLIP is trained to connect text and images by learning from a large dataset of images paired with their corresponding textual descriptions. The model is trained using a <a href=\"https:\/\/encord.com\/blog\/guide-to-contrastive-learning\/#:~:text=Contrastive%20learning%20is%20an%20approach,instances%20should%20be%20farther%20apart.\" data-type=\"link\" data-id=\"https:\/\/encord.com\/blog\/guide-to-contrastive-learning\/#:~:text=Contrastive%20learning%20is%20an%20approach,instances%20should%20be%20farther%20apart.\">contrastive learning<\/a> approach, where it learns to predict which text snippet is associated with which image from a set of possible pairs. This allows CLIP to understand the relationship between textual and visual information.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"361\" src=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/CLIP-1024x361.png\" alt=\"\" class=\"wp-image-504\" srcset=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/CLIP-1024x361.png 1024w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/CLIP-300x106.png 300w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/CLIP-768x271.png 768w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/CLIP-1536x541.png 1536w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/CLIP-2048x722.png 2048w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/CLIP-1200x423.png 1200w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/CLIP-1980x698.png 1980w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1: OpenAI&#8217;s CLIP architecture as it appears in the paper. CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. 
OpenAI then uses this behavior to turn CLIP into a zero-shot classifier: all of a dataset\u2019s classes are converted into captions such as \u201ca photo of a dog\u201d, and the model predicts the class whose caption it estimates best pairs with a given image.<\/figcaption><\/figure>\n\n\n\n<p>CLIP uses two separate encoders: a text encoder (based on the Transformer architecture) and an image encoder (based on a convolutional neural network or a vision transformer). Both encoders produce embeddings in a shared <a href=\"https:\/\/www.baeldung.com\/cs\/dl-latent-space#:~:text=Formally%2C%20a%20latent%20space%20is,other%20in%20the%20latent%20space.\" data-type=\"link\" data-id=\"https:\/\/www.baeldung.com\/cs\/dl-latent-space#:~:text=Formally%2C%20a%20latent%20space%20is,other%20in%20the%20latent%20space.\">latent space<\/a> (also called a feature space). By aligning text and image embeddings in the same space, CLIP can perform tasks that require cross-modal understanding, such as image captioning, image classification with natural language labels, and more.<\/p>\n\n\n\n<p>CLIP is trained on a vast dataset containing 400 million image-text pairs collected online. This extensive training data allows it to generalize across various domains and tasks. One of CLIP\u2019s standout features is its ability to perform <a href=\"https:\/\/www.ibm.com\/topics\/zero-shot-learning#:~:text=Zero%2Dshot%20learning%20(ZSL),those%20categories%20or%20concepts%20beforehand.\" data-type=\"link\" data-id=\"https:\/\/www.ibm.com\/topics\/zero-shot-learning#:~:text=Zero%2Dshot%20learning%20(ZSL),those%20categories%20or%20concepts%20beforehand.\">zero-shot learning<\/a>. It can handle new tasks without requiring task-specific training data, simply by understanding the task description in natural language. 
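<\/p>

<p>As a minimal illustration of the zero-shot mechanism, classification reduces to cosine similarity in the shared latent space. The sketch below uses random stand-in vectors in place of real CLIP embeddings (the 512-dimensional size and the caption list are assumptions for illustration, not outputs of the actual model):<\/p>

```python
import numpy as np

# Hypothetical stand-in embeddings; in practice these would come from
# CLIP's text and image encoders, projected into the shared space.
rng = np.random.default_rng(0)
captions = ['a photo of a dog', 'a photo of a cat', 'a photo of a car']
text_emb = rng.normal(size=(3, 512))
# Fake an image embedding that lies near the 'dog' caption embedding.
image_emb = text_emb[0] + 0.1 * rng.normal(size=512)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Zero-shot prediction: pick the caption whose embedding has the
# highest cosine similarity with the image embedding.
sims = normalize(text_emb) @ normalize(image_emb)
predicted = captions[int(np.argmax(sims))]
print(predicted)  # a photo of a dog
```

<p>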
More information can be found in <a href=\"https:\/\/arxiv.org\/abs\/2103.00020\" data-type=\"link\" data-id=\"https:\/\/arxiv.org\/abs\/2103.00020\">OpenAI\u2019s paper<\/a>.<\/p>\n\n\n\n<p>In our attempts to understand the inner workings of the feature spaces, we employed tools such as UMAP, persistent homology, subspace angles, cosine similarity matrices, and Wasserstein distances.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-larger-font-size\">Our Study &#8211; Methodology and Results<\/h2>\n\n\n\n<p>Each of us started with a dataset of image-caption pairs. We classified images into various categories using their captions and embedded them using CLIP. Then we used <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\" data-type=\"link\" data-id=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> or <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.manifold.TSNE.html\" data-type=\"link\" data-id=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.manifold.TSNE.html\">t-SNE<\/a> plots to visualize the embeddings.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"810\" height=\"405\" src=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-181819.png\" alt=\"\" class=\"wp-image-507\" srcset=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-181819.png 810w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-181819-300x150.png 300w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-181819-768x384.png 768w\" sizes=\"auto, (max-width: 810px) 100vw, 810px\" \/><figcaption class=\"wp-element-caption\">Figure 2: A UMAP embedding of 1000 images from the <a href=\"https:\/\/www.kaggle.com\/datasets\/adityajn105\/flickr8k\" data-type=\"link\" 
data-id=\"https:\/\/www.kaggle.com\/datasets\/adityajn105\/flickr8k\">Flickr 8k dataset<\/a> from <a href=\"https:\/\/www.kaggle.com\/\" data-type=\"link\" data-id=\"https:\/\/www.kaggle.com\/\">Kaggle<\/a>, divided into five categories (animal, human, sport, nature, and vehicle). Here we can also observe that the images (colored) are embedded differently than their corresponding captions (gray). Although not shown here, the captions also cluster by category.<\/figcaption><\/figure>\n\n\n\n<p>After this preliminary visualization, we desired to delve deeper. We introduced noise, in the form of a Gaussian blur, to our images to test CLIP\u2019s robustness. We added the noise in increments (for example, mean = 0, standard deviation = {1, 2, 3, 4, 5}) and encoded the blurred images just as we did the original image-caption pairs. We then made <a href=\"https:\/\/towardsdatascience.com\/persistent-homology-with-examples-1974d4b9c3d0\" data-type=\"link\" data-id=\"https:\/\/towardsdatascience.com\/persistent-homology-with-examples-1974d4b9c3d0\">persistence diagrams<\/a> using <a href=\"https:\/\/ripser.scikit-tda.org\/en\/latest\/\" data-type=\"link\" data-id=\"https:\/\/ripser.scikit-tda.org\/en\/latest\/\">ripser<\/a>. We also followed the same procedure within the various categories to understand how noise impacts not only the overall space but also the categories\u2019 respective subspaces. 
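<\/p>

<p>The blur-and-measure step can be sketched as follows. In the project we passed the CLIP embeddings to ripser; the dependency-light sketch below computes only the zero-dimensional part by hand, using the fact that the H0 death times of a Vietoris-Rips filtration are exactly the edge lengths of a minimum spanning tree. The image, its size, and the point cloud are synthetic stand-ins:<\/p>

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)

# Blur step: each image is smoothed with a Gaussian kernel whose
# standard deviation is increased in increments (here sigma = 2).
image = rng.random((32, 32))          # stand-in grayscale image
blurred = gaussian_filter(image, sigma=2.0)

# Stand-in point cloud in place of the CLIP embeddings of the images.
points = rng.normal(size=(50, 8))

# H0 persistence of the Vietoris-Rips filtration: every point is born
# at 0, and the death times are exactly the edge lengths of the
# minimum spanning tree of the pairwise distance matrix.
mst = minimum_spanning_tree(squareform(pdist(points)))
deaths = np.sort(mst.data)            # 49 finite deaths for 50 points
print(len(deaths))  # 49
```

<p>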
These diagrams for the five categories from the Flickr 8k dataset can be found in this <a href=\"https:\/\/colab.research.google.com\/drive\/1rD0c41rmpRDPIbt92HELQzhTCdqNuVxm?usp=sharing\" data-type=\"link\" data-id=\"https:\/\/colab.research.google.com\/drive\/1rD0c41rmpRDPIbt92HELQzhTCdqNuVxm?usp=sharing\">Google Colab notebook<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"178\" src=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/pd_1000-1024x178.png\" alt=\"\" class=\"wp-image-510\" srcset=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/pd_1000-1024x178.png 1024w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/pd_1000-300x52.png 300w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/pd_1000-768x133.png 768w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/pd_1000-1536x266.png 1536w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/pd_1000-1200x208.png 1200w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/pd_1000-1980x343.png 1980w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/pd_1000.png 1990w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Persistence diagrams for the same 1000 images from the Flickr 8k dataset under increasing noise. Visually, no significant difference can be observed. The standard deviation of the Gaussian blur increases from left to right.<\/figcaption><\/figure>\n\n\n\n<p>Visually, you can observe that there is no significant difference between the diagrams, which attests to CLIP\u2019s robustness. However, visual assurance is not enough. 
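<\/p>

<p>One way to quantify the difference between two persistence diagrams is a Wasserstein distance. A minimal sketch, assuming each diagram is summarized by its one-dimensional distribution of death times (the samples below are synthetic stand-ins, not our actual diagrams):<\/p>

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)

# Hypothetical death-time samples from two persistence diagrams,
# e.g. original images vs. images blurred with std. dev. = 1.
deaths_original = rng.normal(loc=1.0, scale=0.1, size=100)
deaths_blurred = rng.normal(loc=1.05, scale=0.1, size=100)

# SciPy computes the 1-D Wasserstein distance between the two
# empirical distributions; a small value means similar diagrams.
d = wasserstein_distance(deaths_original, deaths_blurred)
print(round(d, 3))
```

<p>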
Thus, we used <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.wasserstein_distance.html\" data-type=\"link\" data-id=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.wasserstein_distance.html\">SciPy\u2019s<\/a> <a href=\"https:\/\/en.wikipedia.org\/wiki\/Wasserstein_metric\" data-type=\"link\" data-id=\"https:\/\/en.wikipedia.org\/wiki\/Wasserstein_metric\">Wasserstein distance<\/a> calculation to quantify how different each persistence diagram is from the others. Continuing with the same Flickr 8k dataset, we obtain the per-category values shown in Figure 4.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"241\" height=\"475\" src=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-182048.png\" alt=\"\" class=\"wp-image-511\" srcset=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-182048.png 241w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-182048-152x300.png 152w\" sizes=\"auto, (max-width: 241px) 100vw, 241px\" \/><figcaption class=\"wp-element-caption\">Figure 4: Wasserstein distances for each category. The distance between the original images and those blurred with std. dev. = 1 is smaller than the distance to those blurred with std. dev. = 2, which in turn is smaller than the distance to those blurred with std. dev. = 3, and so on: the diagrams drift further from the originals as the blur strengthens. This property holds for all five categories.<\/figcaption><\/figure>\n\n\n\n<p>Another question to understand is <em>how similar each of the categories is to the others<\/em>. This question can be answered by calculating subspace angles. 
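<\/p>

<p>A sketch of that comparison with scipy.linalg.subspace_angles is below; the category matrices are synthetic stand-ins (random features with hypothetical sizes), constructed so that two categories are nearly aligned:<\/p>

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(3)

# Stand-in category embeddings: columns span each category's subspace.
animal = rng.normal(size=(64, 20))
human = 0.9 * animal + 0.1 * rng.normal(size=(64, 20))  # nearly aligned
nature = rng.normal(size=(64, 20))                       # unrelated

def max_angle_deg(a, b):
    # Largest principal angle between the two column spaces, in degrees.
    return float(np.degrees(subspace_angles(a, b).max()))

# Similar categories should subtend a smaller angle than unrelated ones.
print(max_angle_deg(animal, human) < max_angle_deg(animal, nature))  # True
```

<p>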
After embedding, each category can be seen as occupying its own subspace of the feature space, often far from the subspaces of the other categories. We want to quantify how far apart these subspaces are, so we use <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.linalg.subspace_angles.html\" data-type=\"link\" data-id=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.linalg.subspace_angles.html\">subspace angles<\/a>. Results for the Flickr 8k dataset example are shown in Figure 5.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"415\" height=\"162\" src=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-182100.png\" alt=\"\" class=\"wp-image-512\" srcset=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-182100.png 415w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-27-182100-300x117.png 300w\" sizes=\"auto, (max-width: 415px) 100vw, 415px\" \/><figcaption class=\"wp-element-caption\">Figure 5: Subspace angles of each category pair in the five categories introduced earlier in the post. All values are displayed in degrees. The angle between the human and animal categories is ~0.92\u00b0, while the angle between human and nature is ~2.3\u00b0, which makes sense: humans are more similar to animals than to nature. It is worth noting that five categories oversimplify the dataset, as they do not capture the nuances of the captions. More, or more descriptive, categories would make the subspace angles more meaningful.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading has-larger-font-size\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>At the start, our team members were novices in the CLIP model, but we ended as <em>lesser<\/em> novices. Through the two weeks, Dr. 
Shukla supported us and enabled us to understand the inner workings of the CLIP model. It is certainly thrilling to observe how AI around us is constantly evolving, but at the heart of it is mathematics governing the change. We are excited to possibly explore further and perform more analyses.&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We attempt to understand the inner workings of OpenAI&#8217;s revolutionary CLIP model using tools from geometry and topology.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[40,37],"tags":[],"ppma_author":[25],"class_list":["post-502","post","type-post","status-publish","format-standard","hentry","category-math","category-research"],"authors":[{"term_id":25,"user_id":0,"is_guest":1,"slug":"cap-kris-chebo","display_name":"kris.chebo","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","first_name":"","last_name":"","user_url":"","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/posts\/502","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/comments?post=502"}],"version-history":[{"count":7,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/posts\/502\/revisions"}],"predecessor-version":[{"id":644,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/posts\/502\/revisions\/644"}],"wp:attachment":[{"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/media?parent=502"}],"wp:term":[{"taxonomy":"
category","embeddable":true,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/categories?post=502"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/tags?post=502"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/ppma_author?post=502"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}