{"id":832,"date":"2024-08-01T19:39:31","date_gmt":"2024-08-01T19:39:31","guid":{"rendered":"https:\/\/summergeometry.org\/sgi2024\/?p=832"},"modified":"2024-08-01T19:42:04","modified_gmt":"2024-08-01T19:42:04","slug":"topology-of-feature-spaces","status":"publish","type":"post","link":"https:\/\/summergeometry.org\/sgi2024\/topology-of-feature-spaces\/","title":{"rendered":"Topology of feature spaces"},"content":{"rendered":"\n<p class=\"has-small-font-size wp-block-paragraph\"><strong>CLIP<\/strong> (Contrastive Language-Image Pre-training) is a model designed to determine which image matches which piece of text in a group of images and texts.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">1. <strong>How it works:<\/strong><\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 Embedding Space: Think of this as a shared space where both images and text are represented as vectors of numbers.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 Encoders: CLIP has two parts that do this transformation:<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2013 Image Encoder: This part looks at images and converts them into a set of numbers (called embeddings).<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2013 Text Encoder: This part reads text and also converts it into a set of numbers (embeddings).<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">2. 
<strong>Training Process:<\/strong><\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 Batch: Imagine you have a bunch of images and their corresponding texts in a group (batch).<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 Real Pairs: Within this group, some images and texts actually match (like an image of a cat and the word \u201ccat\u201d).<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 Fake Pairs: There are many more possible combinations that don\u2019t match (like an image of a cat and the word \u201cdog\u201d).<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">3. <strong>Cosine Similarity:<\/strong> This measures how closely two embeddings point in the same direction, on a scale from -1 to 1. Higher similarity means they are more alike.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">4. <strong>CLIP\u2019s Goal: <\/strong>CLIP tries to make the embeddings of matching images and text (real pairs) as similar as possible. At the same time, it tries to make the embeddings of non-matching pairs (fake pairs) as different as possible.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">5. 
<strong>Optimization:<\/strong><\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 Loss Function: This is a mathematical way to measure how good or bad the current matchings are.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 Symmetric Cross-Entropy Loss: CLIP uses a specific type of loss function that looks at the similarities of both real and fake pairs and adjusts the embeddings to improve the matchings.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">In essence, CLIP learns to accurately match images and texts by continuously improving how it transforms them into numbers so that correct matches are close together and incorrect ones are far apart.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">After learning about CLIP, I chose my data set and got to work:<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">The <strong>Describable Textures Dataset (DTD)<\/strong> is an evolving collection of textural images in the wild, annotated with a series of human-centric attributes, inspired by the perceptual properties of textures.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">The package contains:<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">1. Dataset images, split into train, validation, and test sets.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">2. Ground truth annotations and splits used for evaluation.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">3. 
imdb.mat file, containing a struct holding file names and ground truth labels.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">Example images:<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.34%\">\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXeRebmrLv4FN5bCFLWZDArn13kEHQA6pkJlt2caWv_mxM5sgYh0hkJtrw3h0HU_aReGWaBTMUHpnU0OFi0KAVpKBFIfmNV5XLn-suR-8QhuDpWhpLmK9oYt-2t55No_mUu2g46ZM-kNq1T-3l4awZfD6Qj6?key=QyykNDgaJ6L6J8HKWRdjwQ\" alt=\"\" \/><\/figure>\n<\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\"><div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" src=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/pitted_0025-edited.jpg\" alt=\"\" class=\"wp-image-835\" style=\"width:219px;height:auto\" srcset=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/pitted_0025-edited.jpg 640w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/pitted_0025-edited-300x225.jpg 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/figure>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\"><div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"639\" height=\"503\" src=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/banded_0004.jpg\" alt=\"\" 
class=\"wp-image-834\" style=\"width:208px;height:auto\" srcset=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/banded_0004.jpg 639w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/banded_0004-300x236.jpg 300w\" sizes=\"auto, (max-width: 639px) 100vw, 639px\" \/><\/figure>\n<\/div><\/div>\n<\/div>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">There are 47 texture classes, with 120 images each for a total of 5640 images in this data set. The above shows \u2018cobwebbed\u2019,  \u2018pitted,\u2019 and \u2018banded\u2019. I did the <strong>t-SNE<\/strong> visualization by class for all the classes but realized this wasn\u2019t very helpful for analysis. It was the same for <strong>UMAP<\/strong>. So I decided to sample 15 classes and then visualize:<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdsEefSF-LHKh-jdIncnU7S5OLr77A6uLXKloxe7yhBUx5YHK5dK7DbvXwj-oRc9umBBMMB5ofMP5Rl7i3JE0A5BGcC25McXLmsyUh9dLnzGIMwqFcpDdtA25RdpFXt6xJCpWcgKrAIu0kHojfrC9HAVPg5?key=QyykNDgaJ6L6J8HKWRdjwQ\" alt=\"\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:50%\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"977\" height=\"682\" src=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/better_UMAP.png\" alt=\"\" class=\"wp-image-836\" srcset=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/better_UMAP.png 977w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/better_UMAP-300x209.png 300w, 
https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/better_UMAP-768x536.png 768w\" sizes=\"auto, (max-width: 977px) 100vw, 977px\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">In the t-SNE for 15 classes, we see that \u2018polka-dotted\u2019 and \u2018dotted\u2019 are clustered together. This intuitively makes sense. To further our analysis, we computed the subspace angles between the classes. Many pairs of categories have an angle of 0.0, meaning the subspaces spanned by their feature vectors overlap or are nearly aligned in the feature space. This suggests that these textures are highly similar or share similar feature representations. For instance:<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 crosshatched and striped<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 dotted and grid<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">\u2022 dotted and polka-dotted<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\"><strong>Then came the TDA (topological data analysis)<\/strong>: A persistence diagram summarizes the topological features of a data set across multiple scales. It captures the birth and death of features such as connected components, holes, and voids as the scale of observation changes. In a persistence diagram, each feature is represented as a point in a two-dimensional space, where the x-coordinate corresponds to the &#8220;birth&#8221; scale and the y-coordinate corresponds to the &#8220;death&#8221; scale. Points far from the diagonal are features that persist across many scales, while points near the diagonal are short-lived and usually treated as noise. This visualization helps in understanding the shape and structure of the data.<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">I added three levels of noise (0.5, 1.0, 2.0) to the images and then extracted features. I visualized these features on a persistence diagram. Here are some examples of those results. 
For H_0, every noise level shows one persistent feature, indicating a single connected component. The death time of this persistent feature varies slightly. For H_1, no noise level shows any highly persistent features; most points lie near the diagonal. The features in H_1 tend to \u201cclump together\u201d and die sooner as the noise level goes up.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXelH6md5rbN7QwzKZd0InsJTGwEcGZATDFOLD1w2L8XpouihrPsuxKzDyOaEBk2CRREd2j0toFzu78MGkcdZns0AHghS1QcrW02ja5tHdf2siLjN-G0JIAJIjl3QJEXtMajJzvH7vSd7AeKjPo-mYVQIfvI?key=QyykNDgaJ6L6J8HKWRdjwQ\" alt=\"\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXf-IVmO4kNeo25hUh83TCYbhcCAwHjiM6baE0cTdlMrge0Zq2mJBIm-ybrSMGSfQrm5s1hAKFPRRZhkq7XQ8PSiNBZzHgIcqAqOiV1mWznH3fhOwtNqTigOI4GIrtHFGc1owWaD8g19FThjzqAGTNSxCReb?key=QyykNDgaJ6L6J8HKWRdjwQ\" alt=\"\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdzWV1esMxo1UEhlc_RqlkxBbGwqmb4FFtF5axnQYz6DM7IalCpHEUzeLKzbZAwu1zWb-_aC21wwSSpi3i7xZqNYw3KIteJwsW8pGJD4uArQWy1sEjsLQ42UoB7LX6C6OlqQqGTy_ERf-OrKJWjxSKaObc?key=QyykNDgaJ6L6J8HKWRdjwQ\" alt=\"\" \/><\/figure>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">I then computed the <strong>distances between the diagrams<\/strong> with no noise and those with noise. Here are some of those results. 
Unsurprisingly, with greater levels of noise, there is greater distance.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"214\" src=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/Screenshot-2024-08-01-123828-1024x214.png\" alt=\"\" class=\"wp-image-837\" style=\"width:621px;height:auto\" srcset=\"https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/Screenshot-2024-08-01-123828-1024x214.png 1024w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/Screenshot-2024-08-01-123828-300x63.png 300w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/Screenshot-2024-08-01-123828-768x160.png 768w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/Screenshot-2024-08-01-123828-1200x251.png 1200w, https:\/\/summergeometry.org\/sgi2024\/wp-content\/uploads\/2024\/08\/Screenshot-2024-08-01-123828.png 1388w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">Finally, we wanted to <strong>test the robustness of CLIP <\/strong>so we classified images with respect to noise. The goal was to see if the results we saw with respect to the topology of the feature space corresponded to the <strong>classification<\/strong> results. 
These were the classification accuracies:<\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\"> <img loading=\"lazy\" decoding=\"async\" width=\"473\" height=\"270\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXfYfMSwfwV-0GyQX1oqN0If4Dr8m48s89d8U3LvFrrBCN3GoMaxEVi2_swCKXG64QQVu0WIFtu6yVwLJrFnds2ptCrhNPMes_FWUORwXZ0u7S1-oLGyB-obYzkvPPNkIoW0YItDH89Ctn1wYksUccDCYVQ?key=QyykNDgaJ6L6J8HKWRdjwQ\"><\/p>\n\n\n\n<p class=\"has-small-font-size wp-block-paragraph\">We hope to discuss our results further!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>CLIP is a system designed to determine which image matches which piece of text in a group of images and texts. 1. How it works: \u2022 Embedding Space: Think of this as a special place where both images and text are transformed into numbers. \u2022 Encoders: CLIP has two parts that do this transformation: \u2013 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"ppma_author":[18],"class_list":["post-832","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"authors":[{"term_id":18,"user_id":0,"is_guest":1,"slug":"cap-kimberlyherrera","display_name":"kimberlyherrera","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","first_name":"","last_name":"","user_url":"","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/posts\/832","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/summergeometry.org\/sgi
2024\/wp-json\/wp\/v2\/comments?post=832"}],"version-history":[{"count":3,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/posts\/832\/revisions"}],"predecessor-version":[{"id":841,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/posts\/832\/revisions\/841"}],"wp:attachment":[{"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/media?parent=832"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/categories?post=832"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/tags?post=832"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/summergeometry.org\/sgi2024\/wp-json\/wp\/v2\/ppma_author?post=832"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}