Yasser Benigmin - Domain Adaptation in the Era of Foundation Models
1:24:49

Cohere · 27.03.2026 · 151 views · 8 likes


Video description
In this presentation, we address domain adaptation in semantic segmentation, where deep learning models rely heavily on large labeled datasets and struggle with domain shift, limiting real-world generalization. We show how Foundation Models (FMs) can be adapted to overcome these challenges under resource constraints through three key contributions. First, we present DATUM, a one-shot unsupervised domain adaptation approach that personalizes text-to-image diffusion models to generate diverse, style-consistent training data from a single target image. Next, we introduce CLOUDS, a collaborative framework in which multiple foundation models, such as CLIP, large language models, diffusion models, and the Segment Anything Model, work together to generate synthetic data and automate the creation of high-quality pseudo-labels for self-training, enabling improved domain generalization. Finally, we discuss FLOSS, a training-free strategy for open-vocabulary segmentation that enhances CLIP’s performance by automatically discovering class-specific “expert” text templates.

Yasser Benigmin is a recent PhD graduate in Computer Vision within the Multimedia team at Telecom Paris and the VISTA team at LIX (Laboratoire d'Informatique de l'X) at École Polytechnique, supervised by Stéphane Lathuilière, Vicky Kalogeiton, and Slim Essid. His research focuses on domain adaptation for semantic segmentation leveraging foundation models, with a particular emphasis on resource-constrained scenarios. Previously, he interned at INRIA Paris in the Astra-Vision team, working on open-vocabulary semantic segmentation under Raoul de Charette. Yasser holds an engineering degree from École des Mines de Saint-Étienne and completed an exchange year at EURECOM.

This session is brought to you by the Cohere Labs Open Science Community - a space where ML researchers, engineers, linguists, social scientists, and lifelong learners connect and collaborate with each other.
We'd like to extend a special thank you to Benedict Emoekabu and Mayank Bhaskar, Leads of our Computer Vision group, for their dedication in organizing this event. If you’re interested in sharing your work, we welcome you to join us! Simply fill out the form at https://forms.gle/ALND9i6KouEEpCnz6 to express your interest in becoming a speaker. Join the Cohere Labs Open Science Community to see a full list of upcoming events (https://tinyurl.com/CohereLabsCommunityApp).

Table of contents (17 segments)

Segment 1 (00:00 - 05:00)

So, like I said, this presentation will be about domain adaptation in the era of foundation models, and it is a presentation of my PhD. My supervisors were Stéphane Lathuilière, my thesis director, now at Inria; Vicky Kalogeiton, who is a professor at École Polytechnique; and Slim Essid, who is a senior research scientist. So let's get started. In computer vision, if we had to pick the three most popular tasks, they would be image classification, where the objective is to assign an image-level label to a given image; object detection, where the objective is to draw bounding boxes around objects and assign them a semantic class; and image segmentation, which is our task of interest today, where the objective is to assign a class to every pixel in the image. For semantic segmentation, if we have a high-resolution image like this one, segmenting it fully, pixel by pixel, takes a lot of time, because it has around 2 million pixels. This image is taken from Cityscapes, a really popular dataset in the semantic segmentation community, and, like I said, annotating millions of images like this creates an annotation bottleneck. So collecting a lot of data for this task is harder than for image classification, for instance. One thing we can do instead is synthetic data training: maybe I cannot collect a lot of real-life images, because it takes a lot of time, so what if I use a game engine? Today we have game engines that can generate a lot of images, as you can see.
We can take the engine of a really good video game with realistic images, for instance GTA, and generate a lot of images; and since the computer is generating the images, it can also generate their annotations. So we can get a huge amount of labelled synthetic data just by using game engines. If we train a model on this synthetic data and then also test on the same synthetic data, then, assuming the training went well, we usually get high accuracy, because the model has been trained and tested on the same kind of data. Everything is good; we are happy. However, if we test this synthetically trained model on real-life data, which is what we actually care about, we see low accuracy, and this is because of what we call domain shift. So the question we are going to try to answer is: what is domain shift? The domain shift we will talk about today is a shift in terms of appearance. On the left is synthetic data, images coming from GTA 5; on the right is real-life data. This is what we call synthetic-to-real domain shift. The classes are the same, we have the same car on the left and on the right, but the texture is different and the illumination is different. This is the kind of domain shift we will try to tackle today: domain shift related to the appearance of images. The same goes for real-to-adverse-weather conditions: in nighttime images, objects are more difficult to recognize than in daylight images, as we can see in the middle. These are the two main domain shifts we will talk about today.
One of the main settings the research community developed to tackle this domain shift problem is unsupervised domain adaptation, where we assume that during training we have access to a labelled source domain, but also to some unlabelled images from the target domain, because target labels are the ones that are hard to collect; and at inference we need a model that works on this distribution-shifted target domain. So we have a difference in distribution between the source and the target domain, and the objective of unsupervised domain adaptation is to train a model using labelled source data and unlabelled, distribution-shifted target data, such that at inference the model works well on the target domain. To see how this can be done, take a segmentation model and give it synthetic and real-life data. If we look

Segment 2 (05:00 - 10:00)

at the feature representations of these two domains: if we project them onto a two-dimensional plane using PCA or similar projection algorithms, we see that the two sets of points are very far from each other. The objective of unsupervised domain adaptation is to take these two distributions and learn what we call domain-invariant features, features that are invariant from one domain to the other, so that the two distributions collapse into one area: basically, we should no longer be able to classify a feature as coming from synthetic or real-life data. If we achieve that, we have achieved adaptation, and for the segmentation model, synthetic data and real-life data are basically the same. That is the ultimate goal of unsupervised domain adaptation, and to reach it we have seen several paradigms in the community. The first were discrepancy-based methods, where researchers defined statistical distances, such as maximum mean discrepancy, and minimized them during training. We also saw adversarial methods, where a discriminator is trained to classify features as coming from the source or the target domain, while the feature extractor is trained adversarially, using tricks such as gradient reversal, so that the two become indistinguishable. The last family of methods, which became really popular in recent years, is self-training, where the objective is to train the model on the target data using its own predictions, what we call pseudo-labels, and this is the family of methods we are going to see today.
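As a toy illustration of the two ideas just mentioned, projecting features from both domains with PCA and measuring a statistical discrepancy between them, here is a minimal numpy sketch. All features here are random stand-ins, not real encoder outputs, and the discrepancy shown is the simplest linear variant of MMD (distance between domain means):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for encoder features of the two domains:
# synthetic-domain features are shifted away from real-domain features.
feats_syn = rng.normal(loc=0.0, scale=1.0, size=(200, 64))
feats_real = rng.normal(loc=3.0, scale=1.0, size=(200, 64))

# Project both domains onto the top-2 principal components of the pooled set;
# before adaptation, the two clouds sit far apart in this 2-D plot.
pooled = np.vstack([feats_syn, feats_real])
pooled_centered = pooled - pooled.mean(axis=0)
_, _, vt = np.linalg.svd(pooled_centered, full_matrices=False)
proj = pooled_centered @ vt[:2].T          # (400, 2) points one could scatter-plot

# Simplest discrepancy measure (linear MMD): distance between domain means.
# Discrepancy-based UDA methods minimize a term like this during training.
mmd = float(np.linalg.norm(feats_syn.mean(axis=0) - feats_real.mean(axis=0)))
print(proj.shape, round(mmd, 1))
```

Driving this `mmd` term toward zero during training is what "collapsing the two distributions into one area" means in practice.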
Another question we can ask is why unsupervised domain adaptation is important. It is used in many domains. It can be medical imaging, for instance classifying tumors across images from different MRI machines. It can be autonomous driving: you may want to train a vehicle to drive fully autonomously in New York, and then have the same car work really well in Paris, London, or other cities. It can also be satellite imagery: regions differ, and we may not be able to train on satellite images covering the whole world, so we want to train our model to recognize areas in a given region and still have it work reasonably well on other regions of the world. So unsupervised domain adaptation is used in many areas, and that is why it is really important to work on this setting. During my PhD, if we look at this timeline: since 2012 we have had the AlexNet moment, after which deep learning became really popular; in 2016 we saw ResNets and Mask R-CNN, deep learning architectures that kept evolving and were tailored to given tasks such as object detection, semantic segmentation, and so on. Then, since 2021, something happened: we started seeing huge models, such as CLIP, a model trained on 400 million image-text pairs. The datasets got extremely big, but so did the models: Stable Diffusion was trained on billions of image-text pairs, ChatGPT on roughly 300 billion tokens, and SAM on more than a billion masks. So since 2021 we have had these really big models trained on huge data.
They are big because they have a lot of parameters, but also because they have been trained on a lot of data. Around 2021 we can draw a split between two eras: the supervised learning era, from 2012 up to 2021, and what we now call the foundation models era, with these big models handling a lot of data. That is when I started my PhD, in late 2021. So the question we are going to try to answer is the following: since we are in this foundation models era, how can foundation models advance domain adaptation for the task of semantic segmentation? This will be the topic of the presentation, and to answer this question we are going to tackle several settings and see how foundation models can help in each. The first is one-shot domain adaptation, which we will see later; there, we assume a resource-constrained scenario where we only have one image from the target domain. We're going to

Segment 3 (10:00 - 15:00)

see domain generalization, where we assume absolutely no knowledge about the target domain during training: we only have access to the labelled source domain, and still we want to generalize to completely unseen domains. During my PhD we also tackled black-box domain adaptation, where we assume no access to the source data, only to an API model that gives us predictions that are completely one-hot; we have no knowledge of the API model's weights or of the logits associated with its predictions, and still we need to use this model to train a local model on some unlabelled data from the target domain. The last setting is open-vocabulary semantic segmentation, a setting built around a foundation model, in this case CLIP. So the three works I am going to present cover one-shot domain adaptation, domain generalization, and open-vocabulary semantic segmentation; for the sake of time, I will skip black-box domain adaptation. The first work was presented at a CVPR workshop in 2023, the second, on domain generalization, was accepted at CVPR 2024, and the last was accepted at ICCV 2025. The first one is DATUM, and the question we are going to try to answer is: how can we overcome data scarcity in the target domain during the training phase? Like I said, one-shot domain adaptation is slightly different from unsupervised domain adaptation, because we only have access to a single image from the target domain. So we have the labelled source domain and one image from the target, and we still need to build a model with good performance on the target domain at inference.
When we started working on this problem in 2022, state-of-the-art UDA methods such as DAFormer and HRDA were really good and really popular for unsupervised domain adaptation in semantic segmentation. But when we put these methods in this resource-constrained, one-shot scenario, their performance drops heavily: DAFormer drops by about 20 points of mIoU, and the same happens for HRDA. mIoU is the metric we use for semantic segmentation: given an input image, we have a prediction map and a ground-truth map, and we compute the area of overlap divided by the area of union, averaged over classes. If the predicted map is exactly the same as the ground truth, we have an IoU of 100%. So state-of-the-art methods struggle in one-shot domain adaptation, and previous works targeting the one-shot setting were basically using style transfer, which was popular at the time. They took the source images and the single target image, and pasted the statistics of the target image onto the source images to create a new stylized dataset while keeping the same annotations: the annotations do not change before and after style transfer, but the input images now carry statistics coming from the target image. This was done using instance normalization techniques. The question we ask is: here we are only doing style transfer, but can we enable both content and style diversity? That is the ultimate goal; the objects did not change in shape, as you can see the truck on the right is the same as on the left.
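The mIoU computation described above can be sketched in a few lines of numpy. This is a simplified per-image version; real benchmarks accumulate intersections and unions over the whole dataset before dividing:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """mIoU: per-class intersection-over-union, averaged over classes
    that appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Tiny 2x2 example: classes 0 and 1, one mislabelled pixel.
pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, gt, num_classes=2))  # class 0 -> 1/2, class 1 -> 2/3, mean = 7/12
```

A perfect prediction gives an mIoU of 1.0 (100%), matching the definition above.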
So, in this foundation models era, can we enable both content and style diversity? That is the question we are going to try to answer. In 2022 we had the emergence of text-to-image diffusion models. These are models trained on a lot of images, like we saw earlier, billions of images, using a reconstruction loss, in order to generate new images. At inference, diffusion models can generate photorealistic images while giving us the ability to control both content and style, and this is what we are going to use in our setting of one-shot domain adaptation. The pipeline is simple: we take a text-to-image diffusion model, give it some text conditioning, such as "a photo of a car", "a photo of a bus", "a photo of a motorcycle", and so on, and generate some images. Then we take these images and plug them into the existing UDA methods that

Segment 4 (15:00 - 20:00)

we saw earlier. Like we said, UDA methods struggle in one-shot domain adaptation because they were built under the assumption of having access to many images from the target domain; since in one-shot UDA we do not have many images, let's generate synthetic data and plug that data into the existing UDA methods. We now have what we call a pseudo target domain, which is a synthetic domain generated by a diffusion model. The UDA methods are kept as they are: we make no change to them, we just generate synthetic data. When we use these UDA methods with this data, performance improves: DAFormer plus data coming from a diffusion model improves by about 4 points of mIoU, and the same holds for HRDA. These are the results we got going from GTA, a synthetic domain, to Cityscapes: generating synthetic data with a diffusion model and giving it to existing unsupervised domain adaptation methods helps us get good performance on Cityscapes, a real-life dataset composed of images from German cities. So we are happy, we have an improvement, and synthetic data helps domain adaptation. But if we look at the generated data, we have a photo of a car that looks like this, a traffic sign that looks like this, and the same for the bus and the motorcycle: these images are very different from the target domain we want to mimic, which is what we care about. These are images from Cityscapes; this is how German cities look, and the diffusion model, which at the time was Stable Diffusion, cannot generate images that look like German cities.
So we still have a discrepancy between the generated data and the target domain we care about, and the question becomes: how can we generate data that is aligned with the target domain? The technique we are going to use is DreamBooth. DreamBooth is a technique that personalizes a diffusion model to generate new content. Say we have images of a dog that Stable Diffusion has never seen before: DreamBooth takes the diffusion model and injects the knowledge of this new dog into its pre-trained knowledge, so that you can generate new images of this same dog in new environments, where the knowledge of the environments comes from the pre-trained knowledge of the diffusion model. Looking a little more closely at the DreamBooth technique: it is a fine-tuning technique that takes images of the dog and a pre-trained, text-conditioned diffusion model, and associates a unique identifier in the text prompt: a really rare token that will be bound to these new images, to this new concept. By fine-tuning, that is, training the model to regenerate the images of the dog while associating this rare token with them, we get a personalized, fine-tuned text-to-image diffusion model. When this fine-tuned model is prompted with the rare token, for instance "a photo of a [V] dog on the beach", we get the same dog we had at the beginning, now generated on the beach. So the model can still recover the dog, but it can also recover the environments around the dog.
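The rare-token mechanics just described boil down to how the captions are built; here is a small sketch. The token `sks` and the class word are illustrative choices, not necessarily the ones used in the paper (DreamBooth implementations commonly use a rare token like `sks` as the identifier):

```python
# Caption construction for DreamBooth-style personalization (a sketch:
# the rare token "sks" and the class word are illustrative choices).
RARE_TOKEN = "sks"

def finetune_caption(class_word: str) -> str:
    """Caption bound to the new concept during fine-tuning."""
    return f"a photo of a {RARE_TOKEN} {class_word}"

def inference_prompt(class_word: str, context: str) -> str:
    """At inference, the same rare token recalls the injected concept,
    while the surrounding text pulls in pre-trained knowledge."""
    return f"a photo of a {RARE_TOKEN} {class_word} {context}"

print(finetune_caption("dog"))                  # a photo of a sks dog
print(inference_prompt("dog", "on the beach"))  # a photo of a sks dog on the beach
```

The fine-tuning caption binds the rare token to the new images; reusing it at inference, with extra context appended, is what combines the injected concept with the model's pre-trained environments.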
Here the beach, and here the colorful carpet. That is what we will do in our case: we will personalize the diffusion model to generate data that looks like the target domain, with the target domain as the concept that gets injected. It looks like this: we fine-tune our diffusion model with the single target image using DreamBooth. We take our target image, the only one we have access to, apply a multi-cropping strategy to get a few samples from that single image, associate the rare token with these crops, and fine-tune the diffusion model. Having done that, we can generate, as you can see on the right, with a prompt like "a photo of a [V] car", cars that are more realistic, that look like the target domain we wanted to mimic. So that is what we do: we use class-specific prompts, like I said, plus this unique identifier that we used during training

Segment 5 (20:00 - 25:00)

to increase image diversity but also alignment with the target domain. Now we move from this to this. A prompt like "a photo of a traffic sign", given to the base text-to-image diffusion model, produces an image like the one on the left; if we instead use the new rare token we injected, with the personalized text-to-image diffusion model, we can generate images that are more aligned with the target domain. So basically we go from images that look like this to this, and the images in the middle are much closer to the target domain than the initial ones, which is what we wanted at the end of the day: to generate images that are closer to the target domain. Now, if we plug this new data into the existing UDA methods, we keep improving: the data generated by our personalized diffusion model helps us segment the Cityscapes dataset really well, with improvements of 4.2 and 4.7 mIoU on two different methods, DAFormer and HRDA. This is something we definitely want, and one question we can ask is: how long should we fine-tune the text-to-image diffusion model? Because, as we said, DreamBooth involves fine-tuning a diffusion model. If we look at the number of fine-tuning steps: zero steps means the model has not been fine-tuned at all, i.e., we use Stable Diffusion as it is, and the images are clearly not realistic and do not look like the target domain we want to mimic. After 200 steps, these are the images we used in the final paper. But if we keep training to 800 steps, the images get even closer to the target domain, so these images should be even better for adaptation, right?
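One simple way to quantify what happens to a generated set as fine-tuning continues is a diversity proxy such as the average pairwise distance between feature vectors of generated images. This is a toy numpy sketch with hypothetical features, not the metric from the paper:

```python
import numpy as np

def mean_pairwise_distance(feats: np.ndarray) -> float:
    """Average pairwise L2 distance between feature vectors: a crude
    proxy for the diversity of a generated image set."""
    diffs = feats[:, None, :] - feats[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))
    n = len(feats)
    return float(d.sum() / (n * (n - 1)))

rng = np.random.default_rng(0)
diverse   = rng.normal(scale=1.0, size=(50, 16))   # early fine-tuning: spread out
collapsed = rng.normal(scale=0.05, size=(50, 16))  # over-fine-tuned: near-duplicates
print(mean_pairwise_distance(diverse) > mean_pairwise_distance(collapsed))  # True
```

Tracking a proxy like this alongside a target-alignment score is one way to locate the sweet spot between diversity and alignment.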
At 800 steps, we have images that are really close to the target domain. But when we look at the results, using the data generated after 800 steps of fine-tuning gives a decrease in performance on both methods. Why? Because we actually lose diversity; that is the problem. If we go back to this slide, you see the images are nearly the same, and for adaptation we still want to preserve diversity while keeping a certain degree of alignment. So there is a sweet spot to find: fine-tuning the diffusion model while keeping its initial diversity. That is why we did ablation studies in the paper, and we saw that 200 steps is the sweet spot between having diverse images and having images aligned with the target domain. So when you do synthetic data generation, you also want to make sure you do not lose diversity; diversity is extremely important for deep learning models that can generalize. As takeaways of our method: personalizing a text-to-image diffusion model is really important, especially when you have a target domain with only a few images; personalizing the diffusion model on these few images can generate a lot of data, but it requires careful fine-tuning. Also note that we generated synthetic data and then used self-training, since HRDA and DAFormer are self-training-based methods, but we did not apply any filtering to the generated data: we used it as is to train the unsupervised domain adaptation methods, and I am going to come
back to this filtering at the end of the presentation. But here, in our method, we did no filtering on the synthetic data. Now, a second question we can ask is: what if we have no access at all to the target domain, meaning we only have access to the source domain? At inference, the target domain can be anything: images coming from different regions of the world, from different weather conditions; we do not know what to expect at inference time. In this case we are in a domain generalization setting, which is a slightly harder setting than one-shot domain adaptation, and this is the second work we will present. The question here is: how can we train a model on a source domain only and still

Segment 6 (25:00 - 30:00)

generalize during inference to unseen domains. This is called domain generalized semantic segmentation: it uses labelled source data only, and still we need to test on nine unseen domains. Again, at the time, in 2022-2023, stylization was really popular, and many methods were doing what we call domain randomization: applying some kind of fancy stylization to the initial source images in order to create many copies of the same image, and training on all these copies. This is a kind of style diversification, but they were focusing on style diversification only; content was not taken into account. As you can see, it is the same image: the car is in the same place, there are no new objects, and the composition of the image remains the same. That was the first family of methods trying to do domain generalization: through style diversification. The second family did something slightly different; at the time these were CNN-based methods, for instance on VGG, and the idea was to take many images from the source domain, erase domain-specific features from them, and enforce the model to learn domain-invariant features, again the same goal, by eliminating domain-specific features with instance normalization methods and renormalization tricks. So this second family of methods was designing tailor-made modules.
On the left you see the domain randomization techniques and the tailor-made modules. Both of these families of methods were using ImageNet pre-training and CNN architectures, yet by 2023-2024 we already had different new models: transformer-based, with large-scale pre-training, and with a lot of generative power in the case of diffusion models. So the question we are going to ask is: how can we use foundation models here? We felt that domain generalization was a setting lagging a bit behind the latest advancements in deep learning, and we are going to use the new models to provide a new starting point for this setting. The first thing we need for domain generalization is feature representations: I need robust feature representations to generalize to any domain during inference. One thing I have is the CLIP vision encoder. CLIP was released in 2021, and its feature representations are robust because of its contrastive training on a 400-million image-text dataset crawled from the internet. So we take a CLIP image encoder, add a decoder, in our case a Mask2Former decoder, and do classic supervised training on the source domain, which is the only thing we have access to. But we keep the CLIP encoder frozen, in order to avoid catastrophic forgetting and to keep its initial feature representations. So the only thing trained here is the decoder.
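The training recipe above, frozen encoder plus gradient updates to the decoder only, can be illustrated with a toy linear model in numpy. Everything here is a hypothetical stand-in; the actual model is a CLIP encoder with a Mask2Former decoder trained with a segmentation loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a frozen "encoder" (fixed random projection) and a
# trainable linear "decoder" fitted by plain gradient descent.
W_enc = rng.normal(size=(8, 16))        # frozen: receives no updates
W_dec = np.zeros((16, 4))               # trainable
W_enc_before = W_enc.copy()

x = rng.normal(size=(32, 8))            # a batch of source-domain inputs
y = rng.normal(size=(32, 4))            # their (toy) supervision targets

for _ in range(300):
    z = x @ W_enc                       # features from the frozen encoder
    grad = z.T @ (z @ W_dec - y) / len(x)   # gradient w.r.t. the decoder only
    W_dec -= 0.01 * grad                # the encoder weights are never touched

loss = float(((x @ W_enc @ W_dec - y) ** 2).mean())
print(round(loss, 3))
```

In a deep learning framework the same idea is expressed by disabling gradients on the encoder parameters and passing only the decoder parameters to the optimizer.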
When we do this, we improve the results by training on the source domain only, and we even outperform all existing methods. From the left, IBN, ISW, SHADE, TLDR, and other existing domain generalization methods each have their own technique: some works did texture randomization or style diversification, like I said, and some injected tailor-made modules into the architecture. But just using a frozen CLIP and a Mask2Former decoder, a really recent decoder, we outperform the existing methods on three different datasets. These datasets have not been seen during training; they are target domains given to the model only at inference, and we improve over all existing methods. Admittedly, it is not an apples-to-apples comparison, because we are using CLIP, a stronger visual encoder, and a stronger decoder; but rather than being strictly comparable, what we want here is to provide a new starting point, so that upcoming works build on top of this new foundation. And this is what actually happened in the following years: in 2024 and 2025, essentially all methods doing domain generalization for semantic segmentation use CLIP models, DINO models, and, basically,

Segment 7 (30:00 - 35:00)

yeah, foundation models in general. So the purpose of this work was to provide a new starting point for domain generalization, but we are not going to stop here. We have a CLIP model that is already better than all the existing state-of-the-art works, but we are going to push further. The next question we asked is: how can I simulate plausible target domains? I don't know the target domains that I will have during inference, but can I already simulate them during training? So again we are going to do synthetic data generation: I need to generate sufficiently diverse synthetic data. We saw earlier that having diverse synthetic data is really important, and it is even more important here, as we don't know the target domain we will face. We are going to use, again, a text-to-image diffusion model to generate photorealistic data using text conditioning, and we are going to increase diversity already at the level of the text conditioning by using a large language model. At the time, LLaMA was really popular, so we used LLaMA, asking it to give us a lot of text prompts, as you can see here: "a photo of a busy road in daylight", "a snapshot of a man crossing the road", and so on. We ask the model for prompts that depict urban scenes, because the only thing we assume will stay the same from training to inference is that we will have urban scenes; this is the only metadata we know. By giving these text prompts to the text-to-image diffusion model, it generates a lot of really diverse images for us, and since we have this generated data, we are going to do self-training. This is the first time that self-training is done in domain generalization.
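As a small sketch of the text-conditioning step (the scene descriptions below are illustrative stand-ins for LLaMA outputs, and the suffix is an assumption, not the paper's exact prompt format):

```python
# Build diverse diffusion prompts from LLM-style scene descriptions,
# anchored on the one piece of metadata we assume: urban scenes.
scene_descriptions = [
    "a photo of a busy road in daylight",
    "a snapshot of a man crossing the road",
    "a rainy street at dusk with parked cars",
]

def build_prompts(descriptions, suffix="in an urban scene"):
    """Append the known metadata so every prompt stays on-domain."""
    return [f"{d}, {suffix}" for d in descriptions]

prompts = build_prompts(scene_descriptions)
# each prompt would then condition a text-to-image diffusion model
```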
So we are going to self-train Mask2Former on the generated data using pseudo-labels. It looks like this: we have our source training that we saw earlier, and we have our data generated with the text-to-image diffusion model, and we train in a student-teacher fashion, which is one of the classical ways of doing self-training. The teacher is an exponential moving average (EMA) of the decoder. Like I said, the CLIP encoder is frozen, so there is no teacher for the encoder: it is kept completely untouched, to avoid any catastrophic forgetting. One stream, which is heavily augmented, goes through the student branch, so the decoder gives us a prediction on these images; the same image is given to the EMA teacher, which gives us the pseudo-labels. This is called self-training: the student is trained on predictions coming from the teacher, and the teacher is a smoothed, exponentially-moving-average version of the decoder. The hope is that the model keeps improving over iterations; that is our assumption. If we do this and look at the performance before and after self-training on different backbones (CLIP with a ResNet-50 backbone, ResNet-101, ConvNeXt-L), with the average mIoU computed on three different datasets (Cityscapes, BDD and Mapillary), we see that the improvement is really marginal. We only marginally improve over the different backbones, so the result is not really satisfying: self-training does not work if you do it this way. Why? Because the images we have here have never been seen by the model.
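The EMA teacher update described above can be sketched as follows (the momentum value and naming are illustrative, not taken from the paper):

```python
import numpy as np

# Student-teacher self-training: after each step, the teacher decoder is
# updated as a smoothed (exponential moving average) copy of the student.
def ema_update(teacher, student, momentum=0.9):
    """teacher <- momentum * teacher + (1 - momentum) * student."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}

student = {"w": np.ones(3)}    # stand-in for the student decoder weights
teacher = {"w": np.zeros(3)}   # teacher starts as a separate copy

for _ in range(10):
    teacher = ema_update(teacher, student)
# after 10 steps the teacher has moved a fraction 1 - 0.9**10 of the way
```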
So it is really hard to get reliable pseudo-labels, and this self-training technique is definitely tied to having really good pseudo-labels: if you don't have good pseudo-labels, self-training will not work. That is what we see here: we only marginally improve, so self-training, as done so far, does not work. And it is actually expected when you look at the data. This is an input image we have; when we give it to the teacher, this is the prediction it gives. For instance, this carpet is segmented as a sidewalk, even though we know it is not a sidewalk and should not be segmented as one. Same thing here: we have sidewalks in the middle of what looks like a garden. Here, this umbrella is segmented as a building, even though it is not a building. And here, as you can see, the pseudo-label is really noisy: we don't have a good segmentation of this traffic light. So these are the highlighted reasons why self-training did not work: we don't have a good segmentation of these synthetic images, so we have noisy pseudo-labels, and self-training fails. But something we can do is the following. How can we

Segment 8 (35:00 - 40:00)

improve the quality of the pseudo-labels? Segment Anything (SAM) had just been released, about four months earlier at the time. It was really good at segmenting images, but it gave mask predictions without the semantic classes associated with the masks: it was only able to give really sharp, good mask predictions. SAM is a model trained in a recursive way, as you can see here, going from the model to the data, and the dataset is huge: 11 million images and more than 1 billion masks. It is a transformer-based method, and it is what we call a promptable segmentation model: you can give it a segmentation prompt (some dots in the image, some bounding boxes, some coarse masks) and it gives you the mask associated with these geometric prompts. And this is the quality of the data SAM has been trained on: really sharp annotations. The key thing is the data; it is all about the data. For SAM, the annotation is extremely clean and sharp, which is why the model is able to give us such nice mask predictions, and we will use this power to our advantage here to improve our pseudo-labels. As you can see, we have our noisy pseudo-label and we are going to try to refine it into this one. How do we do that? We have what we call a prompt extraction module. A semantic segmentation map is basically a class-level mask like this one, and it can be split into binary masks, where each binary mask is associated with one class.
So here, as you can see, C is the number of classes. For this person, for instance, we can have a binary mask that corresponds to only this person; the very distant persons here, in red, will be filtered out by a filtering algorithm we implemented, which removes these noisy masks. So only the person remains, as the only connected component, and from it we extract some points, some dots. Since we know the class of these points (they belong to the class "person"), we give them to the Segment Anything Model, which returns the nice mask corresponding to these four dots. We do this for all the classes and then merge all the masks given by the Segment Anything Model. It is a way to reconstruct the segmentation map starting from the noisy segmentation map the teacher gave us: SAM is used to refine the segmentation map through this prompt extraction module. As for these black pixels here, this is a case where two masks belonging to two different classes overlap. When that happens, we play it safe: we discard all the intersection pixels and assign them a black "ignore" label, meaning they are completely undefined, and we will not supervise our model on these regions; all black pixels are discarded. If we apply the same technique to all these images, as you can see, many black pixels start to appear. The umbrella got completely discarded because it is an overlap.
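A simplified sketch of the prompt-extraction and merge logic (my own minimal version: SAM itself is replaced by the assumption that we already have one refined binary mask per class, and 255 is used as the ignore label):

```python
import numpy as np

IGNORE = 255  # pixels we refuse to supervise on

def point_prompts(label_map, cls, n_points=4, seed=0):
    """Sample pixel coordinates predicted as `cls`, to prompt SAM with."""
    ys, xs = np.nonzero(label_map == cls)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(ys), size=min(n_points, len(ys)), replace=False)
    return list(zip(ys[idx], xs[idx]))

def merge_masks(masks):
    """Merge refined binary masks {cls: HxW bool}; overlaps become IGNORE."""
    h, w = next(iter(masks.values())).shape
    claimed = sum(m.astype(np.int32) for m in masks.values())
    out = np.full((h, w), IGNORE, dtype=np.int32)
    for cls, m in masks.items():
        out[m & (claimed == 1)] = cls   # keep pixels claimed by exactly one class
    return out

# Two 2x4 class masks overlapping in column 1:
m0 = np.zeros((2, 4), bool); m0[:, :2] = True    # class 0 covers columns 0-1
m1 = np.zeros((2, 4), bool); m1[:, 1:3] = True   # class 1 covers columns 1-2
merged = merge_masks({0: m0, 1: m1})
# column 0 -> class 0, column 1 -> IGNORE (overlap), column 2 -> class 1,
# column 3 -> IGNORE (claimed by nobody)
```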
The umbrella is an object that is completely unwanted, and it may also be an intersection of different masks, so it is something we completely discarded. Same thing here: as you can see, the masks got even nicer, and so on. So this is the technique we use to refine the noisy segmentation maps, also discarding the regions where different masks intersect, because we don't want noisy labels; the most important thing in self-training is that you don't train your model on noisy labels. Now we plug this in: we have the noisy pseudo-labels, we extract prompts from them, and we give them to the Segment Anything Model. SAM takes as input the image, which goes through here, together with the geometric prompts, which are basically just dots, and gives us back the refined pseudo-labels. We then use these refined pseudo-labels to supervise our student model, which is basically the decoder. When we do this, the performance increases, again evaluated on different backbones, ResNet-50, ResNet-101 and ConvNeXt-L: the

Segment 9 (40:00 - 45:00)

performance improves, and this is the name of our final method, CLOUDS; as you can see, the performance improves consistently across different architectures. Another question we try to answer here is how much synthetic data we need, because we can generate as much as we want, 1,000, 100 million, 1 billion images, but what we saw is that after 5,000 images our model started plateauing in terms of performance. This is the performance of our final model, and here, as you can see, is a bar plot over the number of generated images. Of course, going from 100 to 500 images is good: you get more images, more diversity, more knowledge, and the model keeps improving. But after 5,000 images, the model starts plateauing, and this is for several reasons. One reason is that freezing the CLIP encoder also restricts us: the only parameters we can update are the decoder parameters. At the time we did not want to complicate things by putting adapters inside the CLIP encoder or anything like that, so we just kept it frozen; but this makes adaptation a bit harder, and we know that the feature representations live in the encoder, not in the decoder. So freezing the encoder is one thing that caused the plateau you see here after 5,000 images. Another is the diversity of the images: after generating a certain number of images, diversity became a problem, and we started seeing the same images coming back again and again. Also, our pseudo-label refinement module was not completely perfect.
Sometimes we still had noisy pseudo-labels. So for all these reasons, I think, we reached a plateau quite early: first, freezing the encoder makes adaptation harder; second, the lack of diversity when generating images with the diffusion model, where at some point the same image gets generated again and again; and third, the pseudo-label refinement module was not perfect. If you go to the paper, you will see some failure cases of our labeling module, I think in the appendix of our work. As takeaways: I believe that using an LLM to increase text diversity, like we did, is important; retaining the pre-trained knowledge of CLIP is also important, but, like I said, it also hurts us when it comes to adaptation, so maybe today using adapters inside the CLIP encoder, or only training the last layers of CLIP, would be interesting, at least better than freezing the encoder completely; and using SAM to improve the self-training pipeline. So again, we use different foundation models to advance the domain generalization setting. Note that we have no filtering on the generated data, as you can see. And something important: for this self-training loop to work really well, we need good initial pseudo-labels; if we don't have them, we cannot extract reliable geometric prompts, and then it is nearly impossible to recover and improve the initial pseudo-labels.
So for this module to work really well, going back and forth between the teacher and the Segment Anything Model, we need good initial pseudo-labels. Now, open-vocabulary semantic segmentation: we are going to look a little bit at how foundation models are used directly for the task of semantic segmentation. In the first work and the second one, we assumed a closed setting, meaning the model can only segment a fixed set of classes: if it is Cityscapes, it is only 19 classes; if it is the ADE20K dataset, it is 150; and so on. But in open-vocabulary segmentation, we want to be able to segment any class, and to do that we use CLIP, because CLIP has a text encoder, so you can prompt it with any class you want. This is what we call open-vocabulary semantic segmentation.

Segment 10 (45:00 - 50:00)

So these are two works that exist in the literature: CLIP-DINOiser, presented at ECCV 2024, and SFP, at ICCV 2025. These are what we call training-free open-vocabulary semantic segmentation models: models that can segment any class in any image. As you can see here, they mainly focused on improving the visual features of CLIP. Like I said, open-vocabulary segmentation needs the CLIP vision encoder and the CLIP text encoder, and these are the main figures taken from the two papers: most of each figure is about the vision encoder. For CLIP-DINOiser, this whole green area is about how to improve the vision features of CLIP in order to achieve better open-vocabulary semantic segmentation while staying training-free. Same thing here: this whole region is about improving the image representations. But the text prompt is something that has been used in its default form. As you can see, the classes we want to segment are given to the text encoder, and we compute a cosine similarity between the patch features, or image representations, and the text representations. Same here: we compute cosine similarity to obtain the segmentation maps. That is how it works. When it comes to the text representations, all the existing methods were using what we call the ImageNet templates. This is the classical way of doing classification with CLIP: you take your class and plug it into the ImageNet templates. These templates were prompt-engineered in the original CLIP paper to improve zero-shot image classification on ImageNet; that is why we call them the ImageNet templates.
There are 80 of them, and we plug the class into all these templates one by one, "a photo of a road", "a cartoon of a road", and so on, 80 times. We give them to the text encoder, obtain the text representations of these 80 templates, and average them. So here we take the average of the text representations of all the templates with the name "road" plugged in, and we obtain a good representation of the class; then we do this for all the classes. If you say, OK, I want to segment these four classes, meaning that given any image my objective is to find these four classes in it, you do this for each class, which gives you robust text representations for all the classes, one by one. This is how it works, and this is how all open-vocabulary methods were doing segmentation: using the ImageNet templates that were, again, engineered for image classification, but used for semantic segmentation. And we know it is not exactly the same thing; semantic segmentation is a slightly different problem, and we will see why. So, like I said: you have images, you have text, you obtain your robust text features, then you compute the cosine similarity between every patch and all the text representations and assign each patch to the closest one. This is how you do segmentation using CLIP models, and how open-vocabulary segmentation basically works. So the question we ask is: what if I use only one template? Instead of 80 templates, what if I use only "a photo of a"? Then I don't have the average anymore; I have only the text representation of "a photo of a road", and for the others, "a photo of a person", "a photo of a car", and so on.
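The mechanics just described can be sketched with a toy stand-in for the text encoder (a deterministic random projection; only the averaging and cosine-similarity logic mirrors the talk, not the actual CLIP model, and only three of the 80 templates are shown):

```python
import numpy as np

D = 16
classes = ["road", "person", "car", "sky"]
templates = ["a photo of a {}.", "a picture of a {}.", "a blurry photo of a {}."]

def encode_text(s):
    """Deterministic stand-in for CLIP's text encoder (unit-norm vector)."""
    v = np.random.default_rng(sum(s.encode())).normal(size=D)
    return v / np.linalg.norm(v)

def classifier(temps):
    """One unit-norm text embedding per class, averaged over `temps`."""
    rows = []
    for c in classes:
        mean = np.stack([encode_text(t.format(c)) for t in temps]).mean(0)
        rows.append(mean / np.linalg.norm(mean))
    return np.stack(rows)                    # (num_classes, D)

W_avg = classifier(templates)                # the usual averaged classifier
W_one = classifier(templates[:1])            # a single-template "model"

patches = np.random.default_rng(1).normal(size=(10, D))
patches /= np.linalg.norm(patches, axis=1, keepdims=True)
pred_avg = (patches @ W_avg.T).argmax(1)     # cosine similarity, nearest class
pred_one = (patches @ W_one.T).argmax(1)     # different templates, different maps
```

Changing the template set changes `W`, and therefore changes every patch assignment, which is why each of the 80 templates behaves like a different model.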
Then I compute the cosine similarity the same way it is done in all the existing open-vocabulary methods, and I get my segmentation maps. Then what if I do it again, using "a picture of a"? I do the same thing and obtain segmentation maps. And I do this 80 times, because I have 80 templates; so I now have, basically, 80 different models, because each template changes the text features, and when you change the text features you also change the segmentation maps. So I do this for all the templates, and here is what I see. I look at the performance of these 80 models in a scatter plot, where on the x-axis I have all the classes I wanted to segment and on the y-axis the performance of these 80 models. What I see is that, for each class, some templates perform better than the average classifier. This template, for instance, maybe template number 55, corresponds to "a blurry photo of a something", where you plug the semantic class into the "something". Putting in "a blurry photo of a sky" gives you a performance that is about +20% with respect

Segment 11 (50:00 - 55:00)

to this black horizontal line you see here: it gives you +20%. So using some templates alone performs better than the average classifier, and this is a phenomenon we see for each class: for each class, some templates, when used alone, are better than using all 80 templates at the same time. This was something we were not expecting; we did not expect such a big gap, maybe because the average is used precisely to get good performance on average. But when we look at class-specific performance, we see that for each class there is a template that performs way better than the average classifier. This is a curious observation, and we started asking questions based on it. The first question we ask is: how can we identify the class experts without labels? Here we were able to find these templates because we had computed the IoU: we had access to labels, we computed the performance of all 80 models, and we saw that template 55, for instance, is the best one for the class "sky", giving +20% IoU on that class, because we had access to labels. But if I don't have labels, can I still identify these class experts, meaning, again, all the templates that perform better than the average classifier? One thing I can use is entropy. Entropy is an unsupervised metric: I can compute it just from the predictions given by each template. And I know that when the entropy is really low, the confidence is high, and vice versa: when the entropy is high, the confidence is probably low.
"Probably" means that, if the model is well calibrated, or sufficiently calibrated, high entropy indicates the model is making bad predictions. So we are going to use entropy as a proxy metric for accuracy, and we do it this way: we take one template, plug in all the classes, compute the segmentation maps, and then compute what we call a class-wise entropy. To get an entropy value for the class "road", I take all the pixels predicted as "road", compute the entropy using the classical formula, minus p log p, average over all those pixels, and do this for all the classes. Then I move to template number two, "a picture of a", and compute its class-wise entropy. I do this 80 times, repeating the process for every single template, and we end up with a grid, a matrix where the columns are the templates and each row corresponds to a class. For each class, we select the top-k templates with the lowest entropy, because that is what we are looking for: the templates that gave the lowest entropy, which probably means the model was correctly predicting those classes. Since, again, I don't have labels to compute the accuracy, I use entropy as a proxy metric. I do this for all the classes: for the class "road", T1, T3 and T4 are the templates with the lowest entropy, and so on for the "person" class, the "car" class, and the others. So basically I have as many experts as classes. Now, to obtain what we call the class-expert predictions, to construct S1, which is basically our expert on the class "road", in
order to have good predictions on the class "road", we use T1, T3 and T4: we average the text features of only these three templates, the ones identified using entropy, instead of averaging over the 80 templates as everyone does. Then we compute the cosine similarity and obtain segmentation maps where the class "road" is well predicted; I assume I get good predictions on the class "road" using these three templates, averaged at the text-feature level, at the embedding level. Then we repeat the process for the second class: for it, the selected templates were 19, 5 and 45; we average those and obtain good predictions where the person is well segmented; and we do the same for all the classes. Here I did only four, because I assume access to only four classes. So now I have as many prediction maps as experts: for each pixel, each expert gives me a prediction. But I want one final map; I don't want as many maps as experts. At the end of the day I am doing open-vocabulary segmentation, so, given one input image,

Segment 12 (55:00 - 60:00)

I am supposed to deliver one segmentation map. So, since I now have as many segmentation maps as experts, how can we merge them? Someone might say we can take the average, or do majority voting, but we are going to follow the initial message of our method, which is that we have experts, and each expert is good on one given class; if we average, we lose this message. So we take the final map, we take one pixel, an empty pixel for now, with no class assigned, and we want to give it a class. We look at the predictions of the four experts. First, we only consider the experts that make a prediction on their own class of expertise. For instance, expert number three says that this pixel x1 is class 4, and expert number four, the expert on class 4, says that this pixel x1 is actually a person. I discard S3 and S4, because they made a prediction on a class that is not their class of expertise: I know that S3 is only good when it comes to cars, and S4 to boats, and they did not predict their own class, so they are discarded. I only look at the experts that made a prediction on their own respective class of expertise. So S1, the expert on the class "road", says this is a road, and the expert on the class "person" says this is a person. Now I have a conflict between two experts, each predicting its own class of expertise. Which one do I choose? I take expert number two. Why? Because it has the highest softmax probability.
So when I have conflicts between experts, I select the one with the highest softmax probability, because it is probably more confident: say expert number two has a softmax probability of 0.8 whereas S1 has only 0.6, so I take expert number two. In some cases, no expert makes a prediction on its own class of expertise. In that case we use the classical CLIP model, the average classifier that uses the 80 templates at the same time: since I cannot trust any expert, I fall back to the average classifier. This is our fallback strategy. When we look at the results, plugging our method into existing open-vocabulary methods such as MaskCLIP, NACLIP and CLIP-DINOiser, which are really recent methods, we see that we improve across different datasets, whether it is Cityscapes with 19 classes or COCO-Stuff with 171, and also Pascal VOC, Pascal Context-59, and ADE20K with 150 classes. So we improve, with different methods, across different datasets. Another question we ask is how many images we need to compute the entropy, since we do have access to training images: how many images do we need to identify the experts? On Pascal VOC, we improve starting from 25 images: given 25 unlabeled images from the Pascal VOC dataset, I can compute a reliable class-wise entropy over all the templates, select the good templates, and improve the method. So there we only start improving from 25 images, whereas for Cityscapes, for instance, a single image is sufficient, because Cityscapes is a really dense dataset.
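Putting together the two label-free ingredients described above, entropy-based expert selection and expert fusion with a fallback, here is a toy NumPy sketch (my own variable names and shapes, not the authors' code; for simplicity, expert map e is assumed to be the expert for class e):

```python
import numpy as np

def classwise_entropy(probs):
    """probs: (T, H, W, C) softmax maps, one per template -> (T, C) entropies."""
    T, _, _, C = probs.shape
    ent = -(probs * np.log(probs + 1e-12)).sum(-1)   # per-pixel entropy
    pred = probs.argmax(-1)
    out = np.full((T, C), np.inf)                    # inf if a class is never predicted
    for t in range(T):
        for c in range(C):
            m = pred[t] == c
            if m.any():
                out[t, c] = ent[t][m].mean()         # mean entropy on class-c pixels
    return out

def select_experts(probs, k=2):
    """Indices of the k lowest-entropy templates for each class: (k, C)."""
    return np.argsort(classwise_entropy(probs), axis=0)[:k]

def fuse(expert_probs, fallback_pred):
    """expert_probs: (C, H, W, C), map e being the expert for class e.
    Keep only experts predicting their own class, break conflicts by
    softmax confidence, else fall back to the average classifier."""
    E, H, W, _ = expert_probs.shape
    pred, conf = expert_probs.argmax(-1), expert_probs.max(-1)
    out, best = fallback_pred.copy(), np.full((H, W), -np.inf)
    for e in range(E):
        take = (pred[e] == e) & (conf[e] > best)   # own class, more confident
        out[take], best[take] = e, conf[e][take]
    return out

# Demo of the fusion rule on a single pixel with two experts in conflict:
p = np.array([[[[0.6, 0.4]]],    # expert for class 0 says class 0 (conf 0.6)
              [[[0.2, 0.8]]]])   # expert for class 1 says class 1 (conf 0.8)
fused = fuse(p, fallback_pred=np.zeros((1, 1), dtype=int))
# the more confident expert wins, so the pixel becomes class 1
```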
It is a dataset that contains many classes: here we see more than 12 or 13 classes in one image, and it is high resolution, so I have a lot of pixels. I can therefore compute a reliable entropy over many classes, so one image is already sufficient for a reliable entropy computation, hence a reliable class-expert identification, which leads to reliable results. Whereas in Pascal VOC, as we can see, one image may contain only one or two classes; that is why we needed 25 unlabeled images to start improving. So our method can also work with few images. As takeaways: the first thing I would like to point to is that CLIP is really sensitive to text templates, so the way we construct text templates is extremely important; and a few images are enough to detect the class experts in our case. Something to keep in mind is that inference cost grows with the number of classes, because, like we said, we construct one expert per class, and each expert gives us a prediction on the whole map. So basically we have as

Segment 13 (60:00 - 65:00)

many predictions as experts. So in terms of inference, when you have a lot of classes, the inference time starts to grow as well, and this is one of the downsides of our method. If you are in a setting where inference time is really important, our method is probably not the best thing to use; but if you are in a setting where you want the most accurate prediction possible and you are fine with waiting to get a good prediction, our method can be interesting, since we turn this sensitivity to text prompts to our advantage. Now, for future work, I would like to talk about a few things. As we can see here, in DATUM and CLOUDS, the first two methods presented, we did not have any filtering on the generated data, and we know that the quality of synthetic data is extremely important, even more so today, since LLMs have already been trained on essentially the whole internet and synthetic data is now used to improve models. So the quality of data matters. In our case we did not do that: we wanted to keep it simple, so we did no filtering on the synthetic data, and it already worked. But something we can think about is the following. We have, basically, a setup that looks like this: a conditioning signal, a generative model that generates synthetic data, and this synthetic data is given to the task model. Having filters before and after generation could be important: some filtering already on the conditioning, and some on the generated images, with rule-based methods or human feedback, for instance, before and after generation. This is something that can be done to improve synthetic data generation.
But something even more interesting is to use a vision-language model, a really big model that can handle the filtering for us, because with hand-written rules we may not cover all the cases. A VLM can take the text conditioning, look at the generated images, decide whether each image is good or not for training, and then send the image back to the generative model for editing, or change the text conditioning. We can even give the sentence to the VLM, which may find unwanted words, or delete words for privacy reasons, and then hand it back to the LLM to produce a new sentence. So that is one way to improve synthetic data generation: integrating a vision-language model in a feedback loop.

Something else we can do is to make synthetic data generation dynamic, because here it is static: you generate data, you give it to the task model, and the task model starts training. Instead, we can also look at the dynamics of the task model's training, the model we actually care about. We can look at how it converges and then generate images that accelerate convergence or improve the model on specific tasks. For instance, if after a few iterations we see that the task model is still not able to do a given task when we evaluate it after, say, 100 steps, we can give this signal to the VLM, which will then prompt the generative model to generate images for that task. So making synthetic data generation dynamic along the training of the task model is, I think, also important.
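The VLM feedback loop described above might look like the sketch below; `generate`, `vlm_judge`, and `edit_image` are hypothetical callables standing in for the diffusion model, the judging VLM, and an image-editing step:

```python
# Sketch of a VLM-in-the-loop refinement of synthetic samples.
# All three callables are stand-ins, not real model APIs.

def refine_with_vlm(prompt, generate, vlm_judge, edit_image, max_rounds=3):
    """Generate an image, ask a VLM whether it is usable for training,
    and apply the VLM's suggested edit until it passes or we give up."""
    image = generate(prompt)
    for _ in range(max_rounds):
        verdict = vlm_judge(prompt, image)  # e.g. {"ok": bool, "edit": str}
        if verdict["ok"]:
            return image                    # accepted for the task model
        image = edit_image(image, verdict["edit"])  # feed critique back
    return None  # discard samples the loop cannot repair
```

The same loop could equally rewrite the text conditioning instead of editing the image; the key design choice is that the VLM's critique flows back into generation rather than only gating it.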
Another thing I believe is important: as we saw in the beginning, we were talking about closed-set semantic segmentation, meaning you have images and a deep learning network trained on a fixed set of classes, which can only segment that fixed set of classes. Then, starting with CLIP, we moved to open-vocabulary semantic segmentation models: you have text prompts, you have images, and you can segment anything you want. But in all the methods we have today, we assume we know which classes we want to segment, because in the text prompts you put the classes that you actually see in the image. If you take an urban scene, an image with roads, cars, and pedestrians, but in the text prompt you put lion, zebra, and elephant, the CLIP model will segment all the pixels as lion, zebra, and elephant, even though they are pedestrians, cars, sky, and vegetation. And this is a problem, because in open-vocabulary semantic segmentation we make the assumption that we know the classes that we want to

Segment 14 (65:00 - 70:00)

segment. Yes, we put them in the text prompts, but we already know what we want to segment. In some cases you do not know what you want to segment, and then we are in what is called vocabulary-free semantic segmentation. It involves, for instance, a VLM again: you give it input images, the VLM recognizes all the classes that are in the image, and then you use the text prompts produced by the VLM in the CLIP model to segment the images. In this setting you remove the human from the loop and keep everything fully automated: you are given an image, you do not know what is in it, but you can still recover all the objects in it. So this is slightly different from open-vocabulary; it is called vocabulary-free, and it brings many challenges, such as recognizing all the small objects in the image; it becomes harder with high-resolution images that contain small objects in the middle of the scene. VLMs today also struggle with spatial recognition, especially on really high-resolution images.

As broader perspectives, I would like to talk about three things. The first is domain adaptation itself. I get asked a lot whether domain adaptation is still relevant in today's era, where foundation models have huge pre-trained knowledge and really good zero-shot performance. I still believe domain adaptation is interesting when it comes to data-poor domains like microscopy or historical documents.
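Returning to the vocabulary-free pipeline described above (a VLM proposes class names, and an open-vocabulary segmenter then uses them), a minimal sketch with stub model calls could look like this; both callables are hypothetical stand-ins:

```python
# Sketch of vocabulary-free semantic segmentation: a VLM first names
# the classes present in the image, then a CLIP-based open-vocabulary
# segmenter segments using those names. Both calls are stubs here.

def vocabulary_free_segment(image, vlm_list_classes, openvocab_segment):
    classes = sorted(set(vlm_list_classes(image)))  # dedupe VLM proposals
    mask = openvocab_segment(image, classes)        # per-pixel class ids
    return classes, mask
```

The human-supplied vocabulary is replaced by the VLM's proposals, which is exactly where the small-object and spatial-recognition failure modes mentioned above enter the pipeline.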
Sometimes we have only one copy of a historical document that someone found, and we would like, for instance, to classify it or run OCR on it; it can be a single image, and in that case you still need domain adaptation to adapt to this new domain. So I believe domain adaptation still has a role to play in data-poor domains. Something else we see today is that we do not want static models. We are moving from unsupervised domain adaptation to test-time and continual adaptation, because we want models that keep improving during inference. The classical models we have seen in recent years are good, but once they are deployed, they do not keep improving at inference time. So working on test-time adaptation and continual learning is extremely important in order to adapt to new incoming domains. The last thing is multimodal domain adaptation. Foundation models today are really big, they perform really well, and they are becoming multimodal: image, text, audio, and so on. I believe multimodal domain adaptation, which can involve adapting to one domain without breaking the alignment with all the other modalities, is also important, because foundation models are integrating many modalities. So that's it for me. I would like to thank my supervisors and my collaborators, who helped me during my PhD; I learned a lot from them, so thanks to them as well.
— Cool, excellent presentation. We can dive into the question-and-answer part.
— If you folks have any queries, please feel free to unmute yourself and ask.
— Hi. Thanks for the great talk, it was very insightful. You touched on this on your last slide: domain adaptation for data-poor domains. Think medical images, microscopy, or even worse, industrial images, where a lot of the data is IP-sensitive, so it is never public. These foundation models are trained on huge datasets, DINOv2 on about 150 million images, later versions on billions, but they never get to see this data. And what we see is that when augmenting datasets with synthesis, diffusion models fail to generate these images. So, you are quite the expert on this whole domain now, what is your insight on this?

Segment 15 (70:00 - 75:00)

— I agree with you: either for privacy reasons, or because there is only one document. I saw a keynote at ICCV where the researcher presented work in which some archaeologists found one document and wanted to do OCR on it, to detect all the characters. In this case it is a pure one-shot domain adaptation setting, because you only have one image, one object, and you want to do OCR on it. You are automatically data-poor, because no model has ever seen a document like this one; as a matter of fact, they dug it up, maybe 100 meters below the surface. They wanted to do OCR on this document in a non-invasive way, because they did not want to touch the historical document. So this is a classical domain adaptation setting, and a one-shot one. Foundation models never saw this document, so they may not work really well on it; they may not do OCR on it well, and they will struggle on this one historical document. In this case, I think using synthetic data to generate new synthetic historical documents and then training the model on this synthetic data is, I will not say the only way to go, but something that will help for sure, because of the classical assumption that with neural networks, the more data you have, the better. So generating synthetic data that looks like this single historical document that some archaeologist found is one way to go, and probably one of the primary things to try.
So yes, I think synthetic data still has a lot to offer, especially when it is coupled with domain adaptation.
— Right. What I was thinking was maybe techniques like DreamBooth, think of it as style transfer but more sophisticated: personalizing the diffusion model to this single target OCR image. If we are able to fine-tune the diffusion model to generate synthetic data of this one historical OCR image, then fine. But I am wondering: since these models have not seen this data, they have mostly been trained on natural images, if the diffusion model fails there, maybe we struggle to do anything downstream. It is just a thought.
— Yeah, definitely. If the diffusion model struggles to generate synthetic historical documents, then of course it is going to be a bit difficult. But maybe we should take the diffusion model, look at all the historical documents we have so far, and first fine-tune it on all of them. Then we would have a first version: a diffusion model fine-tuned on maybe hundreds or thousands of historical documents. From that point on, we use DreamBooth to fine-tune it to this one specific historical document we need to do OCR on. So maybe it can be a two-step technique.
— Yeah. What I liked about your talk was this whole pipeline approach of using the strengths of different blocks and joining them. So you are right.
I think that would be one way to go, though it remains to be seen, because we rely on the pre-trained knowledge of the diffusion model. If the diffusion model is not good enough to generate data for us, it is going to be difficult to do what we did in this work.
— Okay, thank you.
— No worries.
— Cool. So I wanted to ask about the world-model perspective, specifically Yann LeCun's approach to it with JEPA. There was a world-models paper I was reading today, it came out today or yesterday, where they characterized latent world-model approaches and trained a 15-million-parameter model on a single GPU, specifically

Segment 16 (75:00 - 80:00)

using JEPA, and it trains well; they added a lot of training-stabilization features, along with work on the dataset and the model architecture. In the paper they discuss planning, both latent planning and how we move towards physical understanding. So do you have any thoughts on how your existing research could be used to optimize world models, or ideas that could pave the way for a new generation of domain adaptation models that, in a world-model view, can automatically generalize to low-resource images they have not seen much?
— Yeah. When it comes to these kinds of world models that have a physical understanding: if we can have a generative model, or generate synthetic data, that follows physical laws, I think we may need even fewer images for synthetic data training, because we would have better-quality synthetic data. I honestly have not read the world-model paper yet, but I do believe this, because synthetic data training is deeply tied to the quality of the training data; we know that training on a few high-quality images is much better than training on millions of low-quality images. Adding physical understanding is definitely something that will help, because at the end of the day you want a model that interacts really well with its environment, and the environment is deeply tied to physical laws. So generating data that already depicts physical laws is definitely going to be helpful.
— True.
— And with the recursive approaches too: maybe you have heard about the ARC-AGI challenges. The team that led ARC-AGI-2 used a small recursive model, around 7 or 8 million parameters, built specifically for the challenge of unknown problem sets. They also generated synthetic data with multiple generalizable features that they could use in their training pipelines, which actually helped them win ARC-AGI-2 on Kaggle last year. So maybe combining that recursive approach with your low-resource setting, with data that is not widespread, could be used here too.
— Yeah, definitely. Thank you.
— Anything else?
— I have a question. Thank you for the talk. It is a more technical question on the third paper, the FLOSS paper. I like the way you deal with the template selection based on entropy; that is a good metric in my opinion. What I was thinking is: did you ever consider doing this template selection in an end-to-end fashion? For example, maybe you can learn a vector of weights, a weighted vector selection, and then you let the model

Segment 17 (80:00 - 84:00)

learn it based on an entropy loss or a feedback signal, instead of doing it by hand, because that would obviously become tedious beyond a few tasks.
— That is a nice question. Yes, we could do that. If you are referring to prompt learning, it would essentially be a prompt-learning approach: we would let the model, through gradient descent, select or assign weights to all the templates, with weights between zero and one and summing to one after a softmax, for instance. This is something we could do, but it would break the training-free idea. In our case we do not do any training: the template selection is done without any training, and the merging is also done without any backpropagation. So we would break the training-free property. But definitely, if you are willing to relax the training-free constraint, this is something to do: prompt learning, and even letting the model find other prompts, because here we are selecting from 80 templates, as you can see here.
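The learned-weights variant mentioned in the answer (one weight per template, softmaxed so the weights lie in (0, 1) and sum to one) could be sketched as follows; the function name and shapes are illustrative, not FLOSS code:

```python
# Sketch of softmax-weighted template combination, as in the
# prompt-learning alternative discussed above. Pure-Python stub.
import math

def combine_templates(template_embs, logits):
    """template_embs: list of T embedding vectors (each length D) for
    one class under T text templates; logits: T learnable weights.
    A softmax turns the logits into weights in (0, 1) summing to 1,
    and the class embedding is the weighted average."""
    m = max(logits)                               # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(template_embs[0])
    return [sum(w * emb[d] for w, emb in zip(weights, template_embs))
            for d in range(dim)]
```

In an end-to-end variant, the logits would be optimized by gradient descent against an entropy loss; that is exactly the training step the paper's training-free design avoids.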
We select from them the best templates for each class; as you can see, it is only 80 templates, but if you do prompt learning and let the model choose, you might even find other templates, or other regions of the textual representation space, that better segment the classes. In that case, though, you break the training-free constraint. In our case we wanted to keep the method really light, so we did not use any training. But yes, it is definitely something one could do: let the model choose via gradient descent, using entropy minimization for instance, in order to move in what we could call a direction of expertise for each class. I don't know if that answers your question.
— Thank you for that answer.
— Yeah, that is an excellent point too, with the entropy minimization.
— That is one way to do it, yes. Just to add something about it: entropy was the first technique we used, and we got good results with it, but after that we also explored other, more complicated unsupervised metrics. We went into a research area about unsupervised accuracy estimation and tried other metrics; we show some of them in the paper, and we saw that entropy worked really well. Entropy is simple, but it works even better than more complicated metrics that try to assess the accuracy of the model.
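A toy version of the entropy criterion in the spirit of FLOSS can make this concrete; the probability arrays are assumed precomputed (per-template averaged class probabilities), and this is a simplification, not the paper's exact scoring:

```python
# Sketch: pick, for each class, the text template whose predictions
# are most confident (lowest entropy). Simplified illustration only.
import math

def entropy(p):
    """Shannon entropy of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def select_expert_templates(probs):
    """probs[t][c] is the averaged class-probability vector obtained
    when class c uses text template t. For each class, return the
    index of the template with the lowest prediction entropy, i.e.
    the most confident 'expert' template for that class."""
    num_templates = len(probs)
    num_classes = len(probs[0])
    experts = []
    for c in range(num_classes):
        scores = [entropy(probs[t][c]) for t in range(num_templates)]
        experts.append(min(range(num_templates), key=scores.__getitem__))
    return experts
```

Because the selection is an argmin over precomputed statistics, no gradients or training are involved, which is the training-free property discussed above.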
So entropy was the best unsupervised metric we found. We show that it is better than three or four other metrics, but someone could take our work and improve on it, because entropy may not be the best; for that, on the GitHub repo we have a Colab notebook that people can use to try to find better metrics than entropy and improve on top of our results.
— True, true. Also, you might have noticed the hierarchical latent-entropy mechanisms that have appeared in the literature over the past couple of years, in which the different layers of latents are organized in a hierarchy and, via entropy maximization, guided to follow a particular path, specifically in embodied settings like robotic manipulation. So that is a very good path to follow too. But anyway, I am digressing. Any more queries? If not, I think we can call it a day. Thank you for your excellent presentation.
— Thank you very much, and thank you for inviting me. Thank you also to everyone here listening to the presentation. Thank you very much.
