NeurIPS 2023 Poster Session 2 (Wednesday Morning)
44:16


Yannic Kilcher · 16.12.2023 · 10,611 views · 214 likes


Video description
Papers:
0:30 - Bifurcations and loss jumps in RNN training — https://arxiv.org/abs/2310.17561
8:40 - LEDITS++: Limitless Image Editing using Text-to-Image Models — https://arxiv.org/abs/2311.16711
13:50 - Lexinvariant Language Models — https://arxiv.org/abs/2305.16349
19:15 - Transformers learn to implement preconditioned gradient descent for in-context learning — https://arxiv.org/abs/2306.00297
23:25 - Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation — https://arxiv.org/abs/2305.01569
30:40 - GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization — https://arxiv.org/abs/2309.16020
37:00 - Hardware Resilience Properties of Text-Guided Image Classifiers — https://arxiv.org/abs/2311.14062

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (7 segments)

Bifurcations and loss jumps in RNN training https://arxiv.org/abs/2310.17561

Hi, hey — big fan, nice meeting you. I'm recording a bit, is that okay? Yeah, sure. Can you tell us about bifurcations in RNN training? Okay — now I'm getting nervous. As a general rundown: we were interested in finding out how RNN training works and why we sometimes have problems. The way we did that: first, we use ReLU-activated RNNs, and the reason is that they're analytically tractable — we actually have a closed-form solution for fixed points and k-cycles that we can calculate; it's just this equation. A bifurcation is a qualitative change in the dynamical behavior of your system, for example changing from fixed-point behavior to cyclic behavior.

So what kind of stuff are you training here? Do you somehow assume an infinite-length sequence of inputs — how do you talk about fixed points? For the benchmarks we use things like the Lorenz system, and here, for example, we have empirical data from membrane-voltage traces of cells — we got a recording and trained on that. What we found is that during training you get these big jumps in the loss, and we use our algorithm to analyze what kind of dynamics are going on. What we had here was a fixed point after the jump: before the jump we have a fairly well-trained system — not perfect, but close to what we actually want — and then suddenly, at this point, we go to basically infinity; past this point we have divergent behavior.

Can you tell us a bit about what the data is exactly — that's a time series? Exactly, it's a time series of a membrane voltage. We recorded it — well, not us, but the people we got the data from — they recorded the membrane voltage of a cell over time. And do you think the problem has to do with the spikes in the data, which might be discontinuities, or do you in general always see this kind of behavior when you train RNNs — these jumps in the loss caused by these types of bifurcations, even if I train an RNN on, I don't know, some sentences of language? I would say that in general what we observe is that it always looks something like this: you usually have those jumps in the loss — not always that big, that one is exceptional — but you almost always have this wiggly behavior in your loss; it's never completely smooth. And that's because your loss landscape is very crooked: at some points you have those jumps in the loss landscape.

What we wanted to do is calculate all those dynamical objects: find the fixed points, calculate the k-cycles. For those objects we can relate that to the bifurcation curves. For example, this is a 2D example with analytically calculated bifurcation curves: we know that this is the region where we have a stable 2-cycle, this would be a stable 3-cycle, and here we have multi-stability — both existing at the same time — all analytically calculated, which gives us ground truth. A 2-cycle and a 3-cycle are what exactly? You have the time series here: a 2-cycle would be the red one, where you just jump between two states, and a 3-cycle would be jumping between three states. And those states are weights — you jump from one weight configuration to another? No, you have one fixed weight configuration; this is a time series produced by your model, so the states jump. I see. And we try to reproduce that with our approach, which is semi-analytical: we calculate those dynamical objects on a grid, and we actually find all of those bifurcation curves. This was basically a validation for our code, to see that it actually works.

The problem with this analytical approach is that you have an equation for the fixed point in every single one of those linear regions given by your ReLU activation function, and in general you would need to solve it in every single region, which is going to be tedious if you have, say, a 100-dimensional RNN — it scales with two to the power of your dimension, the number of quadrants. So the idea we came up with is a heuristic: you initialize in one random quadrant and solve your equation there. If the solution lies in the same quadrant, you have a real fixed point — you just stay there. If the solution lies in another quadrant, it's a virtual fixed point: if you are here, you feel attracted to it, but as soon as you cross over, it doesn't exist anymore. So the idea is to reinitialize the calculation in the quadrant of that virtual fixed point. So wherever the solution goes, you reinitialize? Exactly, and it works surprisingly well. Here is a scaling example: the black line is what you get if you just randomly pick quadrants and calculate there, and these are how our algorithm scales — it depends a lot on the eigenvectors and on the matrix norms, but in general it always scales better than just randomly searching those quadrants.

Then we used that algorithm to actually calculate bifurcation curves, for example in this toy example. You see the loss gradient and the actual loss, and you see that all the jumps in the loss are actually aligned with bifurcation curves — which means that whenever you have a jump in the loss, it's probably caused by a bifurcation. But not every bifurcation causes a loss jump; it always depends. Coming from that toy example, we did it on the real data, and we find these types of bifurcations in that training as well.

The cool thing is that we can not only analyze the training process — we can also analyze a fully trained system. Our goal is not just good prediction; we do dynamical systems reconstruction, so we actually want the full dynamical system. Once training is done, we can calculate all the dynamical objects. What we have here, for example, is a stable 39-cycle, which describes the cycle you can see here. But that's not everything: we also found a stable fixed point that exists at the same time, plus an unstable 9-cycle and an unstable 39-cycle. The question now is: if we have a real proxy for the biological system, these should have a meaning — this fixed point would mean something. That's what we're going to work on in the future: relating these to actual biological phenomena.

Coming back to the training: what we found is that a form of teacher forcing — generalized teacher forcing, proposed at ICML this year — actually helps a lot in getting rid of those crazy jumps. This degenerate transcritical bifurcation that causes the jump is not possible if you choose your factor alpha correctly, so you smooth your loss landscape and make RNN training easier and much faster. So you found out why this method helps? Exactly, we found out why this actually works. Pretty cool, very nice. The fixed points that you found — you said they must be explained by something real; the 39-cycle would be the spikes. Do you have a clue what the 9-cycle would be? It might still just be an artifact from the training — this is something we have to figure out, and you have to find metrics to say: okay, is this now fully trained, do we actually have a proxy for the system? That's the work in progress — what we actually want to do now. Sometimes you just get artifacts during the training process, and sometimes it might have a meaning, but that's something we want to figure out now. Very cool, thank you so much. Thanks.
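The quadrant-hopping heuristic described above is easy to sketch. This is my own minimal reconstruction, not the authors' code, and it assumes a simplified update h ← W·relu(h) + b (the paper's piecewise-linear RNN also carries a linear self-connection term, omitted here):

```python
import numpy as np

def find_fixed_point(W, b, max_hops=100, rng=None):
    """Heuristic fixed-point search for a ReLU RNN h_{t+1} = W @ relu(h_t) + b.

    In each orthant (sign pattern of h), relu acts as a fixed 0/1 diagonal
    mask D, so the map is linear and the candidate fixed point solves
    (I - W D) h = b.  If the solution lies in the assumed orthant it is a
    real fixed point; otherwise it is a "virtual" fixed point, and we hop
    to the orthant the solution fell into and re-solve there.
    """
    rng = rng or np.random.default_rng(0)
    n = len(b)
    mask = rng.integers(0, 2, size=n).astype(float)  # random starting orthant
    for _ in range(max_hops):
        D = np.diag(mask)
        try:
            h = np.linalg.solve(np.eye(n) - W @ D, b)
        except np.linalg.LinAlgError:
            return None  # map is singular in this orthant
        new_mask = (h > 0).astype(float)
        if np.array_equal(new_mask, mask):
            return h  # real fixed point: solution lies in the assumed orthant
        mask = new_mask  # virtual fixed point: hop to its orthant
    return None  # no real fixed point found within the hop budget
```

Brute force would check all 2^n orthants; the hopping search typically visits only a few, which is the scaling advantage discussed in the interview.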

LEDITS++: Limitless Image Editing using Text-to-Image Models https://arxiv.org/abs/2311.16711

Thank you, man — nice to meet you. Let's focus on this one, tell me about it. (He's very cool, follow him on Twitter.) LEDITS++ is a technique that leverages semantic guidance, but for editing real images. While semantic guidance edits images in the latent space of Stable Diffusion, LEDITS++ edits real images. So we first do inversion: we take an image and invert it back into the latent space, but we do DPM-Solver++ inversion, which is an innovation of this paper — it hasn't been done before — and basically you can do inversion in as little as 50 steps, maybe even fewer. What kind of models do you invert? Basically any pre-trained text-to-image diffusion model; here we do Stable Diffusion 1.5, 2, or XL. Essentially it's the combination of the inversion and then applying the semantic guidance technique, so we can, for example, guide Yann LeCun to be George Clooney with sunglasses — two edits at once. And it's all inference-time: you don't need pre-training, you don't need further fine-tuning. Another innovation of LEDITS++ is that we do semantic grounding: these masks come for free from the Stable Diffusion U-Net — both the U-Net attention masks and the noise masks — and when you use them, you essentially ground your edits in the model itself.

What does using them mean — do I have to do something? Basically it means that here, for example, there's no masking and no segmentation model in the pipeline — you don't need to segment anything. You just type "cherry blossom" and the model, which already has the knowledge that there is a tree here, replaces the tree with the cherry blossom without you having to point to where the cherry blossom should go. And you also remove the yellow car: this is the original image — the car was yellow — so we remove "yellow car", add "green convertible", and add "cherry blossom", basically for free, without masking, without saying where. The model semantically finds the cherry blossom, removes the yellow car, and adds a green convertible — but you see it kept a lot of the structure, the background and so on.

Does it compute the mask after it did the edit, or does it use the mask to do the edit? You type — since there is no cherry blossom here — plus "cherry blossom", and it semantically finds that a cherry blossom is a tree, and therefore the edit goes where the tree is. And where do you start from? Reconstruction: when you want to edit a real image, you first have to find the noise from which you start to actually edit, and that's why you have to reconstruct it — "reconstruction" is maybe not the right word here, but anyway. So you start from this image, you invert it, and then when you do the text editing, as my colleague was saying, the word is grounded in the image: the model has knowledge of where "cherry blossom" is relevant in the image, and so it limits the editing to the relevant image region based on the model's own knowledge. That's quite cool, because we're really focusing on the relevant parts and get masks for free. This, for example, is what you get if you just do the inversion, reconstruction, and textual editing without the mask grounding: it does work — you add the cherry blossom and you change the car — but the car gets a bit of the cherry-blossom vibe, with the pink, and the car changes quite a lot. With the grounding — using the model's knowledge via the masks, an averaging of both the U-Net and the noise masks — it manages a much better edit. Nice, very cool.

And this is all for free, all inference-time. The only tricky thing is that you have to do the inversion first — you have to run the 50 steps of inversion — but once an image is inverted, once you have all the latents, you can reuse them: you don't need to invert every time, you do it once and then editing is essentially free. The inversion takes roughly a second, so it's still quite quick. And it's also reconstruction-error-free — it's perfect inversion, and fast inversion — so it's not really a downside, I think. Very cool. Excellent, thank you, enjoy the rest.
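To make the grounding idea concrete, here is a toy numpy sketch of multi-concept guidance with implicit masks. This is my own illustration of the spirit of the method, not the LEDITS++ implementation (which derives its masks from U-Net attention and the noise estimate inside a real diffusion pipeline); `guided_noise` and its thresholding rule are assumptions:

```python
import numpy as np

def guided_noise(eps_uncond, eps_edits, scales, mask_quantile=0.75):
    """Toy multi-concept guidance with implicit masking.

    Each edit concept contributes its guidance direction
    (eps_edit - eps_uncond), restricted to the regions where that
    direction is strongest -- a crude stand-in for the U-Net/noise-based
    masks described in the interview.
    """
    eps = eps_uncond.copy()
    for eps_edit, scale in zip(eps_edits, scales):
        direction = eps_edit - eps_uncond
        magnitude = np.abs(direction)
        thresh = np.quantile(magnitude, mask_quantile)
        mask = (magnitude >= thresh).astype(direction.dtype)  # implicit edit region
        eps += scale * mask * direction  # edit applied only inside the mask
    return eps
```

Outside the implicit mask the unconditional noise estimate is untouched, which is why the background and overall structure are preserved in the examples discussed above.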

Lexinvariant Language Models https://arxiv.org/abs/2305.16349

Quick question — is it okay if I record a bit? Feel free, I'll try my best. Very cool, thank you. So in this work we basically investigate: is it possible for a language model to be performant without any fixed token embedding? A fixed token embedding — which has been around since word2vec — is a mapping from a token to a fixed vector. If you don't have this, tokens no longer have a context-independent identity; a token only gets meaning from how it acts in context. Lexinvariance is basically this idea: these three phrases are equivalent because they only differ by a lexical mapping — the letter "e" here, you can see, maps to "a", because "e" appears here and here, and the same for "c". So that's lexinvariance. We show that this lexinvariant language model can converge to a standard language model in terms of next-token prediction, which is very surprising given this property, and we show empirically that it does the same thing.

We construct the lexinvariant language model by replacing the Transformer's fixed token embedding with random Gaussian embeddings. The "e" here receives the same random Gaussian vector at every place in the sequence, but across different sequences the same "e" will have different embeddings. Can you intuitively explain how the random Gaussians work? Basically, for the letter "e" here, I literally randomly sample a Gaussian vector as its embedding, and the same "e" there will have the same random Gaussian embedding; similarly "c" gets its own, and "k" gets its own. Oh, so the embedding is assigned at the first occurrence of that character, whatever you sample? Okay. But across different sequences, the same letter will have a different Gaussian embedding, so you force the model to have the lexinvariance property.

What do you predict, then — do you predict into a standard vocabulary? No, you still predict in the random Gaussian embeddings. But what if you don't have that particular letter yet? Oh, if you don't have it yet — good question — then you cannot really predict it; there's nothing you can do there. Does that mean you have to tokenize at the character level? No, you can use any tokenization; you just replace whatever token embedding you have. But if it's all random, will this be able to predict a token that hasn't appeared? No, that's the thing — it won't be able to. Right, and if your vocabulary is really large, beyond characters — yeah, that's why it needs a really large context with all the tokens present in the input; if you truncate the input, it doesn't work as well.

And is this perplexity on language modeling with an RNN? No, that part is a theoretical proof — it has nothing to do with RNNs; all the empirical results are with a decoder-only Transformer. Okay. And perplexity is usually measured on a log scale, so this difference here would actually be quite big, no? Yeah, this difference is real. So it derives everything it does just from how tokens relate to each other? Yeah, exactly — that's the key. We also show the model performs in-context deciphering: because it sees all substitution ciphers of a text as the same sequence, it implicitly performs this in-context deciphering, and you can recover the cipher key by just training a linear probe on top of the frozen model. Finally, we show that this model is quite a bit better at symbolic manipulation tasks. We have two: one is a lookup table, where you predict the next token by looking up what a token maps to, and the other is a permutation task, where you permute the sequence the same way as in the demonstration. We achieve a 4x improvement on these — though these tasks are constructed to favor you a little bit, okay. Very cool. So does it mean that a lot of language is just how things relate to each other, and not the absolute words? Yeah. Very interesting, very cool, thank you. Nice to meet you.
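The embedding scheme is simple enough to sketch. This is my reconstruction of the idea, not the authors' code; `lexinvariant_embed` and its signature are made up for illustration:

```python
import numpy as np

def lexinvariant_embed(token_ids, d_model=16, rng=None):
    """Per-sequence random Gaussian token embeddings.

    Every distinct token in a sequence gets a freshly sampled Gaussian
    vector at its first occurrence; repeats of the same token reuse that
    vector *within* the sequence, but a new sequence re-samples everything.
    The model can therefore only rely on co-occurrence structure, never on
    a fixed token identity.
    """
    rng = rng or np.random.default_rng()
    table = {}   # token id -> vector, valid for this sequence only
    rows = []
    for t in token_ids:
        if t not in table:
            table[t] = rng.normal(0.0, 1.0 / np.sqrt(d_model), size=d_model)
        rows.append(table[t])
    return np.stack(rows)
```

Calling this once per training sequence (rather than once at model initialization) is exactly what distinguishes the lexinvariant setup from a standard embedding table.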

Transformers learn to implement preconditioned gradient descent for in-context learning https://arxiv.org/abs/2306.00297

Preconditioned gradient descent for in-context learning — ah, that's very cool. So you figure that when in-context learning happens, what's actually happening is some sort of algorithm? Yeah, exactly. The idea is that if you want to solve least squares with a Transformer, and you train it on random data generated from least-squares problems, then as you increase the number of layers, the network, across its layers, implements some algorithm — which turns out to be preconditioned gradient descent. How do we find this? We characterize it by studying the structure of the objective used to optimize the parameters: you characterize what the optimal parameters are, plug them in, see what happens, and then you can interpret it as an algorithm. So how do you get a Transformer to even tackle least squares — what are the inputs and outputs? Sure: the inputs are (x_i, y_i) in-context examples of least squares — we encode one example per token — so all of this is in context, and then this is the next token you predict. Okay, and that's how you solve it — very cool. And then you train on a lot of these examples. I would expect the Transformer to just memorize a lot of stuff, but it seems like it's actually running an algorithm internally — that's very interesting. There is memorization, as you said — that's correct. The memorization is in the preconditioner: suppose your data has some covariance; then the optimal parameters you find depend on the sigma matrix — on the spectrum of the sigma matrix. So there is memorization: the algorithm adapts to the data and becomes preconditioned, but it's not fundamentally different. Okay — so the memorization accounts for the preconditioning, not for the existence of the algorithm. That's actually very cool. And how do you interpret it into an algorithm, once you have the preconditioning? Very good question — because we are updating the data representations, not weights, we don't directly have the weights of this least-squares problem. The way you connect the two is that you can prove that this y_{n+1} has different instances across the layers: it gets updated as you pass the data through. At the beginning this y was zero, then it becomes something in the middle, and something else at the end. So you track how it evolves over the layers, and that's the algorithm — it's equivalent to projecting the gradient step onto x_{n+1}. Okay, and this is how you show it theoretically? Theoretically, for a single layer we can say the global optimum is exactly this algorithm, and we can say what the preconditioning is. For multiple layers the result is weaker: we prove there is a stationary point that implements this — we couldn't say whether it's a global or local minimum, just a stationary point. And then you try it out in practice? We optimize it, and we found that even for multiple layers we converge to the same structure that we characterized — the same stationary point. Exactly. Very cool, awesome. Thanks for the explanations!
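The algorithm being attributed to the Transformer can be written down directly. A toy sketch — mine, not the paper's construction — where each "layer" corresponds to one preconditioned update; with the preconditioner P = (XᵀX)⁻¹ a single step from zero already lands on the least-squares solution:

```python
import numpy as np

def preconditioned_gd(X, y, P, steps=10, lr=0.1):
    """Preconditioned gradient descent on min_w 0.5 * ||X w - y||^2.

    Each iteration plays the role of one Transformer layer in the
    paper's interpretation: w <- w - lr * P @ grad, where P encodes
    what the trained network has "memorized" about the data covariance.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y)  # gradient of the least-squares loss
        w = w - lr * P @ grad
    return w
```

With P = I this is plain gradient descent; the paper's point is that the trained network's updates correspond to a P adapted to the spectrum of the data covariance.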

Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation https://arxiv.org/abs/2305.01569

Hello, nice to meet you — I've seen your videos. Is it okay if I record a bit? Of course, of course. I'd love to give you the speed run, if you'd like to hear it. I would love to. Amazing. So basically the motivation for this work was that we saw the very great potential of human preferences in the language domain — you obviously know it from Open Assistant, ChatGPT, etc. — and we wanted, selfishly, to do the same in text-to-image generation. But the problem was that there were no large publicly available datasets to do it with, so we decided to collect one and open-source it. We did that by creating a website that lets users generate images from text for free — that's actually a screenshot from the website. Users could write their own inventive prompts and generate images, and to see the next image they had to choose which image they prefer. This allowed us to collect a dataset with more than one million examples of human preferences in text-to-image, where each example contains a prompt, two images, and a preference for which image is preferred — or a tie if there is no hard preference. And if you look at this data, it's very different from what you'd see in MS COCO, for example: the images are much more diverse and interesting, and the prompts can be funny, scary — not boring like MS COCO's.

So the dataset is great, but we also wanted to show that we can actually utilize it. What we've done is train a scoring function that we call PickScore. It's a CLIP-based scoring function that is supposed to estimate the user's level of satisfaction with an image, given the prompt. So that's a reward model? Exactly — we wanted to train a reward model, to put it simply. The input is a prompt and an image, and the output is a score; to put it bluntly, we used the analogous objective to InstructGPT's. When we measured PickScore's ability to predict human preferences, we showed that, surprisingly, it's not only better than existing scoring functions — it's even better than human experts. People say: wait, that doesn't make sense — the data was created by humans, so how can it be superhuman? The core distinction is that the user who created the example has much more information than the crowd worker: the crowd worker only gets the prompt and the images as input; they don't know what the user had in mind, what they envisioned — and the model needs to pick up on that, just like the crowd worker would. So I think it shows an important insight — one you probably also saw with Open Assistant — that real users can provide much higher-quality data than crowd workers. Here the incentive for the users was free, very high-quality image generation, and our incentive was to open-source something for the community and push research forward.

Using the scoring function, we show we can do much better model evaluation than with FID, which is the standard metric for evaluating text-to-image models — though it shouldn't be, because it's a very bad metric that doesn't even consider the prompts. PickScore correlates better with human rankings in model evaluation than FID on MS COCO, and better than other scoring functions. This is a small difference, no? Look — this is a minus; FID shows negative correlation. But then I could still predict, just by taking the negative. Yeah, actually, I agree — but that's the fact: it's that negatively correlated, and I'm sure people won't start saying "oh, we've been reading it in the opposite direction." Isn't the D for distance, though — for Fréchet Inception Distance a small distance is good? Yes, but here we took the win rate between models, so we obviously accounted for the fact that smaller FID is supposed to be better. Okay, good, cool. So we suggest updating the standard protocol for evaluating text-to-image models: take prompts from Pick-a-Pic rather than MS COCO captions, and use PickScore rather than FID. Can you also use PickScore to evaluate, say, GANs? That's a very interesting question, and I'd love for follow-up work to see how well it generalizes — I don't have a definitive answer.

The next thing we tried was the simplest way to improve text-to-image generation with PickScore. Very naively, we just took the vanilla model, generated a bunch of images, scored them, and selected the best one. We can see both qualitatively and quantitatively that PickScore usually selects a better image — better than the original model's default, and better than an aesthetics predictor or CLIP. Finally, some context, because this conference happened several months after we published: since then, both the scoring function and the dataset have been pretty widely adopted. Two notable works that I really loved and wanted to highlight in my poster: one made PickScore a metric in a text-guided video editing benchmark, to measure video quality, and another very interesting work improved SDXL by extending DPO to diffusion models and training it on the Pick-a-Pic dataset. I see — and they were able to improve SDXL. I'm going to guess your dataset was generated with SDXL or some Stable Diffusion already? It was lots of different models — we wanted to make it more diverse. Cool, awesome. So this whole loop is ultimately a bit like reinforcement learning from human feedback, right? Definitely, yeah. And I hope people will continue to look at this stuff and continue to open-source. Awesome, thank you very much.
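The best-of-n trick described above is only a few lines given any prompt and image embeddings. A hypothetical sketch — the real PickScore is a fine-tuned CLIP model, not raw cosine similarity, and `pick_best` is my name, not theirs:

```python
import numpy as np

def pick_best(prompt_emb, image_embs):
    """Best-of-n selection with a CLIP-style scorer.

    Scores each candidate image by cosine similarity to the prompt
    embedding and returns the index of the best image plus all scores.
    In the paper's setup the scorer would be the trained reward model
    rather than plain cosine similarity.
    """
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    scores = unit(image_embs) @ unit(prompt_emb)  # one cosine score per image
    return int(np.argmax(scores)), scores
```

Generating n candidates and keeping the arg-max is the "very naive" improvement strategy the author describes; the reward model does all the work.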

GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization https://arxiv.org/abs/2309.16020

nice meeting you in person nice meeting you not even read the title so you have to introduce me so this paper is about image geolocalization so image GE localization means giving image we want to identify the GPS coordinates of it GPS coordinates means the latitude longitude of it right so suppose you are given a particular image you want to know which location this image has been taken from yeah the problem here is this image can be taken in the any world any any part of the whole world yeah so you have to match it throughout the world and identify the location of it what kind of data do you have available Street View okay yeah so that's so this is like geogas kind of yeah yeah so it's like a worldwide image right you want to do at the World level Global level so existing approaches they do it VI classification based approach so they divide the whole earth into grids and then they classify which grid the particular image belongs to but the problem is the grid center location is the GPS coordinate always they say that but if the image belongs deviates from that then localization cribs in and also number of classes depends on the how many classes you taking so we avoid those approaches we take a retrieval based approach and there are also some approaches which are image to image retrieval approach yeah but we don't do image to image because you need to have a gallery of the whole world whole earth which is not possible to have so we take image to GPS retri so what we do is we pass the image to image encoder clip based image encoder yeah and there a gallery of GPS coordinates we pass it to location encoder to get location features okay and then we do contrastive Lear yeah so now the challeng is these are 2D coordinates we want to get a high dimension representation right for that this is our location encoder so how to do that so let's say this is the GPS coordinates we first pass it to equal projection mhm so normal GPS coordinates traditional laong have inherent 
distortions some countries are over represented some countries are less represented near the polar regions and so on so forth so we do a better representation equal representation okay then after this again 2D only so then what we do is we pass it to random Foria features random for features is a signos ining technique so we take a matrix R sampled from some gion then multiply it and then have SS and cosin of but here if you take a particular Sigma it is capturing a particular type of frequencies so we want to capture frequencies spanning from both uh corser to fine grain because we want to perform well on both course I mean fine grain and also course and fine grein both this is F this course MH so we take a series of sigmas pass it to Dedicated MLPs combine them by simple addition operation that is our feature so you have a multifrequency like the all the frequencies in combined together it's like a random up projection but in different frequency or so now uh these are performance on these different data sets yeah and across different threshold metrics performance is better and also the good thing is if you have a text as a query you can replace this uh through a text replace the image encoder through text encoder is already pre-trained you don't have to train it again and this is also implicitly aligned so you can get a similarity scores for all the GPS coordinates you can get a similarity map kind of a thing yeah right so your if the text query is Desert let's say with a generic query it can so it is able to localize throughout the world yeah and also if it's a specific City or specific landmarks is Aizer and lastly the thing is the location encoder can be used for other tasks also okay if you have a task like classification you can also use our location features to combine the image features and get better performance very cool um would it be I'm just guessing but street view is sampled very differently throughout the world right so here you're kind of trying 
to get the area relative to its size, but could you also have a transformation so that wherever Street View pictures are more frequent you get a higher density of grid points? Would you expect that to help? That helps, actually. During gallery construction we use the training set's knowledge: the training data has more samples from regions where humans live and fewer from unimportant regions like oceans or deserts. We use that prior and randomly sample the training set to create our gallery, and that works better than randomly sampling over the whole Earth. Okay, very cool. And I wonder what you can use this for. Will this mostly work for things that are also represented in the images? For example, if you enter, I don't know, 'Christianity', would the Christian world light up, or is it only things that are visually represented in Street View images? It might be possible. What happens is that whatever features remain in the embedding space are the ones useful for geolocalization. So for the query 'Christianity', there are so many places that are Christian that it wouldn't be a discriminative feature, so maybe that dimension gets collapsed when passing through the linear layers. And do you have the same model for all of these granularities, or do you train different models? We have the same model for all granularities. And you just rely on the different frequency scales you have in there? Yeah. Very cool. Have you competed in GeoGuessr with this? We would like to, actually. You haven't tried? We haven't tried yet. There are a couple of these YouTubers, right, Rainbolt and so on. Yeah, exactly. You have to call them and be like, hey, we have a new best AI, and
they'd be pumped. Yeah, that would be cool to try. All right, thank you very much. Thank you, pleasure. Your videos and explanations are very good. Nice, thank you very much.
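The multi-scale random Fourier feature encoding described above can be sketched roughly as follows. This is an illustration, not the paper's implementation: GeoCLIP applies an equal-area projection first and feeds each frequency scale through its own MLP before summing, whereas here the per-scale features are simply concatenated, and all names are mine.

```python
import numpy as np

def rff_location_features(coords, sigmas=(2.0**0, 2.0**4, 2.0**8), dim=32, seed=0):
    """Encode 2D (projected) GPS coordinates with random Fourier features
    at several frequency scales, from coarse (small sigma) to fine-grained
    (large sigma), then concatenate the scales."""
    rng = np.random.default_rng(seed)
    feats = []
    for sigma in sigmas:
        # R ~ N(0, sigma^2): a larger sigma captures higher frequencies
        R = rng.normal(scale=sigma, size=(coords.shape[-1], dim // 2))
        proj = coords @ R  # random projection at this frequency scale
        feats.append(np.concatenate([np.sin(proj), np.cos(proj)], axis=-1))
    return np.concatenate(feats, axis=-1)
```

Each scale contributes `dim` features (sines plus cosines), so the output dimensionality is `len(sigmas) * dim`.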
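The image-to-GPS contrastive setup and the retrieval step can likewise be sketched as a minimal CLIP-style alignment. This assumes pre-computed image embeddings and location features; the function names and shapes are mine, not the paper's.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def symmetric_contrastive_loss(img_emb, loc_emb, temperature=0.07):
    """CLIP-style InfoNCE loss: matching image/location pairs sit on the
    diagonal of the similarity matrix and are pulled together."""
    logits = normalize(img_emb) @ normalize(loc_emb).T / temperature
    n = len(logits)
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()
    # symmetric: image-to-location and location-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def predict_gps(img_emb, gallery_coords, gallery_feats):
    """At test time, return the gallery GPS coordinate whose location
    feature is most similar to the image embedding.  The gallery is
    sampled from training-set coordinates, so densely photographed
    regions contribute more candidates."""
    sims = normalize(gallery_feats) @ normalize(img_emb)
    return gallery_coords[np.argmax(sims)]
```

Because CLIP's text encoder lives in the same embedding space, `img_emb` could in principle be swapped for a text embedding, which is the similarity-map trick mentioned in the conversation.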

Hardware Resilience Properties of Text-Guided Image Classifiers https://arxiv.org/abs/2311.14062

Is it okay if I record? Yeah, okay. What do my text-guided image classifiers have to do with hardware? So basically, what we're trying to do is use text features, generated from class descriptions, to initialize the last layer, instead of the random initialization you would normally use. We use those class or text descriptions to pre-train the last layer, and then you can plug in any architecture you have, ResNet, AlexNet, or your own classification architecture, and train it. What this does is: when you initialize the last layer from those text descriptions, it makes the model confident enough to actually be resilient when bit-flip errors or manufacturing defects happen in the hardware. Okay, so the original problem is a hardware defect? Exactly. A defect in what exactly, the image sensor? The hardware defect concerns deployment: models are trained and then deployed on hardware, and the hardware can be affected by voltage droops, manufacturing defects, and so on. When that happens, silent data corruptions occur in the model; the system doesn't crash, but the model makes wrong inferences instead of right ones. Okay, so if I imagine I train a model for my self-driving car, I deploy it in the car, the model gets corrupted, and then my car just kind of gets worse? Yeah, it might speed up when it shouldn't, or it might detect something incorrectly. Okay, so that's this part right here; this part might get worse. Exactly. And then this is your final model architecture; when this one gets corrupted, it will make wrong inferences instead of the right ones.
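To see why a single silent bit flip is so dangerous, here is a toy illustration of what one flipped bit does to an IEEE-754 float32 weight. This is not the paper's injection framework, just a sketch of the failure mode being discussed.

```python
import struct

def flip_bit(value, bit):
    """Flip one bit in the float32 representation of `value` and
    return the corrupted float."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (corrupted,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return corrupted

# Flipping a high exponent bit turns a small weight into an enormous one,
# which can silently swing a model's final decision.
w = 0.5
corrupted = flip_bit(w, 30)  # bit 30 is the most significant exponent bit
```

Flipping bit 30 of 0.5 yields roughly 1.7e38, while flipping a low mantissa bit barely changes the value, which is why where the flip lands matters so much.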
Yeah, and what exactly is your method doing to counter that? Basically, the backbone gives us all the features, and instead of initializing the last layer's values randomly, we give it a smart initial setup; when you then train, the model is more confident in whatever inferences it makes. Okay, so this is, let's say, an image classifier, and you initialize this here somehow to make it more robust. Exactly. What does this have to do with text input exactly? Okay, so let's say you have 10 classes. We use GPT-3: we ask GPT-3 multiple questions for each class, and GPT-3 gives us multiple descriptions. We take those descriptions and give them to CLIP to get embeddings, and since we now have multiple descriptions with embeddings per class, we average them. That gives us more robust information about each class, and after averaging we use it to initialize the projection layer. So you get some sort of class embedding out of all these texts about the different classes, from the CLIP text encoder. Exactly. That's very smart. But why do you think it helps? If I corrupt something here, you're saying this is now more robust; why do you think that is? Well, when a bit flip happens in an early layer or somewhere in between, there's a high probability it gets corrected along the way, since convolutional networks pass activations through ReLU and other activation functions. But if a bit flip happens in the final layer, it directly affects the final decision of the model. So the best way to increase resilience is to focus on
the most vulnerable layer, instead of the whole network. Okay. And sometimes people do checksums and things like this; I think GPUs already have some error correction built in. Doesn't that also help, or is what you're doing complementary to that? Well, the existing state of the art, parity checks or error-correcting codes, mainly covers storage; it doesn't cover the computational part. And, for example, Tesla introduces full modular hardware redundancy, but with that, the hardware uses more power, and because of the redundancy it incurs latency and takes more time to make a decision, and in that environment latency really matters. So instead of a power-hungry, redundant solution, what if we could train the model with hardware resiliency in mind? I see, very cool. And how do you test that? Do you actually trash your hardware, or do you just artificially corrupt some things? We have a framework called GoldenEye. GoldenEye simulates a realistic hardware environment, and we use it to inject errors into both the baseline and our proposed model. We then measure two metrics, delta loss and top-2 difference: delta loss gives us overall information about what happens when wrong and right inferences are made, and top-2 difference shows how confident the model is when making its classifications. Using those, we see how resilient the model is to the injections we did on it. Very cool, and is it more resilient? It is. Surprisingly, we achieve a 5.5x average resiliency improvement, with up to 14x improvement, and a minimum accuracy drop of 0.3%; that accuracy drop is mainly on noisy images that were already highly likely to be misclassified. Very cool. So you're saying that where these flips matter most is in images that are already kind of shaky, not sure? Exactly. What we're doing here is increasing the model's confidence in making those decisions. Okay, so it increases the model's confidence across the board, but it matters most for the ones that are on the edge, on the border. Yeah, that's very cool. Thanks a lot for the explanation. Thank you so much. Thank you, have a nice day. Thank you so much, take care.
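The last-layer initialization described in this conversation can be sketched roughly as follows. It assumes you already have CLIP text embeddings for several GPT-3 descriptions per class; the helper name is mine, and the row normalization is my illustrative choice, not necessarily the paper's.

```python
import numpy as np

def init_head_from_descriptions(class_desc_embs):
    """Build the final projection layer's weights from text embeddings:
    one row per class, obtained by averaging that class's description
    embeddings, then normalizing each row (illustrative choice)."""
    W = np.stack([np.mean(embs, axis=0) for embs in class_desc_embs])
    return W / np.linalg.norm(W, axis=1, keepdims=True)

# e.g. 3 classes, each with a few 512-d CLIP description embeddings:
rng = np.random.default_rng(0)
head = init_head_from_descriptions([rng.normal(size=(4, 512)) for _ in range(3)])
```

The resulting `head` replaces the random initialization of the last linear layer before normal training; the rest of the architecture (ResNet, AlexNet, or anything else) is unchanged.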
