NeurIPS 2023 Poster Session 1 (Tuesday Evening)


Yannic Kilcher · 14.12.2023 · 30,856 views · 667 likes


Video description
Papers:
- CamoPatch: An Evolutionary Strategy for Generating Camouflaged Adversarial Patches (https://openreview.net/forum?id=B94G0MXWQX)
- FeCAM: Exploiting the Heterogeneity of Class Distributions in Exemplar-Free Continual Learning (https://arxiv.org/abs/2309.14062)
- Causes and Effects of Unanticipated Numerical Deviations in Neural Network Inference Frameworks (https://openreview.net/forum?id=6zyFgr1b8Q)
- Conservative State Value Estimation for Offline Reinforcement Learning (https://arxiv.org/abs/2302.06884)

Links:
- Homepage: https://ykilcher.com
- Merch: https://ykilcher.com/merch
- YouTube: https://www.youtube.com/c/yannickilcher
- Twitter: https://twitter.com/ykilcher
- Discord: https://ykilcher.com/discord
- LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
- SubscribeStar: https://www.subscribestar.com/yannickilcher
- Patreon: https://www.patreon.com/yannickilcher
- Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
- Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
- Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
- Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (4 segments)

Segment 1 (00:00 - 05:00)

All right, this is the Tuesday evening poster session, and there are tons and tons of posters right here. Look, there are posters from all the way back there. We'll dive into them, see what's happening, try to get some people to talk to us, and get some impressions. Maybe we'll meet some people here and there. Let's go. Now, if you want to do a good poster, the best thing is definitely to get a big poster. Don't put a lot of text on it. The best posters have one sentence that's really big, then a few pictures, and then maybe a little bit of text. Not to criticize any of the scientific work here, but don't just put tiny text and tiny math on it. Put giant text on it and lots of pictures; then people will come, and you can still explain things. Doing a bit of reporting. Oh, okay. Yeah. I did work for a while in adversarial robustness and so on. What do you call camouflaged adversarial patches? What's your name? Sorry. Phoenix. Okay. So, camouflaged adversarial patches: it's almost like what it says on the tin, right? We are creating adversarial patches that camouflage themselves within the image, so it's difficult for a human to see where the corruption lies, let alone an AI detection model. What we see in a lot of the state-of-the-art works is that they don't really consider the visibility of the perturbation an issue, because they say they're only corrupting a small area. But a lot of defense mechanisms can detect that or just filter it out, because it can be seen or detected very easily. So we're saying something different: we're going to camouflage it within the image, so these AIs actually can't detect it, and our empirical studies show that. What do you then call camouflaging? Because usually the distortions are so small in terms of value that people argue you can't see them. Yeah. Right. So in adversarial patches, that's usually not the case.
Actually, the distortion is usually quite visible and clear. It's only the other types of adversarial examples, the ones that modify the entire image, that you might say cannot be seen. So in this sense we're only modifying a small portion, and even though we're only modifying a small portion, we cause only a small amount of modification and still break the AI, essentially. Okay. And so this sort of highlights the vulnerabilities that are still there, even in state-of-the-art models. How do you do that? Well, how do we do that? We follow an evolutionary algorithm approach to the attack process. Basically, we start, if you have a look up here, with a random patch. You can see it's quite obvious and quite clear. What we do is try to optimize the location of that patch, so find where the most vulnerable location is within that image. Once we find that, we say, okay, now we're going to try to minimize the visibility of that patch. So we optimize the characteristics of these semi-transparent shapes: their size, their color, their transparency, their location within the patch. And we keep optimizing all these characteristics until our budget has expired. In a real-world sense, it would be like you're calling the Google API and Google says, okay, you've used up all your credit; or we reach a threshold where we say, okay, this can't be seen. And how do you test whether it can be seen? We're just using an L2 condition. We had a look at previous works, and usually there's a certain L2 distance below which it's considered invisible. But what we do is just exhaust the budget and ask: how camouflaged can we get it? Yeah. And that's what we do.
Then, if you have a look at the future directions: this is all on a digital level, so some may argue, is this really realistic? But I like to think of it as a stepping stone: it works in this case, so now let's look at the physical sense. This is a bit of work we'll do in the future, but the preliminary results are showing that we can do the same thing. We can prompt the algorithm to make noise within the mask of the human, in this case just their clothing, and still avoid the AI detection. This here is the original image, and this is our initialization, where we completely cover the person with the adversarial patch, and then that is after our attack, after our optimization. Oh, that's pretty cool. Hey, I'm doing a bit of recording. Okay. Uh, is it okay if I record? Yeah. Yeah, absolutely. Okay. Do you know about class-incremental learning?
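The attack described above, a query-limited evolutionary search over semi-transparent shapes, can be sketched roughly as follows. This is an illustrative toy, not the authors' CamoPatch implementation: the circle parameterization, the fixed patch location, the (1+1) mutation scheme, and the visibility weight are all simplifying assumptions, and `loss_fn` stands in for whatever black-box model score the attacker can query.

```python
import numpy as np

def apply_circles(image, params, patch_xy, patch_size):
    """Render semi-transparent circles into a square patch region of `image`.
    Each row of `params` is (cx, cy, radius, r, g, b, alpha), all in [0, 1]."""
    out = image.copy()
    x0, y0 = patch_xy
    ys, xs = np.mgrid[0:patch_size, 0:patch_size]
    for cx, cy, rad, r, g, b, a in params:
        mask = (xs - cx * patch_size) ** 2 + (ys - cy * patch_size) ** 2 \
               <= (rad * patch_size) ** 2
        region = out[y0:y0 + patch_size, x0:x0 + patch_size]
        # alpha-blend the circle color into the patch region
        region[mask] = (1 - a) * region[mask] + a * np.array([r, g, b])
    return out

def evolve(image, loss_fn, patch_size=16, n_circles=8, queries=200,
           sigma=0.1, vis_weight=0.01, rng=None):
    """(1+1) evolution strategy: mutate the circle parameters and keep the
    mutant if it improves attack loss plus an L2 visibility penalty."""
    rng = rng or np.random.default_rng(0)
    h, w, _ = image.shape
    xy = (int(rng.integers(0, w - patch_size)), int(rng.integers(0, h - patch_size)))
    best = rng.random((n_circles, 7))

    def objective(p):
        adv = apply_circles(image, p, xy, patch_size)
        return loss_fn(adv) + vis_weight * np.linalg.norm(adv - image)

    best_score = objective(best)
    for _ in range(queries):
        cand = np.clip(best + sigma * rng.standard_normal(best.shape), 0, 1)
        score = objective(cand)
        if score < best_score:
            best, best_score = cand, score
    return apply_circles(image, best, xy, patch_size)
```

In the paper's setting the patch location is optimized as well; here it is sampled once to keep the sketch short.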

Segment 2 (05:00 - 10:00)

No, not at all. Okay. So basically, this is a setting where we have old classes and we slowly add new classes, and the objective is to learn the new classes without forgetting the old classes. We are trying to solve exemplar-free continual learning. Exemplar-free means we are not allowed to store training samples from the old classes. So at any time you have access to samples from the new classes only, and the objective is to learn the new classes and not forget the old ones. What is a realistic scenario for this? Because I can always just add a hard disk, you know, and store my own samples. So the realistic scenario is: say you have a model trained in industry on a lot of your customer's data, and the customer wants to add some new classes and gives you some new data. The problem is that you cannot go to the customer and say, I need all of your old data so I can retrain the model on old and new data together. So you need to add new knowledge to the model without forgetting the old knowledge. This is very relevant in industry as well as academia. Okay, cool. How do you tackle this? So basically, we try to use the embedding feature space. With deep neural networks, we learn very good representations of the classes because of the nonlinear activations. We illustrate that for all old and new classes we have very good spherical representations, and we can use the Euclidean distance effectively in this feature space. But here is what happens in continual learning: this is a high-plasticity setting.
When we try to learn the new classes very well, we lose the feature representations of the old classes. And in a high-stability setting, when we try not to forget the old classes, we are unable to learn the new classes well. This is the stability-plasticity trade-off, which is commonly studied in continual learning. What we are saying is: instead of using the Euclidean distance, which is commonly used in the feature space, we use the Mahalanobis distance. Why is the Mahalanobis distance better? Because it captures the distribution of the data in the feature space. So you assume it's a Gaussian? Yeah, we assume it's a Gaussian. Okay. So we take the covariance matrix for each class and compute the Mahalanobis distance. Can you say anything about the fit? Like, how well does a Gaussian actually fit these classes? Do you have some quantifiable number that tells you whether that's even appropriate? Yeah, so basically I don't have a quantifiable number, but it's quite common to assume Gaussianity. Mhm.
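The distance-based classification just described can be sketched as follows. This is an illustrative nearest-class-mean variant with per-class covariance, not the authors' FeCAM code; the shrinkage constant added to the covariance (so it stays invertible with few samples) is an assumption.

```python
import numpy as np

def fit_class_stats(features, labels):
    """Per-class mean and covariance of (frozen-backbone) feature vectors."""
    stats = {}
    for c in np.unique(labels):
        f = features[labels == c]
        stats[int(c)] = (f.mean(axis=0), np.cov(f, rowvar=False))
    return stats

def mahalanobis_predict(x, stats, shrink=1e-2):
    """Classify `x` by the smallest squared Mahalanobis distance.
    `shrink` adds a scaled identity so the covariance is full rank."""
    best_c, best_d = None, np.inf
    for c, (mu, cov) in stats.items():
        cov = cov + shrink * np.eye(cov.shape[0])
        diff = x - mu
        d = diff @ np.linalg.inv(cov) @ diff
        if d < best_d:
            best_c, best_d = c, d
    return best_c
```

Compared to the Euclidean distance, this weights each feature dimension by how much that class actually varies along it, which is the heterogeneity the paper exploits.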
But as it turns out, it's not so easy to use the Mahalanobis distance in the feature space, particularly because we have to invert the covariance matrix. For a lot of classes we do not have many training samples; say, 500 samples per class in CIFAR-100. We cannot get a full-rank covariance matrix, because for ResNets we have a 512-dimensional feature space. So we do an approximation here to get a full-rank matrix and be able to perform the inversion. We use Tukey's transformation to make the features more Gaussian, and we normalize the covariance matrices from the different classes so that the distances are more comparable. So this is what we do: we train the model on the old classes, then we freeze the model, and then we perform inference for all the new classes. For everyone watching: we're post-narrating this because the sound was kind of terrible at the conference. So Nora is here with me. Nora, what is the paper about that you're showing us? Maybe in a nutshell: we find that the inference results you get might depend on the hardware that you're actually using. Inference of what? Machine learning inference generally. Right. Okay. I see. So, I spotted a piece of the poster here, and I was wondering, you mentioned this tiny thing at the back, like the seventh decimal point. Why would I care that this is different? Yeah, that's a good question. Well, you might not care if you're a regular everyday user. But if you're in forensics, for example, then you might care about it. And one interesting thing you can see here in the second plot is that we were able to craft samples such that they will be classified as one label on one machine and as a different label on a different machine.
So basically you just moved them outside the boundary, or between the two decision boundaries. Yeah. So these are specially crafted examples you have right here? Yeah, exactly. Okay. And why do we even have these deviations? So there are different reasons for that. On CPUs, for example, we have floating-point operations, which are not associative, right? So if you calculate a plus b and then plus c, you might get a different result than calculating b plus c and then a. And that

Segment 3 (10:00 - 15:00)

is relevant whenever you change the aggregation order, and the aggregation order is changed on CPUs, for example, if you're using SIMD instructions or multiple cores. In this study, we did it at a large scale: we used 75 different platforms in total and recorded the results. This is actually here, the bottom plot, yes, exactly: for the 75 different platforms, we get a total of 26 distinct outputs. For, I think, the 64 different CPUs that we used, we got, I'm not sure, 16 different results. Then on GPUs there's a different cause. Usually GPUs are accessed through accelerator libraries, and these support multiple different convolution algorithms. What they do is run micro-benchmarks for every layer to select the convolution algorithm that is fastest at runtime. So there you also get different results between different GPUs. And the thing is, if you have two convolution algorithms that are approximately the same speed, then it might depend on random factors which one is actually used. Yeah. So this is how we get the result in the bottom plot, which is for a single GPU: we get different results even for the exact same GPU, just across multiple sessions. I see. And aren't GPUs inherently somewhat random in their computations? I've heard at some point that even for the same GPU and the same computation, independent of this, don't they have error correction built in that you can turn off? Um, yes. Yeah, exactly. And for these specific questions, I'd refer you to the main author, my colleague Alexander; he'd be happy to answer all your questions. And then, because we talked about this:
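The non-associativity the speaker describes is easy to demonstrate in a few lines. The first example is the classic double-precision case; the second mimics what a changed reduction order (SIMD lanes, multiple cores) does to a long sum. The exact totals are machine-independent here only because Python uses plain sequential IEEE 754 doubles; on real hardware the grouping itself varies.

```python
# Floating-point addition is not associative: the grouping changes the result.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
print(left == right)       # False

# Summing the same numbers in two different orders: the totals are very
# close, but generally not bit-identical, just like reordered reductions
# across SIMD lanes or cores.
import random
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
fwd = 0.0
for v in xs:
    fwd += v
rev = 0.0
for v in reversed(xs):
    rev += v
print(abs(fwd - rev))  # tiny, but often nonzero
```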
So maybe I don't care about this last-digit error, but if the last-digit error happens in an early layer, that might result in quite a big error downstream. And you said you discovered that this actually does propagate across layers and amplifies? Yeah, we didn't really see that, but we also didn't look for it. But yes, you're right: it might happen in earlier layers, and then the error might propagate, and then you might get higher confidences. Yeah. Okay. And what's the future of this work? What's next? That's a good question. Well, Alex actually wants to continue this with posits; he has some ideas on that. And you don't? I don't. Was it tough? Like, you had to actually go to the different hardware and do all of this. Yeah, it was tough, especially for GPUs, because it's all proprietary, so it's really hard to find out about it. What we did is use the TensorFlow profiler to record which convolution algorithms are used. It was quite a lot of work, but it was fun. Okay, very cool. So what's the lesson people should take away here, the most practical thing people can take away from this? Maybe: be aware that, contrary to what we expect, inference is not deterministic. Yeah, exactly. Okay, awesome. Cool. Thank you very much. You're welcome. What's your name, by the way? Leing. Hiting. Cool. So in this paper we try to conservatively estimate the value function of states. Correct. So the value function is the classic value function of reinforcement learning? Okay. Yeah. We try to learn a conservative V function by directly imposing a penalty, and here's how we do that. This is the first equation.
So actually, the first one is just a standard Bellman update, and then we have the second one and the third one, which are derived from it. For the argmin here, we are trying to minimize the state value under this distribution. Which distribution is it? So this d_s can be any distribution.

Segment 4 (15:00 - 19:00)

Okay. And it's just a notation here. But for d^mu here: d^mu refers to the distribution of the dataset, the offline dataset. With the minus sign here, we are actually trying to maximize the value function of the states in the dataset. So these two equations are the same; this one is just the analytical solution of that equation. Okay. So you have an analytical solution for a conservative estimate of the state values? Yeah, exactly. And we use the analytical solution for the proof here. For this theorem, we show that the V function we estimate is actually a lower bound of the true V function. We have this lower bound, and then we come to our methodology. In this methodology, the update is just equation one; we still use it to learn a V here. This is just the practical implementation. And for this Q here, we introduce a new network for Q, because if you only have V, it's really hard to derive a policy. Yeah. So we introduce a Q here, and its update target is the conservative V. And once we have a V and a Q, we can do AWR. The first term just refers to a classic method called advantage-weighted regression: you use the advantage to reweight the actions, so actions with a larger advantage get more weight. That's the first term. But I want to highlight the second term, which we call the bonus term. In offline RL, AWR tries to constrain the actions to the support of the dataset. However, we're thinking: can we go one step outside the dataset? So this is the second term. You can see we have a min here; the min with the minus sign is just an argmax, and r plus gamma V is just Q.
So we are actually maximizing some Q outside of the dataset, and this is called a bonus because it goes a little against the usual intuition for the offline setting: in the offline setting we try to learn something conservative, but this term adds a little bit of exploration as a bonus. What exactly would you get a bonus for? You get a bonus for maximizing Q. Okay, you are maximizing Q. So the policy gets a little bit of bonus for its performance. And how do you select this lambda? In the offline setting it's really hard to select a hyperparameter, because you cannot do online interaction. So how do we set the lambdas? We use the first-term loss to select lambda. This here is just different lambdas and their corresponding scores. For example, take the first one: if we increase lambda a lot, you can see the first-term loss becomes really high, and the policy score, which is just the return of the policy, is really low. This means that, in the sense of AWR, the policy is unsafe. So we select a lambda in a range like 0.0 to 1.0, because those lambdas are safer. That's how we select lambda; it differs across environments, but we can always select it using the loss. The two figures are just performance comparisons on several environments, and the result is that with the conservative estimation you get a better estimate. Yeah, your performance is better. Yeah. Look, this is pretty cool. Okay.
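The core idea the speaker describes, fitting Bellman targets on dataset states while pushing the value of out-of-distribution states down, can be sketched in a toy tabular form. This is not the paper's objective (which uses function approximation, an analytical lower-bound result, and the AWR policy step); the penalty weight `alpha` and the explicit list of OOD states are assumptions made for illustration.

```python
import numpy as np

def conservative_v_update(V, batch, ood_states, alpha=0.5, gamma=0.99, lr=0.1):
    """One pass of a toy conservative state-value update.

    `batch` holds dataset transitions (s, r, s_next, done); their values are
    regressed toward Bellman targets. `ood_states` are states outside the
    dataset; their values are pushed down by a penalty of size `alpha`."""
    for s, r, s_next, done in batch:
        target = r + gamma * V[s_next] * (1 - done)
        V[s] += lr * (target - V[s])   # standard Bellman regression on data
    for s in ood_states:
        V[s] -= lr * alpha             # conservative penalty off-dataset
    return V
```

Repeating this update drives dataset states toward their true returns while OOD states sink below them, which is the "lower bound outside the data" behavior the analysis formalizes.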
