# Parameter Prediction for Unseen Deep Architectures (w/ First Author Boris Knyazev)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=3HUK2UWzlFA
- **Date:** 24.11.2021
- **Duration:** 48:07
- **Views:** 15,696
- **Source:** https://ekstraktznaniy.ru/video/12640

## Description

#deeplearning #neuralarchitecturesearch #metalearning

Deep Neural Networks are usually trained from a given parameter initialization using SGD until convergence at a local optimum. This paper goes a different route: Given a novel network architecture for a known dataset, can we predict the final network parameters without ever training them? The authors build a Graph-Hypernetwork and train on a novel dataset of various DNN-architectures to predict high-performing weights. The results show that not only can the GHN predict weights with non-trivial performance, but it can also generalize beyond the distribution of training architectures to predict weights for networks that are much larger, deeper, or wider than ever seen in training.

OUTLINE:
0:00 - Intro & Overview
6:20 - DeepNets-1M Dataset
13:25 - How to train the Hypernetwork
17:30 - Recap on Graph Neural Networks
23:40 - Message Passing mirrors forward and backward propagation
25:20 - How to deal with different output shapes
28:45 - Differentiable Normalization
30:20 - Virtual Residual Edges
34:40 - Meta-Batching
37:00 - Experimental Results
42:00 - Fine-Tuning experiments

## Transcript

### Intro & Overview [0:00]

**Yannic:** Hi everyone, welcome to another video. Today we're looking at "Parameter Prediction for Unseen Deep Architectures", and we have a special guest: Boris Knyazev, the first author of this paper. This is a first for Boris and for myself, reviewing a paper in a bit of an interview style. The plan is to go through the paper together. There has also been some public reception, because, as you might have heard, the paper claims to be able to predict the parameters of a neural network without you having to train it. At least that's the overhyped version, which leads people to say "wait a minute, that can't be true", and we'll go exactly through this and look at what they've done. Boris, welcome to the channel.

**Boris:** Thank you for a great introduction. I'm very excited to be here, and ready to take any critique from you.

**Yannic:** So how did this come to be? You're at the University of Guelph, and I see the Vector Institute, Facebook AI Research, and the GitHub repo is under facebookresearch.

**Boris:** This project started when I was an intern at Facebook AI in summer 2020, more than a year ago, and our collaborators are from Facebook AI, now Meta AI. That's why we decided to keep the code under facebookresearch.

**Yannic:** Cool, excellent. Let's dive in. Essentially, what we've said so far: you have some kind of neural network with a bunch of layers and computational nodes, and you have the weights in between, say weight matrices W1, W2, but not only that, you also have normalization and all kinds of other things. Usually we have some dataset X and some result Y, and we train with backpropagation to find the best parameters. In your case, you went ahead and built this graph hypernetwork that takes in, if I remember correctly, the data and the architecture, the structure of the weight matrices. All of this goes into a graph neural network, and out come the weight matrices. And you're able to do this without ever training the weight matrices; you just predict them.

**Boris:** One correction here: the hypernetwork doesn't take data as input. It's trained on a specific dataset, say CIFAR-10 or ImageNet, but at test time it only takes a network as input. That's why it cannot generalize to other datasets.

**Yannic:** OK, so the experiments I see here are on CIFAR-10 and on ImageNet. These are two different hypernetworks: you train one for CIFAR-10 and another one for ImageNet?

**Boris:** In fact I trained many networks, but it's not one network that predicts the parameters for any dataset. We release one network for CIFAR-10 and one for ImageNet.

**Yannic:** OK. Here you say that by leveraging advances in graph neural networks, you propose a hypernetwork that can predict performant parameters in a single forward pass. What does "single forward pass" refer to? That I feed the architecture one single time through the graph neural network?

**Boris:** Yes. This phrase highlights the difference from, say, recurrent meta-optimizers, which can also do something similar to our work but require many iterations. In our case it's a single propagation through the graph.

**Yannic:** And then you get these parameters out, which is pretty cool. And you say that on CIFAR-10 you reach 60% accuracy and on ImageNet you reach 50% top-5 accuracy. These are respectable numbers, better than random, but way below anywhere near what I could get by actually training a network. Was this your intention, or is it still surprising that you get numbers this good?

**Boris:** It's still very surprising to me, to the other co-authors, and to many other people, I guess, because when you have a novel network, the assumption is that you cannot predict parameters for it; if you predict them, they will be like garbage neurons, because there are complex interactions between neurons. For a novel network it's very hard. That's the assumption.

**Yannic:** Of course, it makes sense. In a way, the numbers aren't good in absolute terms, but they are certainly good for never having trained. But there is a bit of a caveat, because the hypernetwork has been trained on that specific dataset, and maybe we'll go a little bit into what exactly you trained this on.

### DeepNets-1M Dataset [6:20]

**Yannic:** You introduce a new dataset, this DeepNets-1M dataset. Could you tell us a little bit about it? This is essentially the basis for learning the hypernetwork.

**Boris:** Yes, it's a dataset of training and evaluation architectures. It's called DeepNets-1M because we have one million training architectures. We predefined them and saved them so that people can reproduce training. There is some misconception that we also have trained weights for those training networks, but no, we don't; we didn't train one million architectures. And the architectures are almost random, in the sense that the operations and the connectivity between them are constructed by uniformly sampling from a specific space of architectures.

**Yannic:** So you define a design space, as you call it, and this design space consists of things like: you can have a convolution, or a linear layer, or an attention layer; that's followed by either a batch norm, or a weight norm, or no normalization at all; and so on. You build these combinatorial things, so one architecture would be a convolution with a weight normalization and with something else. The design space also includes the parameters for these operations: for a convolution the kernel could be, say, five, three, or one on a side, so I can have a five-by-five convolution that is maybe only depthwise and not a full convolution, and so on. There are all these nested Cartesian products of this big space that you define, and then essentially you fix a random seed and sample about a million times. Would that be a fair characterization?

**Boris:** Yes, that's fair.

**Yannic:** So with a fixed random seed you sample a million times, and everyone has the same networks to train on.

**Boris:** That's a fair statement.

**Yannic:** There were some datasets like this before, for neural architecture search specifically, but you say you've extended the design space a little bit. The design space was already large enough to include modern networks, but you've extended it even a bit more, right?

**Boris:** Right. Usually those neural architecture search works have a quite constrained design space, because they mainly consider very efficient networks like EfficientNet, SqueezeNet, or MobileNet. ResNet is out of their design space, because ResNet is considered a waste of resources in the NAS community. But in our work we are still interested in predicting these large sets of parameters.

**Yannic:** Let's assume you had actually trained all the weights for the million architectures and trained your hypernetwork to predict those weights, and I sample a new one. Someone skeptical might say: well, you've probably seen a very similar network during training, so you just memorized the weights from that. There are two differences here: as you said, you don't actually have the trained weights of these million architectures, which I want to come back to in a second, but you also have these out-of-distribution samples. Do you want to comment on what the out-of-distribution architectures in this dataset look like?

**Boris:** First let me say what is in-distribution, to highlight the difference. The in-distribution test set uses the same generator to sample architectures as the training set. And while the architectures are still all different, they can, as you said, be quite similar, and we actually measure that in the appendix; we have some data for that. That's one of the reasons we designed those out-of-distribution splits. The motivation was to test particular distribution shifts: for example, what happens if the networks become wider, with more channels, like Wide ResNet instead of ResNet; or what happens if we want to predict the parameters of a deeper network, say ResNet-150 instead of ResNet-50.

**Yannic:** So there are these subcategories: wide and deep, which are wider or deeper than anything seen during training, and there is also this batch-norm-free category. There are various variations you didn't necessarily see during training. But I think it's fair to say that the performance of your method still comes from the fact that the network has been trained on certain things; it's just a matter of how much it generalizes to new architectures.

**Boris:** Yes, for sure. It was trained on all the operations that are used to compose the out-of-distribution networks, but it wasn't trained on those particular configurations, those compositions.

**Yannic:** If we jump to the results for just a second: how different are the weights? Do you know what happens if you simply copy over the weights from the most similar network in the training set? Does that work at all? Have you done any dumb baselines to compare?

**Boris:** I tried, but it turned out to be more difficult than it seems. You need to come up with many different heuristics for how to copy weights if the dimensionality doesn't match, or if the layers are not exactly the same. There are a lot of those cases, and it basically becomes a separate research project to develop this dumb baseline, so we didn't go into detail with that.
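To make the dataset construction concrete, here is a minimal, hypothetical sketch of sampling architectures from a combinatorial design space with a fixed seed, in the spirit of DeepNets-1M. The design-space entries, layer counts, and function names below are illustrative assumptions, not the paper's actual generator.

```python
import random

# Hypothetical design space: operations, normalizations, and kernel sizes
# combined as nested Cartesian products, as discussed above.
DESIGN_SPACE = {
    "op": ["conv", "depthwise_conv", "linear", "attention"],
    "norm": ["batch_norm", "weight_norm", "none"],
    "kernel": [1, 3, 5],
}

def sample_architecture(rng, num_layers=4):
    """Uniformly sample one architecture as a list of layer specs."""
    return [
        {key: rng.choice(values) for key, values in DESIGN_SPACE.items()}
        for _ in range(num_layers)
    ]

def sample_dataset(seed=0, num_architectures=5):
    """Fixing the seed makes the sampled set reproducible for everyone."""
    rng = random.Random(seed)
    return [sample_architecture(rng) for _ in range(num_architectures)]

archs = sample_dataset(seed=0)
```

Because the generator is seeded, two people running `sample_dataset(seed=0)` get the identical list of architectures, which is the reproducibility property the dataset relies on (at a scale of one million rather than five).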

### How to train the Hypernetwork [13:25]

**Yannic:** So this, I guess, is the training loss. What's special about it is that, as you said, you don't actually have the fully trained weights of all of these networks; essentially you backpropagate through training these networks, if I understand correctly. What you have here is a double sum over N and M, where one of them is the number of tasks. What is a task here?

**Boris:** We use the terminology from meta-learning, and here the task is a network.

**Yannic:** So M is the number of training architectures, and N, I presume, is the dataset?

**Boris:** Yes, it's the number of samples in the dataset.

**Yannic:** So we take one data point x_j, and the a here is the architecture, sampled as one of the M architectures. We take the x, we take the a, and this here is the network that you actually want to train. As you said, it does not get the training data point; it simply gets the architecture, and it has a set of parameters, which are ultimately the parameters we want to optimize. And the f here, I guess, is your way of saying: take this network, predict the weights, pass the data point through it, and get the output.

**Boris:** Exactly, that's a fair characterization: a forward pass of images through the predicted parameters to get the predictions.

**Yannic:** So if I were to program this, f would call your hypernetwork with the architecture, get back the weights, put those into a, and pass the data point through once.

**Boris:** Yes, exactly.

**Yannic:** And then we simply compare to the label, which we have, and this loss is cross-entropy, or whatever is appropriate for the dataset.

**Boris:** Yes. You can basically reduce this equation to equation 1 if you freeze the architecture: if M equals one, and instead of a hypernetwork you have fixed weights w, then it's the same objective and the same loss.

**Yannic:** And then you learn by backpropagating, if I see this correctly. Usually we forward-pass x through the network and then backpropagate to its weights, but here you simply continue backpropagating through the weight-generating function into the hypernetwork. All of this is differentiable, I guess: the weights are floating-point numbers, and everything the graph network does is differentiable, so you can backpropagate through to the parameters of the graph neural network itself; every part of it has weights, and you can backpropagate through that. That seems reasonable enough. Oh, and this connection here, that's not happening? No data goes to the graph network?

**Boris:** For now, yes.

**Yannic:** Cool. This seems pretty straightforward. So now maybe we talk about what exactly the graph neural network is getting as features.
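The objective just discussed can be illustrated with a deliberately tiny toy: a scalar "hypernetwork" H_θ(a) = θ·a predicts the single weight of a scalar "network" f(x) = w·x, and the squared loss is backpropagated through the predicted weight into θ, never optimizing w directly. Everything here (the scalar forms, the learning rate, the data) is a hypothetical stand-in for the paper's equation 2, shown only to make the double sum and the chain rule through the hypernetwork explicit.

```python
# Toy scalar version of L = 1/(NM) * sum_m sum_n loss(f(x_n; a_m, H(a_m)), y_n).
# H maps an architecture code a to a weight w = theta * a; the "network"
# computes f(x) = w * x; the squared loss is backpropagated through the
# predicted weight w into theta.

def train_step(theta, archs, data, lr=0.05):
    grad, n_terms = 0.0, 0
    for a in archs:                      # sum over M architectures
        w = theta * a                    # H_theta(a): predict the weight
        for x, y in data:                # sum over N data points
            err = w * x - y              # loss = (f(x) - y)^2
            # d loss / d theta = 2 * err * x * a  (chain rule through w)
            grad += 2 * err * x * a
            n_terms += 1
    return theta - lr * grad / n_terms   # gradient step on theta only

theta = 0.0
archs = [1.0, 2.0]                # stand-ins for two sampled architectures
data = [(1.0, 3.0), (2.0, 6.0)]   # y = 3x, so the ideal per-network w is 3
for _ in range(200):
    theta = train_step(theta, archs, data)
```

A single θ cannot make both predicted weights equal 3 (w = θ for a = 1 and w = 2θ for a = 2), so gradient descent settles on the compromise θ = 1.8 that minimizes the averaged loss, mirroring how one hypernetwork must trade off across many training architectures.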

### Recap on Graph Neural Networks [17:30]

**Yannic:** When we talk about graph neural networks, there are many flavors, but I'll try to characterize them briefly. We have nodes in the graph, and each node initially gets a vector of features. In our case the nodes each refer to different operations, different modules: this here could be the convolutional layer in the first layer, this could be the batch norm that follows it, this could be the convolution in the second layer, and so on. So we connect the graph neural network in the same way that the architecture is connected, which means the graph neural network changes from architecture to architecture?

**Boris:** No, the graph neural network is fixed. The graph neural network itself doesn't have nodes; it has weights, say theta, and these theta are basically matrices with a number of input features and a number of output features. Those weights are fixed. What changes is the input, which is represented as a graph.

**Yannic:** I see. So this here we should rather characterize as the input, and that goes into, let's say, a standard neural network with a bunch of layers, but the input is essentially what you call A, an adjacency matrix. So this graph is described by an adjacency matrix and, I don't remember exactly what you called it, but let's say F, the features of each of the nodes. These go into the neural network, and out come the different weights for the graph. The way these graph neural networks work is that each node starts with a vector of features, and then you apply these functions; every layer here corresponds to one message-propagation step, if I understand correctly, where all of the neighbors pass messages to each other through differentiable functions. If we consider this node, it receives messages from all its neighbors, computes some hidden state, and in the next iteration passes that hidden state on to its neighbors. That's the basic functionality. Now you, in your particular case, opted for a somewhat more advanced architecture that mirrors the propagation in a neural network. Can you talk a little bit about that?

**Boris:** We're actually doing almost the same as the previous work on graph hypernetworks, so I want to clarify that: the training objective, equation 2, and the graph hypernetwork architecture are almost the same as in the previous work. But they didn't release open-source code, so we had to reinvent some things. Sorry, what was the question?

**Yannic:** Maybe before that, for people who may not know graph neural networks: it seems like there's a lot going on, but a graph neural network essentially boils down to just a few functions. What I've described, "I receive the hidden states from all my neighbors and I integrate them", is in fact the same function, with the same weights, at this node as at the node over here, which also receives messages from all its neighbors and integrates them. It's just that the inputs differ, because the node in the middle has different neighbors than the node over here. But the function that takes messages and integrates them is the same for all nodes, and that's why graph neural networks, often surprisingly, can have very few parameters and still achieve a lot. And then all these steps, I think you've implemented them as a recurrent neural network: we do multiple rounds of these steps, and the nodes compute updates over multiple steps. You could implement this as separate functions, time step one as one function, time step two as another, time step three as another, but you, and I guess previous work as well, chose a recurrent network. So not only are the functions the same across nodes, they are essentially the same across time steps too, because it's a recurrent neural network. Surprisingly few parameters, and the advantage is that I can pass in any graph; the graphs don't have to be the same, they can be totally different, and I can apply the same function, because it's essentially vectorized across the whole graph, which is going to play into your batching methodology as well once we come to that. But my question was essentially: the first iteration is just like a regular graph neural network, and then the second iteration, your improved version, this GHN-2, has a bunch of extras.
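The "few shared functions" point above can be sketched in a few lines: one update rule, with one set of coefficients, is applied at every node and reused at every round, which is what makes the same network applicable to graphs of any shape. The scalar features and the fixed averaging coefficients below are illustrative assumptions, not the paper's gated update.

```python
# Minimal message passing with one shared update rule: the same function
# (same "weights" w_self, w_msg) is applied at every node and, because it
# is reused at every round, at every time step too - like a recurrent GNN.

def message_pass(features, neighbors, rounds=2, w_self=0.5, w_msg=0.5):
    """features: {node: float}; neighbors: {node: [node, ...]}."""
    h = dict(features)
    for _ in range(rounds):
        new_h = {}
        for node, nbrs in neighbors.items():
            # aggregate: average the neighbors' current hidden states
            msg = sum(h[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
            # update: identical coefficients for every node and every round
            new_h[node] = w_self * h[node] + w_msg * msg
        h = new_h
    return h

# A 3-node chain 0 - 1 - 2, with all the signal starting at node 0
feats = {0: 1.0, 1: 0.0, 2: 0.0}
adj = {0: [1], 1: [0, 2], 2: [1]}
h = message_pass(feats, adj, rounds=2)
```

After one round node 2 still knows nothing about node 0; after two rounds the signal has traveled the two hops, which is exactly the long-path limitation the virtual edges discussed later are meant to ease.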

### Message Passing mirrors forward and backward propagation [23:40]

**Yannic:** There are some tricks right here. No, actually, I think this one is already in the previous version: your message-passing algorithm, if I understand correctly, isn't as straightforward as just "I get messages from all my neighbors". You have two different message-passing rounds: one mimics the forward pass through the neural network, and one mimics the backward pass. In one round I only get messages from my dependencies, and in the other round I get messages from my downstream dependents.

**Boris:** Exactly. That was part of previous work as well; they developed a specific version of a gated graph neural network that mimics this behavior of forward and backward propagation. What we found, though, is that just one round of propagation is enough, so we only do it once forward and once backward. You can do it multiple times, but we found it just wastes resources; it doesn't improve accuracy, for some reason.

**Yannic:** So essentially training your hypernetwork exactly mirrors training a real network, in that you do a forward prop and a backward prop, but then you backpropagate through that into the actual graph neural network weights.

**Boris:** Yes, in that sense it mimics how the networks are trained.
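A rough sketch of that forward/backward scheme, under simplifying assumptions (scalar hidden states, plain averaging instead of the gated update): one traversal in topological order, where each node hears from its inputs, followed by one traversal in reverse order, where each node hears from its consumers. The node names and update rule are hypothetical.

```python
# One forward and one backward propagation round over a DAG, mirroring how
# activations flow forward and gradients flow backward through a network.
# predecessors[v] lists the nodes feeding into v; order is topological.

def forward_backward_round(order, predecessors, h):
    h = dict(h)
    # forward round: each node receives messages from its inputs
    for v in order:
        if predecessors[v]:
            h[v] += sum(h[p] for p in predecessors[v]) / len(predecessors[v])
    # backward round: each node receives messages from its consumers
    successors = {v: [u for u in order if v in predecessors[u]] for v in order}
    for v in reversed(order):
        if successors[v]:
            h[v] += sum(h[s] for s in successors[v]) / len(successors[v])
    return h

order = ["conv1", "bn1", "conv2", "fc"]
preds = {"conv1": [], "bn1": ["conv1"], "conv2": ["bn1"], "fc": ["conv2"]}
h0 = {"conv1": 1.0, "bn1": 0.0, "conv2": 0.0, "fc": 0.0}
h = forward_backward_round(order, preds, h0)
```

Note that a single forward round already carries the first layer's signal all the way to the classifier node, and the backward round carries information the other way, which is why one round of each can suffice, as Boris says.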

### How to deal with different output shapes [25:20]

**Yannic:** Sorry to come back to this again, but every node keeps updating its hidden state as this progresses, and at the end you have a final hidden state for each node, which you then put into a decoder, this thing right here. So how do you deal with the fact that sometimes a convolution has three-by-five parameters and sometimes seven-by-ten, and sometimes it's an attention layer that needs query, key, and value? You can reshape, but how does a single architecture produce, especially, a different number of parameters?

**Boris:** That's actually the tricky part, and we did something very naive; there is a lot of room for improvement in that part. We applied a tiling strategy: first we define a tensor of a fixed shape, what we call the maximum shape, and if we need to predict a larger tensor, we tile it multiple times across the channel dimensions as needed. Tiling essentially means copying the same tensor multiple times to fill out the full shape, and if we tile too much, we slice. We slice along all four dimensions: height, width, and the channel dimensions. It's quite naive and limits the expressive capacity of the predicted parameters, but it's the only method we could make work efficiently so far.

**Yannic:** So there's room for some sort of weight upsampling, some technique where you don't have to know the number of outputs before you predict them.

**Boris:** Yes, or recurrent networks that predict parameters as you need them, sort of one at a time.

**Yannic:** Or, do you know these NeRF or SIREN implicit neural networks? Essentially you give the x and y coordinate of a picture as input, and the network predicts that particular pixel. You could do something here where you parameterize a weight matrix, maybe from zero to one, or not even from zero to one, but parameterize it somehow, and just ask the network: give me the output at this location, and at this location.

**Boris:** That's an interesting idea, actually. But I guess the autoregressive part might be more useful, because you want to somehow coordinate the weights with the other weights you've already produced.

**Yannic:** Yes, that's also a tricky part.

### Differentiable Normalization [28:45]

**Yannic:** You make some improvements right here that you also highlight. One is the differentiable normalization; do you want to comment briefly on what's new here?

**Boris:** In the original graph hypernetworks, the predicted parameters are used directly in the forward pass, and during training we found that they start to explode, similar to other kinds of unstable training: the predicted parameters become huge. What we found useful is to normalize them, such that the normalization is similar to the initialization methods people typically use, so that the scale of the predicted weights, the range of values, is approximately the same as in randomly initialized networks.

**Yannic:** Yes, this looks a lot like the usual formulas: take the incoming and outgoing number of units and normalize by that. You even use fan-in and fan-out here, terms people will recognize from initialization schemes. It makes sense, at least intuitively.
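As a sketch of the idea, one can rescale a predicted weight vector so its magnitude matches a standard fan-based initialization scale. The He/Kaiming-style target `sqrt(2 / fan_in)` and the RMS-based rescaling below are illustrative assumptions; the paper's exact normalization may differ.

```python
import math

# Rescale predicted weights so their magnitude matches a fan-based
# initialization scale, keeping predictions in the same range as randomly
# initialized networks so they don't explode during training.

def normalize_weights(w, fan_in):
    target_scale = math.sqrt(2.0 / fan_in)            # He-style std
    rms = math.sqrt(sum(x * x for x in w) / len(w))   # current magnitude
    if rms == 0.0:
        return list(w)
    return [x * target_scale / rms for x in w]

raw = [10.0, -20.0, 30.0, -40.0]       # exploded predicted weights
w = normalize_weights(raw, fan_in=8)   # rescaled to init-like magnitude
```

The rescaling preserves the direction (relative pattern and signs) of the prediction and only corrects its scale, which is why it can be applied differentiably inside the training loop.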

### Virtual Residual Edges [30:20]

**Yannic:** One of the parts I found interesting is these virtual edges that you introduce. We said that these graphs mirror the neural networks: they represent the architectures, specifically the computation graphs, and you have some examples right here. They're a bit small, but the left one is, I guess, something like a convolutional neural network with a residual connection, because the blue nodes are conv modules, here is one, you can see there are different paths, and there are always normalizations and so on. So these are convnets with residual connections, as computational graphs.

**Boris:** Yes, something like that.

**Yannic:** And you've found that it's not enough to build the graph like this; you can do better by introducing more connections between things. How did you arrive at this?

**Boris:** The problem is that the node propagation step we talked about just before struggles to propagate through a long sequence of nodes. The final node, which will usually be the classification layer, will have little information about the features in the first layers, and that's a problem, because then the graph hypernetwork doesn't know much about the overall global graph structure when making predictions. These virtual connections improve the global context.

**Yannic:** And how do you decide which things to connect? Here is an illustration with the computational graph in dark and the virtual edges in green.

**Boris:** We use the shortest-path distance between nodes, and we scale the virtual edge weight according to the inverse of this shortest-path distance.

**Yannic:** So in the end, is everything connected to everything?

**Boris:** We have a cutoff, so we don't connect nodes that are too far apart.

**Yannic:** And the parameters of the virtual edges, are they shared with the parameters of the real edges, or do they have their own parameters?

**Boris:** In equation 4 there is an MLP_sp, a separate network, to avoid confusion between real and virtual edges.

**Yannic:** I guess the edges don't have weights themselves, but when you propagate through the graph neural network, through these functions, you do distinguish between the real edges and the virtual edges you introduced.

**Boris:** Right. In the virtual case, instead of just averaging the features of the neighbors, it's a weighted average, where the weight comes from the shortest-path distance.

**Yannic:** Cool. And I find it a bit funny that in the hypernetwork you again run into the problem "what if our network is too deep", which is essentially the same problem deep networks had before ResNets, and your solution is also: hey, let's introduce residual connections between things, so information can flow further. It's funny to see the problems repeat one level up. Of course it's a different thing than a residual edge in a network, because this isn't the network, this is the input to the network, but it's kind of analogous to a residual connection.

**Boris:** That's true, and that was basically our motivation.
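A small sketch of the virtual-edge construction just described: compute shortest-path distances with BFS, then connect node pairs more than one hop apart with a weight equal to the inverse of the distance, subject to a cutoff. The cutoff value and the example chain graph are hypothetical choices for illustration.

```python
from collections import deque

# Virtual residual edges: connect node pairs that are far apart in the
# computation graph, weighting each virtual edge by the inverse of the
# shortest-path distance, with a cutoff so very distant nodes stay apart.

def shortest_paths(adj, source):
    """BFS distances from source over a directed adjacency list."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def virtual_edges(adj, cutoff=3):
    edges = {}
    for v in adj:
        for u, d in shortest_paths(adj, v).items():
            if 2 <= d <= cutoff:          # skip self and real (d == 1) edges
                edges[(v, u)] = 1.0 / d   # weight decays with distance
    return edges

# A 5-node chain 0 -> 1 -> 2 -> 3 -> 4, like a deep sequential network
chain = {0: [1], 1: [2], 2: [3], 3: [4], 4: []}
ve = virtual_edges(chain, cutoff=3)
```

On this chain, nodes two and three hops apart get virtual edges of weight 1/2 and 1/3, while the endpoints (four hops apart) stay unconnected because of the cutoff, matching the "don't connect nodes that are too far" remark.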

### Meta-Batching [34:40]

**Yannic:** And then the last thing is this meta-batching. Do you want to talk about that a little? What's the difference between this and mini-batching?

**Boris:** With mini-batching we usually refer to a batch of images: for each training iteration we sample a mini-batch of, say, 64 images. But in the baseline GHN they sample a single architecture per training iteration, so it's one architecture and 64 images. The gradient then becomes very noisy, because the gradients for different architectures are quite different. To improve the stability of convergence, we sample a batch of architectures together with a batch of images.

**Yannic:** And do you then do x1 with architecture one, x2 with architecture two, or do you build up a matrix and pass each image through each of the architectures in the batch?

**Boris:** No, we just do the first option.

**Yannic:** The analogy, if I train a regular neural network, would be this: in ImageNet the dataset is structured into folders, and every folder is one class. If I want to make my life easy, I just `ls` the dataset, go through it, and always sample from one class. That would give me a very bad gradient, because within the same batch I always have examples from the same class. And here you're saying the problem was that, within the same batch, people always had the exact same architecture, and you get a much better estimate of the gradient if these are sampled at random. It makes a lot of sense.
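A sketch of meta-batching as just described: each training step samples several distinct architectures and pairs each with its own image mini-batch, rather than using one architecture for the whole batch. The batch sizes and the string/integer stand-ins for architectures and images are illustrative assumptions.

```python
import random

# Meta-batching: instead of one architecture + a mini-batch of images per
# step (noisy gradients), sample a batch of architectures and pair each
# with its own image mini-batch in the same training step.

def meta_batch(architectures, images, arch_bs=4, img_bs=8, rng=None):
    rng = rng or random.Random(0)
    batch = []
    for arch in rng.sample(architectures, arch_bs):  # distinct architectures
        batch.append((arch, rng.sample(images, img_bs)))
    return batch

archs = [f"arch_{i}" for i in range(100)]  # stand-in architecture ids
imgs = list(range(1000))                   # stand-in image ids
mb = meta_batch(archs, imgs, arch_bs=4, img_bs=8)
```

The gradient for the step would then be averaged over all `arch_bs * img_bs` (architecture, image) pairs, which is what smooths out the per-architecture gradient noise of the single-architecture baseline.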

### Experimental Results [37:00]

that's where we get to the experiment i think the um experiments the main things we've already sort of taken away in that this is on um on c410 right here uh you have the test set um sort of you always split into average and max uh which is sort of performance is that that's test set performance right it's a test images and test architectures and the most from and the max is the maxis over what exactly over architecture so it's like the best what what's the performance of the best architecture that yeah okay i guess that's a fair metric to look at because you know each architecture if you were to train it doesn't have the same performance at the end right so with the max you can expect that um the max uh would be an architecture that if you trained it had sort of the state-of-the-art performance of at least the networks you consider the architectures you consider so maybe for c410 that might be 96-ish percent 97 yeah and yeah so you get some yeah sorry yeah we compared to sgd uh below right uh yeah and you have a similar sort of phenomena yeah that you have average performance and you have like the best performance that is a bit higher okay so that's yeah 90 i see that 90 93 for 50 epochs i guess the state-of-the-art things they are reached with a bunch more tricks like uh yeah like augmentation more implementation yeah i see yeah okay but i mean there is a considerable gap but it's still pretty cool to see right that you get especially um also the improvements that you make in within this parameter prediction regime uh with these new models or quite considerable and so if i consider for example from this or this which are 60 to 77 which is sort of like 80 and then that's like almost half of the way of the error right that you make compared to state of the art that's pretty really good and even on the out of distribution it seems the effect is even more um drastic right so do you have any idea in to why why on a on the out of distribution set your performance it 
drops, but doesn't drop too much, whereas these other methods' performance drops much more? Right, so I think those three tricks play a different role in each out-of-distribution split. For example, in the wide case, what helps, and we have those ablations in the appendix, is parameter normalization, because when you have a lot of weights it's important that they are at an appropriate scale. And for deeper architectures, I guess what's important is to capture this global context, because the network has a lot of nodes. And similarly for the other splits; for those it's maybe less intuitive exactly which single component makes it work, but I guess some interplay between those three tricks helps. Nice, yeah. And then at the end, you have a lot of ablations, which is really cool to see, and open-source code and everything, which is very much appreciated, especially on a more exotic task like this one. You're also able to predict some properties, like the accuracy of a network, where you make sure that your network really learns some intrinsic properties of the networks it's predicting. So the network is not only able to predict the weights, it's also able to say, you know, what's going to be the inference speed of the network, the approximate accuracy it's going to have, and so on, which is at least a bit of a confirmation that you're doing something meaningful instead of just copying over weights. This would be to counter anyone who says, well, you're just kind of copying over from the training set. And the last thing is, you then also experiment with fine-tuning.

### Fine-Tuning experiments [42:00]

Fine-tuning the predicted parameters: now, obviously, in this regime we're kind of doing meta-learning, right? We are learning a good initialization, and from that initialization I can then fine-tune. So my question is, how much time does that save me? If I use your network to predict the initial parameters of a ResNet-50, how much less time do I have to invest in fine-tuning it, compared to training it from the beginning? So we actually provide speeds in the table, and you can see the difference in time you need. It's not as much as you might want, maybe. As you see, we predict parameters in less than a second, but you can achieve the same result by pre-training on ImageNet for like half an hour or one hour. Sometimes more; on transformers it takes more time to achieve the same performance. So, yeah. Would you say your work maybe shows that something like this can be done, or do you think we'll get to a place where we make so much savings, because we only have to fine-tune for a tiny bit, that people will say, yes, I'm really going to use this instead of training from the beginning? Or do you think it might remain mostly an academic exercise? No, I think we can arrive at that. You can see that if we train for just five epochs on ImageNet, then we almost get the same performance as if we train for 100 or 200 epochs, just slightly worse. And my hope is that if we are able to predict parameters that are similar to five epochs of training, then we are done; we don't need to predict the parameters of a 100-epoch network. Yeah, I see. I mean, it makes sense; it would save a bunch of time and resources, and potentially allow us to investigate new and better architectures much faster. And especially if we scale to larger models, like if this holds also for GPT-style models, especially if it generalizes to way larger architectures, it
might even be that we get such a large model that it's prohibitive to train it from the beginning, but we might be able to predict and then tune. So, technically, implementation-wise, in our model we can predict parameters even for a network with a trillion parameters, because we use this tiling, right? So we can predict them exactly, but of course the quality will be very bad, and it may be difficult to fine-tune as well.

### Public reception of the paper [45:25]

So the last thing I want to get into is the reception. You have said previously to me that the paper has maybe been received a bit out of context, or a bit oversold. What do you mean by this? I think maybe people got the impression that we can predict parameters for a new task, for unseen tasks, which is not true. And even though I mentioned that we only make a single small step towards replacing SGD, I think people misread it and understood it as, oh, we are ready to replace SGD. No, we are not there yet; it's far, far away from that. Yeah, you can see the video thumbnail going something like "SGD not needed anymore: predict parameters for any neural network, we are done." That's the title I was trying to convince my co-authors of. I mean, it's a nice vision to have, right? But yeah, it's important to point out: you do generalize very well to unseen architectures, but it's always within the same task. Now, I guess the hope for the future, maybe you're already working on this or not, would also be to investigate generalization across datasets. You can imagine a situation where you've trained your system on ImageNet, and then you give it a little bit of the dataset of a new task, and it's able to adapt really quickly to that, or something like this. That would already be quite useful, I think. Right, and there are already works that actually do that, in the meta-learning sense, but normally they don't generalize well across architectures. They generalize well across tasks, that's their focus, but not across architectures. So there should be a way to combine those two. Yeah, sounds exciting. All right, I think this is a really neat overview of the paper. We'll end it here. Boris, thank you so much for coming, and good luck on your future research. Yeah, thank
you for inviting me; it was very fun to go through the paper, so I'm very happy. Thanks a lot.
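As a footnote on the tiling trick Boris mentions: a hypernetwork decoder emits a fixed-size parameter tensor, which can then be repeated (tiled) and cropped to whatever shape a target layer needs, so arbitrarily large layers can technically be filled in. A minimal sketch in plain Python, with a made-up 2x2 predicted block rather than the paper's actual decoder output:

```python
def tile_to_shape(pred, out_rows, out_cols):
    """Tile a fixed-size predicted 2-D tensor (list of lists) to an
    arbitrary target shape by repeating it and cropping the excess."""
    r, c = len(pred), len(pred[0])
    return [[pred[i % r][j % c] for j in range(out_cols)]
            for i in range(out_rows)]

# A 2x2 predicted block filled into a 3x5 weight matrix:
block = [[1, 2],
         [3, 4]]
w = tile_to_shape(block, 3, 5)
# w == [[1, 2, 1, 2, 1],
#       [3, 4, 3, 4, 3],
#       [1, 2, 1, 2, 1]]
```

This also illustrates the point made above: a trillion-parameter network is predictable in principle, but the heavily repeated weights are unlikely to perform well or fine-tune easily.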

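For reference, the average/max split discussed in the results section amounts to: evaluate the predicted parameters of every test architecture on the test images, then report both the mean accuracy over architectures and the single best architecture. A toy sketch with made-up accuracies (not the paper's numbers):

```python
# Hypothetical per-architecture test accuracies for predicted parameters
accs = {"arch_a": 0.61, "arch_b": 0.77, "arch_c": 0.69}

# "Average" column: mean accuracy over all test architectures
avg_acc = sum(accs.values()) / len(accs)

# "Max" column: accuracy of the best-performing architecture
best_arch, max_acc = max(accs.items(), key=lambda kv: kv[1])

print(f"avg={avg_acc:.2f} max={max_acc:.2f} ({best_arch})")
```

The max column is the fairer headline number in the sense discussed above: trained architectures also differ in final accuracy, so the best predicted architecture is what you would compare against a state-of-the-art trained one.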