# Fourier Neural Operator for Parametric Partial Differential Equations (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=IaS72aHrJKE
- **Date:** 22.11.2020
- **Duration:** 1:05:32
- **Views:** 76,388
- **Source:** https://ekstraktznaniy.ru/video/13268

## Description

#ai #research #engineering

Numerical solvers for Partial Differential Equations are notoriously slow. They need to evolve their state by tiny steps in order to stay accurate, and they need to repeat this for each new problem. Neural Fourier Operators, the architecture proposed in this paper, can evolve a PDE in time by a single forward pass, and do so for an entire family of PDEs, as long as the training set covers them well. By performing crucial operations only in Fourier Space, this new architecture is also independent of the discretization or sampling of the underlying signal and has the potential to speed up many scientific applications.

OUTLINE:
0:00 - Intro & Overview
6:15 - Navier Stokes Problem Statement
11:00 - Formal Problem Definition
15:00 - Neural Operator
31:30 - Fourier Neural Operator
48:15 - Experimental Examples
50:35 - Code Walkthrough
1:01:00 - Summary & Conclusion

Paper: https://arxiv.org/abs/2010.08895
Blog: https://zongyi-li.github.io/blog/2020/fourier-pde/

## Transcript

### Intro & Overview [0:00]

"AI has cracked a key mathematical puzzle for understanding our world" — this just in from MIT Technology Review. And look at this puzzle right here: it's got the bumps, it's got the valleys, the surfaces, it's got the braille, it's got the bits, the ones and the zeros, not only going up and down like in The Matrix but going in circles. It's got it all. This puzzle is really hard, as you can see, and AI has just cracked it. I'm being a bit hyperbolic, of course. This is actually about a new paper that can numerically solve a particular type of partial differential equation way faster than anything before it. So this is about this new paper, and we'll get into the paper in a second. It's pretty cool, but as you can see, MC Hammer — the infamous MC Hammer — has tweeted this out, and he actually has a pretty cool Twitter feed where he regularly tweets about scientific papers and so on. Pretty cool cross-domain overlap; I recommend it. So we'll get into the paper, and we'll get into the code a little bit as well, because I think it helps to understand what's going on. I want to start out with this blog post by one of the authors, which is pretty good for getting a basic overview of the paper, and here is the motivational example. The motivational example is the Navier-Stokes equation, an equation in fluid dynamics. You're trying to predict how a fluid evolves over time given certain parameters, like its viscosity and a forcing function — so basically how sticky it is and how hard you stir it — and then you want to know how it evolves over time. On the left a given initial condition is shown, and on the right, I think, is a rollout from the 10th time step until the 50th time step. The ground truth is obtained with a classic numerical solver, where you do little time steps and calculate the interactions, and this takes a lot of time and compute. On the right is the prediction of the new Fourier neural operator that this paper develops, and you can see it's almost equal. The gist of it is that the thing on the right takes just one forward propagation through a neural network — something like 0.0-something of a second to compute — whereas the thing on the left is quite hard to compute and, as I understand, can take minutes. So here you see the motivational example. These things are described by partial differential equations, which are sort of linearized ways of describing how the system evolves over one time step, and it would be cool if we could solve this faster, because this has applications in aerodynamics and other engineering fields. All right, so let's jump into the paper. As always, if you like content like this, consider sharing it out, telling your friends about it, and subscribing, of course. The paper is called "Fourier Neural Operator for Parametric Partial Differential Equations," and it's by Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar of Caltech and Purdue University. I feel the paper is both very cool and a bit overhyped. We're going to see what it does: it's for a particular type of PDE, and it has a lot of, let's say, engineering choices that make it possible to solve with neural networks, but that also limit its applicability compared to where the classical methods would be applicable and this thing isn't. So there are definitely trade-offs to reach the speed-up that they reach, but we'll get into this. First, I actually want to scroll all the way down, because there is something you don't often see in the machine learning field, and that is here in the acknowledgements section. I just find this interesting — don't read anything into it. The authors are supported by these LwLL grants, which I understand is DARPA; Beyond Limits, which makes AI systems for things like gas and oil, with British Petroleum as a main sponsor; Raytheon, which of course is a giant military manufacturer; the Army Research Laboratory; and so on. You can see this is — I don't see this often — quite a bouquet of sponsorships. Of course there's also Microsoft, Google, and so on. But it's just interesting to see that the Army is pretty heavily into these things, and of course they would be — rockets need to fly, and they need to be aerodynamic, and so on. I'm not saying this is bad or good; I just thought it was interesting that Raytheon would be a sponsor of this.

### Navier Stokes Problem Statement [6:15]

All right, so let's dive in. As we said, we're interested in these types of problems, where you have this quantity called the vorticity, which, as I understand, is derived from the velocity field — it sort of tells you how the fluid is moving right now. So you have this state, and then you apply a constant forcing function, and you want to know how that evolves over time. At time step 15 you get this picture, where these blobs move past each other; at time step 20 you can see they have moved quite a bit — this blue thing moves in here as well — and they just sort of mix. There are certain parameters that make the fluid more or less sticky, and the interesting regime, I guess, is when it's not too sticky but also not sticky enough, because then these really complicated patterns occur, and predicting them would be very valuable. So you want something that takes in this initial state and outputs all of these future states. Usually this is done by classical numerical solvers: the Navier-Stokes equation is described by a set of partial differential equations, which you can see down here, and this is fairly complex — it includes partial derivatives, gradients, and so on. This is the vorticity, and it appears on both sides, and these are second derivatives, maybe — or is it just the gradient? I don't even know; I'm not an expert in partial differential equations by any means, so for anything coming from that direction, don't take my word for it. I'm going to give you the gist of what I understand from this paper; with respect to that entire area, I'm not an expert. I can just see that this is fairly complex. What you usually do is take the initial state and evolve it in time: you take this time parameter, go one little time step, and then — because these are all sort of linearized equations — you calculate one little time step into the future and update your state. It's sort of like this: you have your points here, and how they move is given by their gradients, so these are all linearized things. Now, you don't want to move them too much per time step, because ultimately, if this thing moves and that thing moves, then the movement of this arrow will change, because the thing over here moved. So you compute one little time step into the future — this to here, and this to here — and then you recompute all of these arrows; maybe now one points a little more this way, and then you update again. So you have these numerical solvers that go in tiny time steps. It's not even that — if you see t = 20 here, that's not 20 steps for these solvers; they usually take something like a hundred or a thousand solver steps per time step shown here. They need to take very tiny steps to stay accurate, and that takes a long time. So the idea is: can't we simply input, say, the thing at time 15 and directly predict the thing at time 30? That's exactly what this paper does. A lot of papers have done this before, but without much success; this paper proposes to do it in the Fourier domain, and we'll see the path

### Formal Problem Definition [11:00]

that they take right there. So we'll shortly go over the basics. What you're looking for is a function G that takes an a and gives a u. What are a and u? They come from function spaces — a and u here are functions, as you can see. But you can also characterize them as data points; in this sense, functions and data points are sort of interchangeable. You can see an image as a data point, but you can also see it as a function where every (x, y) coordinate is mapped to a value. So when they talk about functions, very often they mean this type of function, which takes x, y, and t and maps them to some value — here, the vorticity. And you want to transform this function. So a would be the function at time 0, or say at times 0 to 15, and you want to map that to the function u, which also takes an x and a y — let's leave t out for the moment, or say t is set to 30 — and maps them to a vorticity. So you want to input a function and output a function, but from an engineering perspective that's the same as inputting an image and outputting an image. From a math perspective it's a little different, but other than that it's a fairly standard machine learning problem: you have these sets A and U, and you're looking for this function G that maps a to u. "We study maps G† which arise as the solution operators of parametric PDEs. Suppose we have observations where a_j is an i.i.d. sequence from probability measure μ supported on A, and u_j is a_j transported by G†, possibly corrupted with noise. We aim to build an approximation of G† by constructing a parametric map" — this G_θ right here. It's a bit of a mathy way of saying: we have a bunch of data points where a, the initial state, goes to u, the state at some point in time. We know there is a function G† — this G with the dagger — a single true function that maps any a to u, so that if I input the initial state it gives me the output state. And what I want to do is approximate this with a parametric version, where θ are the parameters. As you can guess by now, G_θ is going to be a neural network parameterized by θ — these would be the layers of the neural network — and we're going to input a and get out u. So that's basically it. There is quite a bit of math right here, and the math here is to

### Neural Operator [15:00]

derive what they call a neural operator. So here is one layer of this neural network. As we said, we're going to input a. The first thing we do is up-project a into a latent representation v0 — let's call the map P. So there is a function P, a little neural network layer, which produces v0, a latent state of the network. Then there is a number of these layers that transform this to v1, v2, v3 — I think there are four of them in their particular implementation, but there don't need to be four; you can choose that as you can choose any depth of neural network. And at the end you project that down to whatever output you want, so u; this function is called Q. P and Q are just going to be neural networks — your very classic up-projections and down-projections of a data point. We'll actually get into sampling right now. One thing they stress is that they work in function space: they don't map data point to data point. What you could do is simply have a convolutional neural network, an image-to-image network, and so on — but what's the problem with that? If you have your a, your initial state with this bunch of fluid features in it, what you do when you have an image is sample it on a regular grid — I am terrible at drawing regular grids — into a certain number of pixels, and your neural network operates on that. This gives you some kind of tensor; let's say it's a seven-by-seven grid, so your neural network is going to expect that as its input dimension, and likewise for u: you map this to u, which is also going to be some sort of image where you need to output pixels. So again, you have some set resolution, and your neural network can only operate at that particular resolution. The cool thing about what they're doing here is that it can operate at any resolution: once the network is learned, you can input higher-resolution images, output higher-resolution images, deal with more or less resolution, or with irregularly sampled data. And how do they do it? By only ever acting pointwise in the spatial domain. So they're going to take this a — and now we get to the more critical things. Here, a and u aren't just the beginning state and the end state. In fact, in this Navier-Stokes example, a is a tensor with slices, and each slice describes one time step up to a given time. This here could be t = 0, the initial distribution, then t = 1, and so on, up to t = 10 — I think they do 10. They let this thing evolve for 10 time steps — I'm going to guess using one of these classical methods — and that's the input. So the input isn't just the initial state; the input is what happened in the first 10 time steps. And the output isn't just the state at one particular time; the output is also a sliced tensor, where each slice describes the output at a particular time, so this would be t = 11 up until t = 50.
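The tensor layout being described — input time slices stacked as channels at each grid point, output time slices likewise — can be sketched with random weights standing in for the learned P and Q. The shapes and the `width` hyperparameter name are illustrative assumptions; the point is that the lift and projection touch only the channel axis, never neighboring pixels.

```python
import numpy as np

H = W = 64            # spatial grid (any resolution works: P and Q are pointwise)
t_in, t_out = 10, 40  # 10 observed time steps in, 40 predicted time steps out
width = 20            # latent channel dimension, a hyperparameter you choose

rng = np.random.default_rng(0)
a = rng.standard_normal((H, W, t_in))  # input: one channel per observed time step

# P: pointwise ("1x1") lift from t_in channels to `width` latent channels.
P = rng.standard_normal((t_in, width))
v0 = a @ P                             # shape (H, W, width) -- no pixel mixing

# Q: pointwise projection from the latent width down to t_out output channels.
Q = rng.standard_normal((width, t_out))
u = v0 @ Q                             # shape (H, W, t_out)

print(v0.shape, u.shape)  # (64, 64, 20) (64, 64, 40)
```

Because the matrix multiplication acts on the last (channel) axis only, the same P and Q apply unchanged if the next sample uses a different grid size — which is exactly the discretization independence claimed for these layers.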
Okay. So the top one is the conceptual picture, but the bottom one is what really happens: they input 10 time steps and get out the 40 subsequent time steps — they predict them all at once. And now you can see how I understand this: at each pixel, I want to know that pixel's value after a certain number of time steps, like 11 or up to 50. And of course the result is going to depend not only on time 0 but on the entire evolution from time 0 to time 10. So this is an entire column for that pixel, and it's akin to that pixel having this many channels — technically 10 channels, or 11, or something like this; I probably screwed up the counting, it should be t = 0 to 9 and then 10 to 49. But this entire stack can be interpreted as input channels, and that one as output channels. Ultimately, one pixel's input channels are all the time steps up to the point from which we want to predict, and its output channels are all the time steps we want to predict. So, coming back to these projections: they simply work on the channels. P and Q are one-by-one convolutions, which just up-project and down-project these features — actually, they could be dense layers, let's check that in the code later — but what's for sure is that they only work pointwise; they don't mix the individual pixels together. You get a d-by-d grid where each point has 10 channels, so you have d × d × 10, and you up-project that using P to d × d × w, where w is a parameter you choose — your latent dimension. You keep transforming this tensor at d × d × w dimensionality until you back-project it using Q to d × d × 40, in this case. But P and Q only work pointwise, which means there is no particular dependence on d: the next data point could have a different d, and as long as this pipeline can handle different dimensions — which it can, because P and Q only act pointwise — you're good. So what do these magic layers in the middle do? These are the Fourier neural operators. They transform one hidden state into the next. Note that there are four of these layers; they don't need to match the number of time steps we're trying to predict, and it's pretty clear from here: these four hidden layers simply transform this entire input volume through a sequence of latent states and then output the entire output volume. So the depth has nothing to do with the time steps we're trying to predict; it is simply a sequence of latent computations. And you know that in a neural network, the deeper you make it, the more complicated the functions that arise — even though the universal approximation theorem says that with one hidden layer you can approximate anything, in general deeper networks can do more complicated things — and four seems to be a good amount of complicated for these particular problems. So here's what one of these layers does. It is very much like a residual network. You have v_{t+1}, the hidden representation at t+1 — and t+1, as I said, is not the time step in the Navier-Stokes sense of time evolution of the PDE; it is simply layer t+1. I don't know why they — well, maybe t here still makes sense, because they use a capital T for the total; but in the engineering sense it is simply the layer index. You can see it's formulated as a function, but again, don't be confused: the x right here is simply the (x, y, t) coordinates, so all of this can be represented as one big tensor — x, y, t, or x, y, channels, or something like this. Don't be confused by the fact that these are formulated as functions. So this one neural network layer has, as you can see at the very end, a nonlinearity — a pointwise nonlinearity in the original spatial space, the d-by-d space; each of the entries gets a nonlinear function slapped on top, as is normal. Then this part is normal as well: it's simply a linear transformation of the input, again pointwise. So far so good — a linear transformation of the input and a nonlinearity. The important part is this thing here: a kernel function that depends on the initial condition — not only on the last hidden state but on the initial condition — which is applied to the last hidden representation, and only then evaluated at x. Notice the difference. Here, at a point x, we first get the function value — the entry of that tensor — and then apply the linear transformation; that makes it pointwise. There, we first compute this function by applying the kernel to the entire input tensor, and only then look up the particular entry. So this one is a pointwise transformation of the tensor, while this one takes in the whole tensor and outputs a new tensor. This is going to be the magic. K, as you can see, goes from the a-space to bounded linear operators on U, and is parameterized by θ — maybe; what's this symbol? I don't know, I never know. This kernel is chosen to be a kernel integral transformation parameterized by a neural network. They define the kernel integral operator like this, and you can see it's an integral over D, the shared domain of u and a. So this is a function that depends not only on where you are in the tensor but also on the initial input a, and it's integrated against v over the entire space — you can see this is like a convolution, and it's fairly complicated. This alone tells you nothing, but luckily they restrict it. It's a bit annoying when things always depend on this a, because it means each of these Fourier neural operators — each of these arrows — would also depend on a, and that's a bit annoying for deep learning, since we want one layer's representation to go into the next. So they simply make an engineering choice and say: nope. "We impose" — we impose! — that we remove the dependence on the function a, and that the kernel is simply a function of x − y, not of x and y separately. So now you have a proper kernel function in there that we can handle, and "we obtain that (4) is a convolution operator." It wasn't a convolution before — it was just an integral — but if you restrict your kernel functions like this, you get a convolution. "We exploit this fact in the following section by parameterizing K directly in Fourier space and using the fast Fourier transform to efficiently compute (4). This leads to a fast architecture which obtains state-of-the-art results for PDE problems." So there's quite a bit of math here to finally arrive at this thing, and what all this math is for is saying: we want to build our neural network like this
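The layer just described — pointwise linear transform W, plus a kernel term K computed via the FFT, then a nonlinearity — can be sketched in one dimension with a single channel. This is a toy illustration under stated simplifications: the real architecture works on 2D/3D multi-channel tensors with a complex weight tensor per retained mode, and the ReLU here is just a stand-in for whatever activation the reference code uses.

```python
import numpy as np

def fourier_layer(v, R, W, modes):
    """Sketch of one Fourier neural operator layer (1-D, single channel):
    sigma( W*v + IFFT( R * truncate(FFT(v)) ) ).

    v     : real signal sampled at n points, shape (n,)
    R     : learned complex weights for the kept low modes, shape (modes,)
    W     : learned pointwise linear weight (a scalar in this sketch)
    modes : number of low Fourier modes kept; higher modes are discarded
    """
    v_hat = np.fft.rfft(v)                   # to Fourier space
    out_hat = np.zeros_like(v_hat)
    out_hat[:modes] = R * v_hat[:modes]      # multiply kept modes, drop the rest
    spectral = np.fft.irfft(out_hat, n=len(v))  # back to the spatial domain
    return np.maximum(0.0, W * v + spectral)    # pointwise linear + nonlinearity

n, modes = 64, 8
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
v = np.sin(x) + 0.1 * np.sin(20 * x)  # smooth signal plus a high-frequency ripple
rng = np.random.default_rng(0)
R = rng.standard_normal(modes) + 1j * rng.standard_normal(modes)
out = fourier_layer(v, R, W=0.5, modes=modes)
```

Note how the high-frequency `sin(20x)` ripple cannot pass through the spectral path at all, since only the first 8 modes survive — that truncation is the regularization discussed in the next section.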

### Fourier Neural Operator [31:30]

this. Okay — and what we do is simplify and specify this kernel until it looks something like this: we restrict the kernel to be a convolution, and since a convolution in Fourier space is just a multiplication, instead of taking the function v and convolving it with this kernel, we take the Fourier transform of v, multiply it in Fourier space by this thing — which is now simply a matrix of learned parameters — and then do the inverse Fourier transform. Now you might ask: why is this relevant? Why can't we just do a convolution like we normally do? The reason is this. When you do a Fourier transform of some signal, you transform it into Fourier space, where you have these basis functions — different parameterizations of sine waves (or you can do it with cosine waves), getting faster and faster — and you know you can decompose any signal into these basis functions of this periodic function space. So this function right here might be 1 times this basis function, plus 2 times this one, minus 5 times that one, and so on — you can describe any signal that way. Now, the special thing about the type of PDEs we're looking at is that they are fairly well described if you simply cut away the top Fourier modes and only work with the lower ones, because the top modes are the tiny individual ripples you might not want to take into account. So you can truncate to the lower Fourier modes, and that's exactly what they do. Instead of transforming the signal directly into the next hidden representation, they go to Fourier space, cut the top Fourier modes, and then make the next representation in Fourier space — that's this R, simply a weight matrix they multiply with — and you can prove that multiplying in Fourier space is the same as convolving in the original space. So they multiply the green numbers by R and get something out. Maybe this is way too much, but: you multiply the green numbers by R to obtain new green numbers — say R is 2 and 0.4, so you get new coefficients — then you do the inverse Fourier transform and get back a signal, now with 2 times this component (so it might be bigger) and 0.4 times that one. I can't even draw it, but you get the idea: you go to Fourier space, apply R — multiplication by a matrix that you learn, in Fourier space — get new Fourier coefficients, and map them back, and there you have your next layer's representation. Almost. So this is the Fourier neural operator, and it's described right here. You take your hidden representation, put it through a Fourier transform — which you can do in a differentiable fashion — and get these Fourier modes, which describe how to decompose the signal into periodic functions. You throw away the top modes, which is your sort of regularization. You apply R, which is like a dense layer of a neural net — not even that, it's a multiplication by a weight matrix — and you obtain new Fourier modes. You do the inverse transform, and you have the next representation. Almost — because, as we saw before, there's also a pointwise transformation in the original spatial space. This is very much like a residual network; residual networks also have this, implemented as one-by-one convolutions. And then at the end you apply the nonlinearity. What's good about this? Two things. First, throwing away the top Fourier modes is very advantageous for the types of problems we have here: the little jiggles get sorted out by the larger-scale movements of the fluid, so throwing away the top modes is a regularization that helps with generalization, and it's very easy to do in Fourier space. These signals — unlike natural images — are described well in Fourier space, and that, again, is an engineering choice: you can't apply this to everything, only where this type of assumption holds. Second, this is now fully independent of the discretization of the input. When I take a picture and sample it on a three-by-three grid, I can do a Fourier transform and get all of these numbers; when I sample it on a seven-by-seven grid — sample it super densely — I do the same Fourier transform and get the same numbers. Well, not exactly the same — they always claim it's the same, but it's not exactly: if you don't sample densely enough, your Fourier transform isn't going to be as accurate, let's say. Ideally you'd want the Fourier transform of the real underlying signal, but since you sample it, you can't have that, so there is a bit of a difference. But it is fairly independent: the function R that you learn simply operates on these Fourier modes, and those are fairly independent of how regularly you sample — of course, more regular is better, but still fairly independent. So what they're going to do is train on something like a coarse grid and then sample more densely during inference, which is something you can do — but understand that this is just a form of interpolation: the inverse Fourier transform simply gives you whatever you ask for, interpolating using the Fourier modes it has. And given a certain number of Fourier modes — which is quite small for them, I think something like 8 or 12 — higher resolution at some point doesn't help you anymore, because you've cut off the high-resolution Fourier modes. What can help you is the pointwise path, but it only acts pointwise. So you see, this is now fully independent of the discretization of the signal, which is a cool thing. The two cool things about this entire setup are: first, independence of discretization; second, these types of problems lend themselves very well to being described in Fourier space. So that's why I'm saying this is for a particular type of problem. There are also a bunch of other trade-offs. You have this entire input tensor and this entire output tensor, which can be fairly large, and all the intermediate representations have to be at d × d × w. So you can't go infinite in time here, like you could with a classic numerical solver, where all you need is the last time step: you go from t = 1 to t = 1.1 to t = 1.2 and so on, always from the last time step to the next. Here, since it's a neural network, during training you need to keep all of these intermediate tensors — I guess you could do gradient checkpointing — but engineering-wise, you predict all the future time steps at the same time, so you can't really go infinite in time. And how do you train this thing? You train it by simply giving it one of these a's: you have a data set of these input tensors, where you say, here is one of these Navier-Stokes-type problems, I've sampled it somehow and let it run for 10 time steps and then longer. So here are the time steps: t = 0 to t = 9 or 10 — let's say 10 — and here is t = 11 to t = 50.
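The supervised setup just described — first 10 solver steps as input, next 40 as target — amounts to slicing precomputed trajectories into (x, y) pairs. The shapes below are illustrative assumptions, but the structure follows the video: the classical solver produces the trajectories, and time steps end up on the channel axis.

```python
import numpy as np

def make_training_pairs(trajectories, t_in=10, t_out=40):
    """Split solver-generated trajectories into supervised (x, y) pairs.

    trajectories: array of shape (n_samples, T, H, W), produced by a
    classical numerical solver (still needed to generate training data).
    Returns x with the first t_in steps and y with the next t_out steps,
    each with time moved to the channel axis: (n, H, W, t).
    """
    x = trajectories[:, :t_in]              # time steps 0 .. t_in-1
    y = trajectories[:, t_in:t_in + t_out]  # time steps t_in .. t_in+t_out-1
    # (n, t, H, W) -> (n, H, W, t): time steps become per-pixel channels
    return np.moveaxis(x, 1, -1), np.moveaxis(y, 1, -1)

# Toy data: 5 trajectories, 50 stored time steps each, on a 16x16 grid.
traj = np.random.default_rng(0).standard_normal((5, 50, 16, 16))
x, y = make_training_pairs(traj)
print(x.shape, y.shape)  # (5, 16, 16, 10) (5, 16, 16, 40)
```

Each trajectory yields one pair here; the network is then fit to map x to y across many initial conditions, which is exactly why the classical solver cannot yet be retired.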
okay so you have a data set and this data set is fully computed by a classic forward solver so you can't replace the forward solvers right yet because you need them for generating training data right so this becomes your training data this becomes generally your x and this becomes your y and now you're learning this neural network this entire thing to give you x to y so you see you still need the classic solvers to produce the training data that's the first thing the second thing is you can pretty clearly see that um the good thing is that now we can input any a so the classic solvers you need to rerun them for each initial condition now we simply train with a bunch of initial conditions train the neural network to predict what happens then and then it can generalize to other initial conditions but you know about generalization that the problem is we can only trust our neural network if the problem we're considering is very similar to what we had in the data set it doesn't arbitrarily generalize okay so that is you know it it's something to remember so i said all of these things have trade-offs trade-off one there is you have to predict all time steps at the same time which is hard on your memory right it limits the size of things you can do trade-off two you can only really trust the neural network if the problem you're considering is within your data set vicinity there are other problems that we've mentioned problem three we've made very specific choices with respect to how our kernel looks that it's only ever dependent on x minus y so therefore it is a convolution um there there's all these channels you know engineering choice more you cut off the top fourier modes which um limits the types of signals you can analyze uh the next choice is the number of intermediate computation steps right here which limits the complexity you can assume and so on so there are just i'm not saying you don't have choices in the other numerical solvers you probably do but um just 
remember that this is the case. Someone might say: well, if you want to predict for longer time horizons, couldn't you just make this t = 11 and then go not in slices of one but in slices of a hundred? So this could be t = 111, t = 211, and so on. And that is completely valid. What they actually do is subdivide the time axis further: instead of 40 time steps they do something like 80 time steps, but still between time 11 and 50, I believe. The problem with extrapolating like this and leaving away time steps is the following. Here you have a supervision signal in your training for each of the times, and time step 16 is just a small evolution away from time step 15, a small difference. It could be that the neural networks, because they don't have internal dynamics (they don't internally, dynamically simulate this physical system; they simply learn to map things to things), can still make sense of it as long as the slices are closely related. So if this is slice 15 and this is slice 16, and these are strongly related, the network can learn the relation between them. You could also implement this as an RNN, and then going from one step to the next likewise makes sense without any internal dynamic simulation. However, if you jump from time step 15 directly to time step 115, the latter might look nothing like the former, because the system has evolved so much, and there can be quite chaotic dynamics. That's the entire problem with PDEs: the dynamics can be super complicated and not easily predictable. So there you don't really have a learnable relation, and since the neural network doesn't do internal dynamic simulation, I'm going to guess something like this wouldn't work too well. I could be wrong, but I'm
going to guess classical solvers are still needed for this type of situation. So the other limiting factor is that you are bound to data samples that can be statistically, correlatively predicted from one another, without having to run the real underlying physical simulation. Though I've been proven wrong in the

### Experimental Examples [48:15]

past. All right, so they talk a bit about how the fast Fourier transform plays into this (there is actually an interesting thing there which we'll see in the code), and then they have three examples: Darcy flow, Burgers' equation, and the Navier-Stokes equation. They also do these Bayesian inverse problems, where, I believe, you have the evolved thing given at some time step and you want to find out the original thing. What you do is you have an algorithm that is simply guessing: you have a u given and you want to find the a, so the a is unknown. You start with an a0 and guess what u is going to be from that a0, so you evolve your state a to u, and if it's not entirely correct you try again: you try a1 and see what that gives you. So you play a game of guessing, and you have an algorithm that does this guessing kind of smartly (it says, ah, that's not the direction I want to go); it's a little bit like a reinforcement learning algorithm. The important part is that it needs to do a lot of these forward evaluations: it changes a a little bit, then evaluates and checks whether the u that comes out is the same as the u that you want. So you want to find the initial state of any given evolved state, and if you need a lot of forward evaluations, it's going to be a problem when each forward evaluation is really slow, like with these classical simulators. So these neural networks can really help here, and I think they bring the time for this entire evaluation down from about 18 hours to two and a half minutes. That's pretty cool, and they also outperform these baseline methods in terms of error. So not only are they faster, they are also less error prone. All of this is pretty cool. Now let's just spend a short time diving into the code, which is still
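A toy sketch of this guess-and-check inverse loop (the forward map, step size, and iteration count are all made up for illustration; the paper's actual Bayesian inverse experiment uses a more principled MCMC-style sampler):

```python
import numpy as np

# Toy stand-in for the forward evolution a -> u. In practice this is
# the expensive classical solver, or the fast learned surrogate.
def evolve(a):
    return np.roll(a, 3) * 0.9

rng = np.random.default_rng(0)
true_a = rng.standard_normal(16)
observed_u = evolve(true_a)          # the evolved state we are given

# Guess-and-check: perturb the current best guess for a, keep the
# perturbation if the forward evaluation matches the observation
# better. Note how many forward evaluations this burns -- which is
# exactly why a fast surrogate for evolve() pays off.
best_a = np.zeros(16)
best_err = np.linalg.norm(evolve(best_a) - observed_u)
for _ in range(200):
    candidate = best_a + 0.1 * rng.standard_normal(16)
    err = np.linalg.norm(evolve(candidate) - observed_u)
    if err < best_err:
        best_a, best_err = candidate, err
```

Two hundred iterations means two hundred forward solves; at 18 hours per classical solve that is hopeless, at a fraction of a second per network pass it is trivial.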

### Code Walkthrough [50:35]

quite a bit hacky, but that's research, so deal with it. Here you can see that the top class is called Net2d. I always like to look at the forward pass before I look at how the network is constructed, because then you understand how things flow. In the forward pass you simply have this conv1; it's not really a convolution, it's simply an instance of this SimpleBlock, and x is just passed through it. By the way, there is quite a bit of data preparation going on: you have a and you have u. As you can see, a is prepared as an S by S (that's the discretization of the grid) by T_in tensor, so S by S by 10, the ten input time steps, and it is already expanded along a T axis, where T is the number of output steps we're going to consider. So a is transformed repeatedly into a tensor that ultimately will have T output time steps, and you can see you have to hold one of these things in memory for each training sample. Then you annotate x, y, and t; these are like positional encodings (if you know transformer positional encodings, these are simply linear positional encodings for x, y, and t). You concatenate those and off you go. So, where were we? x was forward-passed through this SimpleBlock2d. What's the SimpleBlock2d? It's this thing right here, so again let's look at the forward pass. First we go through fc0, which looks like a fully connected layer; we permute the axes; then we go through conv0, w0, a batch norm, and a ReLU. You can see this is what we saw in the diagram: x1 and x2 are the different paths through the network. This is the top path (if I go back to the paper quickly, this is the top path in this diagram), and the bottom path is this thing right here. Then the two are added, and then there's a batch norm,
which is not in the diagram, and then there is a ReLU. The bottom path is pretty simple, and you can see from how they structure it that this is going to be pointwise: it is not a transformation in pixel space, it is a transformation only in the channels. These w's are implemented as one-by-one convolutions; you can see it's a 1d convolution and the kernel size is one. So all this does is, for each point in the grid space, for each pixel, take all of that pixel's channels and transform them into a new vector with the same number of channels. You can see the input channels and output channels always have the same dimension; this entire network actually operates on this width, the latent dimension. It's only the first layer that transforms the input from 13 (which is 10 input time steps plus the three positional encodings) to this latent dimension, and then the last layers transform it from the hidden dimension to 128 (for some reason) and then from 128 to 1, so each pixel has a one-dimensional output, which is the vorticity that you're trying to predict. By "pixel" here I mean an (x, y, t) entry. All right, so this goes from 13 to 1, and then it is reshaped again, of course, to the appropriate size to give you all of the outputs. So this is the input, this is the output down here, and in between we have four blocks of this upper path and lower path. The lower path, as we just saw, is a one-by-one convolution, and the upper path is this conv0, which is this SpectralConv3d_fast. It's parameterized by these modes; the modes are how many of the Fourier modes you want to retain (we saw we throw away the top Fourier modes, whatever they are), in this case set to four, which is actually eight if you work it out, and we'll see why. So, this SpectralConv3d_fast: again, let's
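In case the pointwise claim is unclear, here is a small check (in numpy, with made-up sizes) that a kernel-size-1 channel mixing is just one matrix applied independently at every grid point:

```python
import numpy as np

rng = np.random.default_rng(0)
width, num_pixels = 20, 64 * 64          # latent channels, flattened grid
x = rng.standard_normal((width, num_pixels))
w = rng.standard_normal((width, width))  # the "kernel size 1" weight

# Applying w to the whole tensor at once ...
y = w @ x
# ... is the same as transforming each pixel's channel vector
# separately with the same matrix: no pixel sees its neighbors.
y_per_pixel = np.stack([w @ x[:, i] for i in range(num_pixels)], axis=1)
assert np.allclose(y, y_per_pixel)
```

This is why all the spatial mixing in the architecture has to come from the Fourier path; the w path only remixes channels.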
look at the forward pass. What does it do? It does a fast Fourier transform, and at the end an inverse Fourier transform. So we are now in the top part right here: Fourier transform, and at the end inverse Fourier transform. Now, this R in the middle is implemented a bit weirdly because of how the fast Fourier transform works. What you get out of it is basically an image (actually a 3d thing, but think of an image), and the important, low-frequency Fourier modes are not at the bottom or at the top; they are actually in the corners. So what you want to cut away is all of the middle part; that is equivalent to throwing away the high-frequency components. That's why this is implemented so weirdly. You can see that first we index up to `modes` in each of the x, y, and t directions, but then we also index from the end, the last `modes` entries in each direction, combined with all the others: this is corner one, this is corner two, this is corner three, and the bottom two right here are corner four. It's a bit weird, and we don't actually have to do this with eight corners, which you might have expected, because why don't we do it with modes3? You see, modes1 and modes2 always appear positive and negative, and you would guess we need to do the same thing again with negative modes3, but we don't, because this transform is one-sided: since the input is real, the Fourier transform has a conjugate symmetry, so a lot of the entries would be redundant, and the one-sided transform only stores one half of these symmetric entries so that it doesn't waste memory. It does so along the last dimension, so that dimension doesn't have this corner property. It's a bit weird, and you need to know the exact implementation of the Fourier transform, but you
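A small numpy illustration of the two points above: the one-sided transform halves the last axis, and keeping the low frequencies means keeping two blocks ("corners") per full axis but only one block on the halved axis (grid size and mode count are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
s, modes = 32, 4
signal = rng.standard_normal((s, s))

# One-sided FFT over the last axis: conjugate symmetry of a real input
# means only s // 2 + 1 entries need to be stored there.
ft = np.fft.rfft2(signal)
assert ft.shape == (s, s // 2 + 1)

# Low frequencies live at both ends ("corners") of each full axis, so a
# low-pass keeps [:modes] and [-modes:] there -- but only [:modes] on
# the halved, one-sided last axis.
kept = np.zeros_like(ft)
kept[:modes, :modes] = ft[:modes, :modes]
kept[-modes:, :modes] = ft[-modes:, :modes]
smoothed = np.fft.irfft2(kept, s=signal.shape)
assert smoothed.shape == signal.shape
```

This also shows why "modes = 4" effectively touches 8 entries along each full axis: four positive and four negative frequencies.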
know, that's what it is. You can see that this mul3d here (it's compl_mul3d) simply multiplies the input, which is the signal, by these weights. The weights, as you can see, are simply a weight tensor of shape in-channels by out-channels by modes by modes by modes, by two because they're complex numbers, and you can see in this multiplication that it is a complex-number multiplication: this is the real part and this is the imaginary part. The operator is an einsum operator; I just thought this was funny, the subscripts read something like "bixyz", "ioxyz", "boxyz", and I challenge everyone to make Einstein summation notation that spells cool words. The important part is this: a is going to be the signal, which is batch, in-channel, and then x, y, t; b is going to be the weight tensor, which is in-channels, out-channels, x, y, t. You can see pretty clearly in the Einstein notation that the input channels are multiplied and summed away, and what results is the output channels. So this is basically a matrix multiplication for each of the samples in the batch and for each location x, y, t: a multiplication summing over the input channels, resulting in the output channels. It's a pretty standard transform mapping vectors to vectors; it's complex-valued, it's in Fourier space, but ultimately it's just a multiplication. So that's the code; they simply do four of these layers, going to Fourier
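The per-mode channel mixing can be sketched like this (index letters, shapes, and the use of native complex arrays instead of a trailing real/imaginary dimension are illustrative, not copied from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_ch, out_ch, mx, my, mt = 2, 3, 5, 4, 4, 4

# Complex signal in Fourier space: (batch, in_channels, x, y, t).
a = rng.standard_normal((batch, in_ch, mx, my, mt)) \
    + 1j * rng.standard_normal((batch, in_ch, mx, my, mt))
# Complex weights: (in_channels, out_channels, x, y, t).
b = rng.standard_normal((in_ch, out_ch, mx, my, mt)) \
    + 1j * rng.standard_normal((in_ch, out_ch, mx, my, mt))

# Sum over input channels at every retained Fourier mode: a small
# matrix multiply mapping in_channels -> out_channels, applied
# independently per batch element and per mode (x, y, t).
out = np.einsum("bixyt,ioxyt->boxyt", a, b)
assert out.shape == (batch, out_ch, mx, my, mt)
```

Only the channel index is contracted; every mode keeps its own weight matrix, which is where the "multiplication in Fourier space" lives.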

### Summary & Conclusion [1:01:00]

space and then back again. Why do they do this? Because, as we saw, they throw away the higher modes, and that severely limits applicability: if you did everything purely in Fourier space, you would severely limit yourself. In fact, these Fourier methods are already not really good for problems that have non-periodic boundary conditions; the periodic-boundary-condition case is, as I understand it, one of the easiest cases, so the applicability would be limited. The authors hope that by carrying things through real space the whole time, and by having these encoder and decoder networks, they can retain this information and be applicable to more than just periodic boundary conditions. And that's basically it. I was ranting for so long; I think we are through this paper, so maybe a quick summary, because this was a bit of a rant. The types of things you want to predict are well described by their Fourier analysis, so transformations in the Fourier domain actually make more sense: the evolutions of these things are more or less global signals, not localized like natural images (where there's a cat here and something else there). A pattern right here will repeat as you go out to infinity; these sorts of patterns repeat and repeat, so the global interactions between these periodic signals are much more important. That's why it makes sense to go to Fourier space and do the transformation there. In Fourier space you can regularize by throwing away the higher modes, and you get the additional benefit of being discretization independent: you learn the function once, and then you can input differently discretized signals as you choose, and the function stays the same, because the Fourier transform will do as well as it can with whatever discretization you give it. Once you're in Fourier space, you simply have a multiplication, and
it's actually interesting: the authors show some of the filters that are learned. On top you see filters in a CNN, and on the bottom you see these Fourier filters (as I understand it, transported back to pixel space so we can interpret them). You can see the global kinds of patterns that these Fourier operators are sensitive to, compared to the CNN filters, which just pick out a localized pattern. So it makes sense to go into Fourier space. There are a number of trade-offs you have to make: specifically, you have memory requirements, you can only predict signals that are similar to what you've seen in the training data set, and in principle you could only solve things with periodic boundary conditions. But by means of the architecture (these encoder and decoder networks at the beginning and end, the P and the Q, and the fact that you always carry the pixel-space signal through in a residual way), you might get around this. It's not a proof, but there is a possibility. In total, this thing is way faster and more accurate than the baselines, has broad applicability, and is sponsored by the nice people at the military. All right, this was long, I realize, but I invite you to check it out. The paper is technical but well written; if you stick it out through the math part in the middle, it's pretty cool. Check out the code as well, and I wish you a good time. Bye-bye.
