Lecture 20 – Learning the feature vector 𝒇(𝒙)



Contents (14 segments)

Segment 1 (00:00 - 05:00)

Uh, welcome back to class! So, every time you don't understand, it's my fault. It's my first time teaching this class with young kids, so I have to get used to you. The point is — joking aside — I've been giving you a mental model throughout this class, with some mathematics, some drawings, some intuition, so that you end up with my mental image, my mental model, reflecting what I've been building with you so far. Then, in the second part of the semester, I switched gears to my graduate-level content and completely forgot about your perspective. And so you were like, 'Oh, what is happening? We need more context.' Yes, I figured; you complained, and I understand. Now I have the missing parts. Although I want to go forward, I will start today's lesson with a small recap of why we are doing what we are doing, and then we can move forward together to understand what else is going to happen. Okay? So, once again, if it doesn't make sense, it's not your fault; just tell me, so I can explain myself better.
So, today we're having lesson 20, which is about understanding what we are trying to do with these neural nets — namely, learning the feature map. Let's get back to what we were doing with classical logistic regression. Given an (x, y) pair, how were we performing classification — how do we perform supervised learning for, say, multiclass classification? We have an (x, y) pair; then what? We transform it with f(x), right? We have to compute f(x). What does that mean? These are the features that represent my x — for example, if we were talking about an email: the presence of the dollar sign, or maybe the fact that it uses all caps. Those are attributes. So: compute f, the feature vector, which contains some attributes of our input x. All right, next, what were we doing? Say we are doing multiclass classification. To perform K-way classification we need weights w₁, w₂, …, w_K. I can collect all these weights inside a matrix for practicality: W is all those weight vectors stacked as horizontal rows (the transpose of the column collection). Then I can compute — what? — Wf. What does that give us? If I multiply this weight matrix times my feature vector, what do I get? Compute s. What is s? s stands for several things at once.

Segment 2 (05:00 - 10:00)

One reading is 'linear sum'; another is 'similarity'; a third is 'score'. Score, similarity, linear sum — all of these. The vector s is just W times my vector f, which checks the projection of f onto each row of this matrix. Now you look surprised — why? Besides the new guy, what about the rest of the class? I'm just reviewing things we've already seen; this was on the third test, right? You've already done this stuff. Okay, fine, we have this similarity thing. How can I know which class my model says the data point belongs to, if I want to make a prediction? The argmax, right. If I perform argmax over s, I can figure out which of the capital-K classes the model believes this point belongs to. So the prediction: the model believes x belongs to the class with the highest score, ŷ = argmax_k s_k. Finally, the probability output — the probabilities the model assigns: we have a vector ȳ whose entries are P(y = 1 | x), …, P(y = K | x). Which function does this come out of? The soft argmax. Am I amazing, right? So argmax is the hard one, which just picks the winning class; this one is the soft version, where I see how much the model thinks the sample x belongs to each of those K classes. This is our soft argmax of s, which we know: the exponentiated version of s divided by the sum over k = 1 to K of exp(s_k). Okay, so this is recap; we already know all of this. And here we've only been training a single weight matrix W — this f was given to us, we did not compute it. We don't know where this f comes from; someone told us: 'oh, maybe count how many times the word "dollars" appears,' or 'check whether my name is in the email,' or 'count the number of loops in the digit you're looking at,' or 'look at whether pixel (x, y) is on or off.' Now, the point of using these deep networks is to actually learn this feature vector. So what is the diagram for this? The diagram is the following: I have my f that goes inside a W, which goes inside this soft argmax; this comes out with a ȳ, which I compare through a cost against my y. What cost are we using with this soft argmax? The cross-entropy cost. So our cost for y and ỹ is the cross-entropy between y and ỹ. Okay — this is our circuit so far. Now we have to figure out where this f comes from, and this is where deep learning comes in: back-propagation comes in to help us here.
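The whole pipeline just recapped — feature vector, class-weight matrix, scores, soft argmax, and cross-entropy cost — can be written in a few lines. This is a minimal sketch with made-up numbers (3 classes, a 4-dimensional feature vector), not the course's actual dataset:

```python
import numpy as np

def softargmax(s):
    """Soft version of argmax: exponentiate the scores and normalize."""
    e = np.exp(s - s.max())          # subtract the max for numerical stability
    return e / e.sum()

# Made-up example: K = 3 classes, feature vector f of size 4.
f = np.array([1.0, 0.0, 2.0, -1.0])          # f(x): attributes of the input x
W = np.array([[ 0.5, -0.2,  0.1,  0.0],      # w_1
              [-0.3,  0.8,  0.4,  0.2],      # w_2
              [ 0.1,  0.1, -0.5,  0.9]])     # w_3, stacked as the rows of W

s = W @ f                  # scores: projection of f onto each row of W
y_hat = int(np.argmax(s))  # hard prediction: the class with the highest score
p = softargmax(s)          # soft prediction: one probability per class

# Cross-entropy cost against the one-hot target y (say the true class is 0):
y = np.array([1.0, 0.0, 0.0])
cost = -np.sum(y * np.log(p))      # reduces to -log p[true class]
```

Because `y` is one-hot, the cross-entropy collapses to the negative log-probability the model assigns to the correct class — the quantity training will push down.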

Segment 3 (10:00 - 15:00)

A neural network is a sandwich of linear and nonlinear operations. For example, I can draw my neural network here: I start from x, it goes inside my first linear module, then a nonlinear module (N for nonlinear), then maybe another linear, then another nonlinear, and so on. And guess what comes after a linear? A nonlinear. Amazing. At the end, the output of this last stage is ỹ, and what do we do with ỹ? We compare it against y. So this is our architecture: this first part is our feature extractor, and this last part is the classical linear classifier — the one we have been looking at, where we divide our space with lines and basically figure out whether you are aligned with a given vector or not. That's what we were using so far, but it is not capable of handling this kind of scenario. Say this is my x space, with x₁ here and x₂ here, and I have these clusters. We would like to figure out how to split this space in three — but what are the decision boundaries of a linear classifier, the thing we've been seeing so far? Just lines, right? Remember the multiclass perceptron. So if I just apply a linear classifier here, it's going to try to split this into three parts, but as you can see, those data points overlap my decision boundaries — this is not good. How do we fix it? Yes — you have to add more dimensions. If my x has just two dimensions, we have to explode it into a high-dimensional space, which is exactly what our feature vector does. Okay, the case you saw last time in class was this other one:
Pull a point out of the board: we go into a third dimension, and then we can cut with a plane right here. How do we pull things out? Here I knew how to construct a feature vector by hand. In this case I can say the feature vector of x is f(x) = (x₁, x₂, x₁² + x₂²): yes, here I know how to construct my feature vector mathematically such that the points on the outside actually come out — this third component f₃ will allow those points to come out. So f = (f₁, f₂, f₃) with f₁ = x₁, f₂ = x₂, f₃ = x₁² + x₂².

Segment 4 (15:00 - 20:00)

You're very quiet — what's going on? But how about this case here: how can I learn to unwind this thing? That's the whole beauty of these deep networks. The feature extractor allows me to map this stuff to another space, which might look, for example, like this: I have f₁, f₂, f₃, and the purple cluster might go here, the green one here, and the last one, the yellow one, here. Are these linearly separable? Yes — so we solved the problem. How do we go from here to here? I didn't tell you yet. But first, how do we perform classification here, with this linear classifier — where do I draw my boundaries? Along the axes, right: more or less I can do something like this, and now I can split this into three different regions that do not overlap. Okay, so what's in here? How do we change the input space into this feature space — this arrow here is my feature extractor — and how do we figure out the specific function that unwarps the spiral? Two things. We're going to define a loss for our weights, where the lowercase w indicates the collection of all those W's — there may be several, say W₁, W₂, W₃. My per-sample loss, for the (x, y) pair, is equal to — we only have one cost — just my C(y, ỹ), and since we are performing classification we choose the cross-entropy. So the cost, finally, is the cross-entropy between y and ỹ. Now, how do we change W₁ and W₂ — how do we find them? What do you think — trial and error, let's just keep trying?
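The hand-crafted lift described above — f(x) = (x₁, x₂, x₁² + x₂²) — can be checked numerically. This is a toy sketch with made-up points (an inner cluster and an outer ring, standing in for the board drawing): in 2-D no line separates them, but the third feature, the squared radius, pulls the outer points up so a horizontal plane does:

```python
import numpy as np

def lift(x):
    """Hand-crafted feature map: f1 = x1, f2 = x2, f3 = x1^2 + x2^2."""
    x1, x2 = x
    return np.array([x1, x2, x1**2 + x2**2])

# Toy data: inner points (radius ~0.5) vs. outer points (radius ~2).
inner = [np.array([0.5, 0.0]), np.array([0.0, -0.4]), np.array([-0.3, 0.3])]
outer = [np.array([2.0, 0.0]), np.array([0.0, 1.8]), np.array([-1.5, 1.5])]

# After the lift, the third feature (the squared radius) separates the groups:
f3_inner = [lift(x)[2] for x in inner]
f3_outer = [lift(x)[2] for x in outer]
# any threshold between max(f3_inner) and min(f3_outer), e.g. f3 = 1,
# defines a plane in feature space that splits the two groups cleanly.
```

The deep-network question that follows is exactly: what if nobody hands us this `lift` function — can we learn it?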
No, of course not — back-propagation, it's still back-propagation. We perform back-propagation, and what does back-propagation do? We use backprop to compute — which one? — the partial of ℓ with respect to each weight, ∂ℓ/∂wᵢ. And then we use it, in a sense, to walk downhill: wᵢ becomes wᵢ − η ∂ℓ/∂wᵢ. Why is this stochastic? Because we are considering the per-sample loss ℓ — as opposed to the calligraphic L, which is the average of ℓ over the whole data set.
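The per-sample update just written — wᵢ ← wᵢ − η ∂ℓ/∂wᵢ — and its full-batch counterpart can be contrasted on a toy problem. This is a sketch with a made-up 1-D least-squares loss (not the course's spiral setup), purely to show the two update schemes; both recover the optimum w = 2:

```python
import random

# Toy data: fit w so that w * x ≈ y; per-sample loss l = 0.5 * (w*x - y)^2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # y = 2x, so the optimum is w = 2
eta = 0.05                                    # learning rate

def grad(w, x, y):
    return (w * x - y) * x        # d l / d w for a single sample

# Batch gradient descent: each step uses the average gradient over the data set.
w_batch = 0.0
for _ in range(200):
    g = sum(grad(w_batch, x, y) for x, y in data) / len(data)
    w_batch -= eta * g

# Stochastic gradient descent: one step per (randomly picked) sample.
w_sgd = 0.0
random.seed(0)
for _ in range(200):
    x, y = random.choice(data)
    w_sgd -= eta * grad(w_sgd, x, y)
```

Batch descent follows the gradient of the calligraphic L (the data-set average); SGD follows one per-sample ℓ at a time, which is noisier but much cheaper per step.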

Segment 5 (20:00 - 25:00)

If you follow the gradient of the calligraphic L — the average of ℓ across the whole data set — that is gradient descent, also called batch gradient descent. If instead we take one step for every sample, it is called stochastic, because you are not minimizing the true L; you just follow a single per-sample ℓ. Let me write it here: minimizing the calligraphic L over a data set — which equals the average of ℓ over the whole training set — is gradient descent; minimizing the per-sample ℓ, one sample at a time, is stochastic gradient descent. So this is, I guess, the missing part from last lesson for understanding what we are doing. And I made a video for you that shows how my feature extractor does what it does. I'm going to show you the feature vector; in my specific case, the architecture I used is the following. Actually, let me first summarize the diagram with two simple modules, the red one and the blue one — a higher-level representation of the model above: I have my x that goes inside my feature extractor; it goes inside the linear classifier, which provides the prediction, which I compare through the cost against my target. This linear classifier is just the last module; the feature extractor is everything before that last module. Okay, so the network I've been using for the visualization in the next video has the following architecture. We start with two units for the x. Why two units? Tell me the names — yes: x₁, x₂. x has two components, and each point in the diagram has two coordinates; that's why I represent the two items x₁, x₂ here. Then this goes inside a matrix, I take the positive part, and the result is a vector of

Segment 6 (25:00 - 30:00)

size 100. Okay. After having these 100 positive features — all of these features, f₁, f₂, f₃, and so on, are positive numbers, because I deleted all the negative ones — I have a projection matrix, a 2 × 100 projection matrix, such that I can display something for you on the screen. These are my two-dimensional embeddings. And then finally I have my classifier, as above. Question? Yes: 'So we're going from two sources of information, the two dimensions of x, into a 100-dimensional space, and every dimension of that space has a value that came from a linear combination of the first two — but only the positive linear combinations?' Correct: every number I get that is negative, I set to zero, so my features only have positive values. Basically, for each row of this matrix I check the projection of my x against that row; if the projection is positive I keep the number, and if it is negative I don't care — I set it to zero. So my feature vector f contains only the positive projections of x onto the 100 rows. 'That makes sense; that's our choice of features. Then we need to get into the embedding space — is the 100-dimensional space called the embedding space?' No — that's my feature space; I have 100 positive features there. How many features should you use? Roughly, the larger the better. You can play with this in the notebooks: try different sizes of the feature vector — 2, 4, 8, 16, 32, 64, up to 100 or whatever — and you'll see how the performance changes. I really recommend running that notebook. The feature dimension should be larger than the original space — really a couple of orders of magnitude larger. Whenever you go into a high-dimensional space, things can be moved: I can unwarp this spiral. If I don't go to a 100-dimensional space, it's harder for the system to unwarp it and make it linearly separable. The higher-dimensional the intermediate space — this feature space — the more flexibility the model has to pull things apart; the lower-dimensional the hidden representation, the harder the model has to fight to move things. And that's the fighting part: this is how you change weights, by following the gradients, and in a high-dimensional space these gradients are very nice, while in a low-dimensional space the gradients are horrible and nothing really moves or changes. Try it — actually run the notebook, try different dimensions for the feature vector, and see how the optimization, the learning procedure, changes.
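The network just described — 2 inputs, a linear map to 100 units, the positive part (ReLU), a linear projection down to a 2-D embedding, and a linear classifier to 5 class scores — can be sketched as a forward pass. Shapes follow the lecture; the weights here are random placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)   # "positive part": keep only positive projections

# Shapes from the lecture: 2 -> 100 -> (positive part) -> 2 -> 5.
W1 = rng.normal(0.0, 0.1, (100, 2))   # feature extractor: 100 rows to project onto
W2 = rng.normal(0.0, 0.1, (2, 100))   # projection down to the 2-D embedding
W3 = rng.normal(0.0, 0.1, (5, 2))     # linear classifier: 5 class vectors

def forward(x):
    f = relu(W1 @ x)   # 100 positive features
    e = W2 @ f         # 2-D embedding (what the video plots)
    s = W3 @ e         # 5 class scores
    return f, e, s

f, e, s = forward(np.array([0.3, -0.7]))
```

Swapping the `100` for 2, 4, 8, … and retraining is exactly the notebook experiment suggested above: the width of that middle space controls how easily points can be pulled apart.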

Segment 7 (30:00 - 35:00)

Okay — so finally, what is this video about? The video shows, over training, how the embedding evolves. At the beginning, how do we initialize all these W's? These networks are usually initialized with very tiny weights. So what happens if I multiply my x points by tiny weights — what is my feature vector going to be? If you have a matrix with tiny numbers and you multiply your input by it, what do you get in the output? Yes — something very tiny. So at the beginning we expect the embedding E to be near zero, because the network is initialized small. Then what happens? The weights grow, so that the initial random transformation starts doing something that makes sense. ('Will the weights also shrink?' Here they mostly have to grow.) When the network has tiny weights, it just outputs roughly zero. You can have negative weights too — 'tiny' means the weights are initialized by sampling from a Gaussian distribution with a very tiny variance, so all my weights are very small numbers, say between −0.1 and +0.1, all very tiny. And if you multiply anything by tiny numbers, you get a tiny output. But the model needs to move things around so that they are clearly, nicely separated — and separating things is the opposite of putting them all together. The network starts by putting everything together, and then, as it learns, the model learns how to separate them. So at the beginning everything is together; minimizing ℓ makes the gradients grow the weights so that f becomes separable. And how do we get the 2-D picture? I take my features and multiply by a random projection matrix of size 2 × 100 — a matrix with two rows — so all 100 dimensions get projected down to two. It's just a random matrix at the beginning, and then the model learns how to use it — how to align it with, if you want the full explanation, the directions of maximum variance. Why project at all? Because I need to plot things on the screen: I cannot plot 100 dimensions; I can only plot two — it's a two-dimensional canvas. So this is a two-dimensional output just before my linear classifier. If my linear classifier needs to linearly separate those E's, the E's themselves need to be linearly separable, and so the model will force these 2-D points to be linearly separable.
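The claim about tiny initialization — multiply an input by near-zero Gaussian weights and everything collapses toward the origin — is easy to verify. A small sketch with made-up numbers (standard deviation 0.01 as the "tiny" scale, 1.0 as an ordinary scale):

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.array([1.0, -2.0])                    # an ordinary input point
W_tiny = rng.normal(0.0, 0.01, (100, 2))     # tiny variance at initialization
W_big  = rng.normal(0.0, 1.0,  (100, 2))     # same shape, ordinary scale

out_tiny = np.maximum(W_tiny @ x, 0.0)       # features under tiny init
out_big  = np.maximum(W_big  @ x, 0.0)       # features under large init
# With tiny weights every feature is close to zero: the embeddings start
# clustered at the origin, exactly as in the video's first frames.
```

Training then has to grow these weights before the random transformation can do anything useful — which is why the first epochs of the video mostly show pure expansion.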

Segment 8 (35:00 - 40:00)

But it can only do that by projecting these features from 100 dimensions down to two. So first it moves the points in a 100-dimensional space such that everything is nice and far apart, and then, once everything is nicely placed in that high-dimensional space, it can push everything back onto this two-dimensional canvas, and now everything is nicely separated. Let me show you the video. (Oh, hello — okay, never mind, sorry, I'm a child.) So this is my embedding space at the beginning. What do you see? These are the axes: one axis here, the other axis there. How are my embeddings? Clustered all together at zero. Are these things linearly separable? No. Okay, so I'm going to show you these embeddings as I train this network for 2,000 epochs — you can see the counter here; this is epoch zero. After the first 250 epochs, what happened? The model just learned to do one single thing: expand. It almost looks like the input, right? It's still tangled. So the first thing it learned is, more or less, the identity function — remember, my spirals are something tangled, and after 250 epochs the model basically just multiplies the final output by a scalar, so it expands: a scalar scales things. Let's see what happens after epoch 250. Oh — you see, now the center starts rotating clockwise while the outer part starts rotating counterclockwise. Why is that? Because our spirals rotate opposite ways, and so the system has to figure out how to rotate one part one way and the other part the other way. How do you rotate things on a plane — what operator are you using? Matrix multiplication, right: if you multiply points by a matrix, you can rotate them. But if I rotate the whole spiral, what happens? Nothing — the whole spiral just rotates, and it's still a spiral. So what the model has to learn is how to rotate the portion in the center one way and the portion on the outside the other way, and this 100-dimensional hidden space allows it to separate things into different planes and rotate one plane one way and another plane the other way. The more dimensions you add, the more freedom the model has to pick whatever axes it wants; the fewer axes, the harder it is — maybe you only need three axes, I don't know, but it's harder for the model to figure out 'I need to use that axis.' If I give you 100 axes, you just pick whichever you want. 'Yeah — I'm looking at the background, where the square grid has been warped; the warping isn't symmetrical in any way, it seems like some areas are warped more.' Right — so what is the objective here, what are we trying to do? We are trying to minimize the loss ℓ. How did we define ℓ? From the class-vector similarities — the scores of the point in the embedding space. Either you write it as −log ȳ_y, or, equivalently, as the soft maximum of the scores minus the correct score. And so, to get the correct score — the correct scalar — to be the highest, the model basically has to move those points.
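The point about rotation being a global operation — multiply a whole spiral by a rotation matrix and it is still the same spiral, just turned — can be checked directly. A small sketch with a toy spiral (any parametrization with radius growing along the angle works):

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix: multiplying by it turns every point by theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# A toy spiral: the radius grows with the angle.
t = np.linspace(0.0, 4 * np.pi, 50)
spiral = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)   # shape (50, 2)

R = rotation(0.7)
rotated = spiral @ R.T        # rotate EVERY point by the same 0.7 radians

# The operation is global: distances from the origin are untouched,
# so the rotated point set is the same spiral, merely turned.
radii_before = np.linalg.norm(spiral, axis=1)
radii_after  = np.linalg.norm(rotated, axis=1)
```

To rotate the center one way and the outside the other way — which is what untangling the two spirals requires — a single matrix is not enough; you need the nonlinearities and the extra dimensions discussed above.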

Segment 9 (40:00 - 45:00)

It has to put those points in locations where the correct class wins. So let me show you. 'Just going quickly back to your point about transforming the space — is it transforming the space, or transforming the one point?' It transforms every point; it's a nonlinear transformation of the whole space. You have a model, you put things inside, and you force the model to change the location of a given point — but the whole space deforms with it. Say my original points look like this: the soft maximum of the scores is much larger than the correct class's score. How do I make the correct class's score as large as the soft maximum — the maximum? Well, I need to align the points with the right class vectors. Okay, this is actually a good question — thank you; let me write it here. I didn't plan this, but it helps me improve my lessons. Let's figure out the loss for multiclass logistic regression. The per-sample loss for weights w on the pair (x, y) is the soft maximum of the scores minus the correct score: ℓ(w, x, y) = softmax(We) − w_yᵀe. (Yes — sorry, my bad; now it's correct.) This quantity is always larger than zero. How do you get to minimize this subtraction? Look at the products inside: the score vector (w₁ᵀe, w₂ᵀe, …, w_Kᵀe) — the product with the first class weight, with the second, the third, and so on — and then the soft max of it; just pretend it's written 'max'. So what does the max of this mean? These are inner products — hold on, what is the size of these vectors? What is the size of e? ('That's wrong — thank you, let's try again.') e is two, right — e is what I showed you before, these 2-D embeddings. And how many classes? y ranges over five. So what is the size of this W₃ matrix inside the linear classifier? Five by two, correct: W₃ is a 5 × 2 matrix. Good, okay, sweet. So here I have w₁ᵀe, and so on until the last one, and each of these wₖ is a vector of size two. Amazing. So now I am in the embedding space.
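The loss just written on the board — the soft maximum (logsumexp) of the scores minus the correct score — is the same number as the negative log of the soft-argmax probability of the correct class. A numeric check with made-up weights, using the lecture's shapes (K = 5 class vectors of size 2, one embedding e):

```python
import numpy as np

def loss_logsumexp(W, e, y):
    """Per-sample loss: soft max (logsumexp) of all scores minus the correct score."""
    s = W @ e
    m = s.max()                                  # shift for numerical stability
    return m + np.log(np.sum(np.exp(s - m))) - s[y]

def loss_neglog_softargmax(W, e, y):
    """Equivalent form: -log of the soft-argmax probability of the correct class."""
    s = W @ e
    z = np.exp(s - s.max())
    p = z / z.sum()
    return -np.log(p[y])

# Made-up numbers: the 5 x 2 classifier matrix W3 and a 2-D embedding e.
rng = np.random.default_rng(2)
W = rng.normal(0.0, 1.0, (5, 2))
e = np.array([0.4, -1.2])
l1 = loss_logsumexp(W, e, y=3)
l2 = loss_neglog_softargmax(W, e, y=3)
```

Both forms are strictly positive whenever more than one class exists, and they vanish only in the limit where the correct score dominates all others — which is exactly why minimizing the loss drags the embedding toward its class vector.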

Segment 10 (45:00 - 50:00)

In the embedding space I now have to draw the five class vectors: one, two, three, four, five. Amazing. So now the question: how do I minimize this loss — what do I have to do with my e's? Let's say these w's are fixed — the classifier weights; pretend we are not training the classifier, we are training the e's, because I'm changing the weights in the front, in the feature extractor. How do you minimize the loss if, say, y is the green class? You have an (x, y) pair, y is green, and the point e is over here. How can I lower this loss? Yes — you make e align with the green class vector. If I move this point here, it is now aligned with the green vector and I can classify it. But I move it — and then other points have to go in different directions; maybe this other point has to go over there. So whenever we are learning the features, we are learning to move points around. 'So when you move that one green point, you're not just moving that one point — you transform the whole space?' Yes, exactly, because these networks are smooth functions. And this is actually a very good point: if I have a very tiny network and my feature space is very small — say I want to bring this chair over here — then when I move one thing, everything else gets dragged along; everything just comes back together. Exactly the point, very good. So say f was tiny, 10-dimensional; if instead of a 10-dimensional f I use a 100-dimensional f, now I can move this one point and the others stay far away — they don't drag each other along. The larger the f, the more freedom you have to move things around; the smaller the f, the more everything follows when you pull on something, and the network gets very confused — it doesn't know how to do what you asked, because everything moves together. Why is that the case? Because if you just do a matrix multiplication to rotate the space, everything rotates: linear operations are global operations. How do you make a linear operation less global? You add nonlinearities. But if you don't also increase the dimension, everything still moves together. The super cool thing — which we don't really understand very well — is that whenever you are in a high-dimensional space, things become very separated, very loose, and if things are loose you can move them, no problem. This is the major point here. All right — I haven't even finished playing the video, and I haven't started today's actual lesson, but it doesn't matter. So what happens next? Here the model just learns the identity, and now it learns to rotate: the inner part clockwise,

Segment 11 (50:00 - 55:00)

and the outer part counterclockwise. And now, boom — it just did what I said it was going to do: it brought the points to where they are aligned with — what are these? How do you call this item here? A vector; okay, we can call it an arrow, usually. You can see five different arrows, right, showing one, two, three, four, five different directions. 'Is that the decision boundary? What's a decision boundary?' The decision boundaries are the separator lines that split the colors — these things here are called decision boundaries; they are what separates the different regions. How did I compute those separations? 'I thought it was like the line between two vectors — the line that bends?' Correct: if you have two classes, you have a single vector, and that vector has an orthogonal plane — that is binary classification. Then we move to multiclass classification, where you have capital-K different weight vectors. How do we perform inference — how do we decide which of these K classes the model believes a data point should be assigned to? 'Is it not just the argmax?' Oh yes, it is the argmax — over the scores against the matrix of all the class vectors. Computing the argmax of that score vector tells you which class vector has the largest inner product with the point, and that is exactly what the background color represents. If you check the source code, that's how I computed it: I take every point of the canvas, all possible (x, y) coordinates — a grid of, I don't know, 10,000 points, from here all the way down there — I send all of them through the model, and I check the score, the projection against these five kernels, these five weights, and then I take the argmax. All the points over here have their largest projection against this green vector — they are the ones most aligned with it; if you take a vector from the center out to here and check the angle, it's the one closest to the green direction. These points over here have their highest projection with the purple vector — that's why this background region is purple. So: the color of the points is the original y, the data assignment; the color of the regions is what our linear classifier, this LC, is doing — splitting our space into five different regions through these five different weights. And the model learns two things at once: how to rotate these final class weights — because the final weights can change — and also how to rotate and move my features f, or in this case the embeddings e.
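The background coloring just described — sweep a grid over the canvas, score every coordinate against the K class vectors, and color by argmax — reduces to a few lines. A sketch with made-up class vectors (here scoring the grid points directly, i.e. the linear-classifier part only, without the feature extractor in front):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 5
W = rng.normal(0.0, 1.0, (K, 2))     # made-up class vectors ("kernels")

# Grid of all (x, y) canvas coordinates, as in the visualization.
xs = np.linspace(-2.0, 2.0, 50)
ys = np.linspace(-2.0, 2.0, 50)
grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)   # (2500, 2)

scores = grid @ W.T                  # inner product of each point with each class vector
regions = scores.argmax(axis=1)      # color index: the most aligned kernel wins
# Each grid point lands in the region of the class vector it projects onto most;
# plotting `regions` over the grid reproduces the colored decision regions.
```

With a trained network you would first send each grid point through the feature extractor and only then take the scores, which is what bends the straight region boundaries into the warped shapes seen in the video.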

Segment 12 (55:00 - 60:00)

E and the W can change: the weights will most likely place themselves such that they maximize the space between them. It would not make sense to have all these weights pointing in the same direction — you would just be using a cone. If instead the weights are roughly equally distributed across the circle, you can split the space in the most efficient way: if you have the whole circle, use the whole circle, not just half of it. So the overall point of today's lesson was to understand that, whereas in the earlier lessons, when we were using a single-layer neural network, we were only learning to orient these vectors and expecting the points to be linearly separable, in a deep neural network, using backpropagation and gradient descent, you can also change the location of the feature points — or, in this case, the embedding. I show you the embedding because I cannot show you the features: ideally I would show you what happens in 100-dimensional space, but I don't know what's happening in high-dimensional space — it's too many dimensions to plot — so I plot it in two dimensions. And so here, finally, we learn that the model just moves things around such that my linear classifier is able to split the space into non-overlapping sections. Questions?

Thanks for asking. All the points of the same color share the same argmax with a specific vector: the color of each region is the argmax, while the color of the points is my y, my data points. So each point here represents three numbers: the original x1, the original x2, and then y, the target. In this case this was a yellow point, so it belongs to class yellow. I don't know the coordinates of this one — this one is the output; I'm showing you the embedding E. So these points here are whatever comes out of this last part of the model. And then, rewinding the whole thing: down here we started with the model outputting basically nothing, and then as you train the model it grows the weights — it basically learns the identity function up to here — and then afterwards it learns to undo the warping part in the center. Okay, this is different from this other one here. This is the same network, but what I'm showing you now is the interpolation between the input space X and the embedding space at the end. So this is not training: my model is already trained, and I'm showing you how the original space gets morphed. All the points in the input, sent through the model, get stretched and pulled such that, again, every point of the same color gets placed in the same part of the plane. Whereas before they were all intertwined, here the model just shows you how to undo that. So these are the points going through the already-trained model; finally, you see once again the argmax at the end, and it's going to be exactly the same. I guess these ones are a little more spread out than the other ones, but again, more or less they take up the whole space. Finally, let me show you another model. Okay, so this is the same architecture — 2, 100, positive part, 2, 5 — first the training video, then the input–output interpolation of the pre-trained model. All the vectors, the kernels that we mentioned — yes, I've been calling these "kernels"; it's just another word for my classifier vectors. We can call them class vectors: in every matrix multiplication I multiply vectors against the rows, and I can call those rows kernels, which are
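As an aside, the input-to-embedding morphing animation described above can be sketched in a few lines of numpy. This is a hypothetical stand-in — random, untrained weights in the lecture's 2 → 100 → 2 shape, all names invented — just to show the plumbing of the interpolation frames, not the trained model from the video:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in 2 -> 100 -> 2 embedding with random, UNTRAINED weights
# (in the lecture the model is trained first; this only shows the plumbing).
W1, b1 = rng.standard_normal((2, 100)), rng.standard_normal(100)
W2, b2 = rng.standard_normal((100, 2)), rng.standard_normal(2)

def embed(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # linear, then positive part (ReLU)
    return h @ W2 + b2                 # final 2-D embedding, plottable

x = rng.standard_normal((500, 2))      # input points
e = embed(x)

# One animation frame per t: linearly interpolate input -> embedding.
frames = [(1.0 - t) * x + t * e for t in np.linspace(0.0, 1.0, 30)]
# frames[0] is the raw input space, frames[-1] is the embedding space;
# scatter-plotting the frames in order reproduces the morphing animation.
```

Scatter-plotting each frame in turn (e.g. with matplotlib) gives the stretching-and-pulling effect shown in class; with a trained model, points of the same class would end up grouped in the final frame.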

Segment 13 (60:00 - 65:00)

basically the directions I'm interested in — just another word for the same thing. So finally: both of those networks were 2, 100, 2, 5. In the next network I just have 2, 2, 2, 2 — I force my model to stay in a 2D space so I can plot it on the screen. Is it going to work? No. Why? It's going to do whatever I showed you before with the chairs: if I don't go into a high-dimensional space, everything just gets stuck together. So let me show you this one. This is the architecture — slightly wrong notation, but again: my input, then I have some outputs here, one, two, three, four activation layers, then my final embedding, which I can display, and then in this case only three classes. These are the wrong symbols — this would be y-hat, the softmax, the argmax. So what happens with my model... oh, what happened here? Can you see what happens? How does this animation compare to the previous one? There's like nothing happening... oh well, things are happening, just in the wrong way. The other one was smooth, yes — the previous one was smooth, and this one is chunky. Guess what: these chunky transformations are linear transformations. What you see here is a piecewise linear transformation: I have these global rotations that are applied to specific regions, and in order to make something separable the model has to pull super hard. And as you can see, if you pull super hard, things start growing too much — you can see the separated data points, and even here it really pulls super hard, these weights are super large. This is such a bad model: all the space is completely skewed, but it is somehow able to separate, although I had to spit blood to make this stuff work. So this kind of worked.

Oh, what happened? Why? All right, let me show you what each module does. The first one is a linear transformation; the second is the positive part, so it kills the negative things; then I have another linear transformation; then again it kills the negative stuff; then I have another linear transformation — this one is so large, the weights are so large — and then here it almost scales everything down. A huge transformation. This is towards the end, and you can see that the more we move towards the head of the model, towards the end, the larger these weights become. They are really large, and here it really stretches things out — even the ReLU is not able to pull it back. And what happened here? Here I didn't even use a complete ReLU — I used what's called a leaky ReLU, something that doesn't set negatives to zero but to a small value. And then the model does the last linear transformation and — boom — now it's linearly separable, but it's horrible. Each module does something more specific: it does a rotation, a squashing, another rotation; there is a reflection, a bias, and then there is the positive part. Rotation, reflection, scaling, rotation, reflection, bias, positive part, and so on. I've split each matrix multiplication into its components: everything a square matrix does is a rotation, a reflection, a scaling, a rotation, a reflection, and then there is the bias. So I just show you each module — the linear transformation — and then I show you how we cancel all the negative values with the positive part. Question? Yeah... So, I thought each segment — like, rotation, reflection — that's one linear transformation
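Incidentally, the "rotation, reflection, scaling" decomposition of a square weight matrix described above is exactly the singular value decomposition. A minimal numpy check, with a random 2×2 matrix standing in for one of the lecture's layers (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))   # one layer's square weight matrix
b = rng.standard_normal(2)        # and its bias

# SVD: W = U @ diag(S) @ Vt, i.e. (rotation/reflection) @ (scaling)
# @ (rotation/reflection) -- the three stages the animation steps through.
U, S, Vt = np.linalg.svd(W)

# U and Vt are orthogonal: |det| = 1, so each one is a pure rotation
# (det = +1) or a rotation combined with a reflection (det = -1).
assert np.allclose(U @ U.T, np.eye(2))
assert np.isclose(abs(np.linalg.det(U)), 1.0)

# Recomposing the rotations and scaling (plus the bias) recovers the layer.
x = rng.standard_normal(2)
assert np.allclose(x @ W + b, x @ U @ np.diag(S) @ Vt + b)
```

The singular values in `S` are the axis-aligned stretch factors; very large entries correspond to the "weights are so large" stretching seen toward the head of the model.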

Segment 14 (65:00 - 68:00)

accounts for a rotation, a reflection, and a scaling, and at the very end you show some data, piecewise? So we're saying: add them all up... Oh, right. So here I showed you what happens at each single layer: since every layer has two units, I can show you exactly what each layer does. What the initial animation was showing you was the interpolation between my input space and my embedding space, so I showed you the combination: linear, positive part, linear, positive part, linear. This composition of functions behaves as if it performs a linear transformation on specific regions of the plane — namely, this region here undergoes one linear transformation, a second region here undergoes another linear transformation, and so on. So whenever you have a network made of positive parts and linear functions, you're performing a piecewise linear transformation of the input space. Here you can clearly see this because all the hidden layers are very tiny, and so it has this very dramatic effect. But if I explode into a high-dimensional space, the number of regions I can draw increases dramatically, and so it appears as if it's a smooth transformation. It's not: it's still the same piecewise linear map, but it has so many tiny splitting and folding lines that it appears smooth. It's not smooth. Amazing.

So today was a recap lesson. We said that to make things work nicely we need to go into a high-dimensional space. So guess what happens if we input into this model not two-dimensional points but one-megapixel images. How can I go into a higher-dimensional space if I already start with one million components per point? Do I have to go to billions of hidden dimensions? How much computation is that? Too much. So next lesson we're going to go into the future — the 1990s — and figure out how Yann LeCun became famous, because he was smart about what he did and when he did it: he had hardware that sucked, he had no GPUs, he had no data, so he had to be smart. If you don't have resources, you have to be mindful and figure out tricks. That is going to be the next lesson. Thank you so much for listening. Bye-bye!
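As a footnote to the piecewise-linearity point above: with a tiny hand-picked network (weights chosen purely for illustration, not taken from the lecture), one can check numerically that a positive-part network is a single affine map inside one activation region and kinks across a fold:

```python
import numpy as np

# Hand-picked weights (purely illustrative): layer 1 is the identity,
# so the positive part folds the plane along the coordinate axes.
W1, b1 = np.eye(2), np.zeros(2)
W2, b2 = np.array([[1.0, 1.0], [1.0, -1.0]]), np.zeros(2)

def net(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # linear, then positive part
    return h @ W2 + b2                 # second linear map

# Inside ONE region (both coordinates positive, no fold crossed) the net
# acts as a single affine map, so midpoints go to midpoints:
a, c = np.array([1.0, 2.0]), np.array([3.0, 4.0])
assert np.allclose(net((a + c) / 2), (net(a) + net(c)) / 2)

# Across a fold the map switches to a DIFFERENT affine piece, and
# midpoint linearity breaks -- the "chunky" look of the 2-D animation:
a, c = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
assert not np.allclose(net((a + c) / 2), (net(a) + net(c)) / 2)
```

With wide hidden layers the number of such regions grows enormously, which is why the 2 → 100 → 2 animations look smooth even though the map is still piecewise linear.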

Other videos by this author — Alfredo Canziani (冷在)
