CSE 519 --- Lecture 22: Topics in Machine Learning (Fall 2024)


Segment 1 (00:00 - 05:00)

Okay, let me get started. First question: is there any discussion about projects or anything like that? If anyone has questions about projects, I'm happy to talk with you at the end of class.

What I'd like to do now is continue talking about machine learning methods. Last class we talked about decision trees; today I'd like to talk about naive Bayes and support vector machines, and hopefully get into neural networks. Again, different machine learning methods have different advantages and disadvantages. Last class, in the context of decision trees, we talked about boosting. Boosting was a technique for the situation where you have a lot of weak features that could each help a little in a classification problem, and the question is how to combine them.

In a lot of text classification problems, the traditional feature set for a text document was the frequency of each vocabulary word. Often you're trying to classify documents into positive and negative classes; the classic example is how to separate spam from real email. The simplest features you have for that kind of classification are words. The basic feature you'd like, for each vocabulary word, is its relative frequency in each corpus. This is a simple tabulation: if you have a bunch of documents that are spam and non-spam, you can count how many occurrences there are of each of the 10,000 most frequent vocabulary words in spam and in non-spam. When you divide each count by the total number of words in the spam documents and in the non-spam documents, you get a rate for each word, a frequency: what fraction of all words in the corpus is this particular vocabulary word. Once you have the frequencies of different words in different corpora, you can compare which words are specific to a corpus and which are not. For example, spam is often trying to sell you something: perhaps 1% of all words in a spam corpus were the word "sale", whereas in the non-spam corpus perhaps half that many were. The relative frequency governs how often that term is used in that corpus. Any questions?

If you believe that, then the log odds ratio of a word gives you a sense of how different the word is, how unusually over- or under-represented it is in one corpus: the frequency in corpus 1 divided by the frequency in corpus 2. By taking the log of the ratio, the statistic becomes symmetric. Any questions about that? I will tell you that right now I have a project I'm working on with some people

Segment 2 (05:00 - 10:00)

in sociology, where we are analyzing Twitter biographies. You all know what Twitter is, or was, and you may have known that for a long time it was very easy to get Twitter data off an API; a researcher could easily get 1% of all the tweets in the world. With each tweet from the API you got not only the actual contents of the tweet but also the social media biography, the self-description string of the person who was tweeting. You know what I mean by this self-description: when you get on Twitter or Facebook, you're allowed to describe yourself in a short string. So my sociologist collaborator has hundreds of millions of self-descriptions of people, and you can start to look at how people's descriptions change. If you're describing yourself, that tells something about how you see yourself; if I see your biography over time, I see something about how you've changed; and if I count over all of society, I can measure how much things change in general.

We also know from Twitter that people sometimes put their location in their profile, so for some of them we know what state they were in when they posted. Having followed these people for ten years, as we have, we know some of them moved. So now, for any state, we can study in these biographies what is different about the people who came into the state versus the people who left it. Looking around the room, I'd guess most of you were not born in New York State, so you are examples of people who come into a state; there are also people who leave. Can you tell anything about the difference between those kinds of people? The natural tool for that is the odds ratio. For every word in Twitter biographies we know how often it appears among people who arrive in a state and how often among people who leave, and by looking at the odds ratio we can start to get some sense of what is different about the people coming in versus going out. Might there be political things? Do people who leave a state express themselves politically differently than people who arrive, relative to the baseline background in the state? One thing we find, of course, is that a lot of the people who move between states happen to be students; that's heavily over-represented among movers. But everybody should see how this statistic is a readily explainable statistic: if I look at the phrase "Stony Brook", is it more common among people leaving the state or coming into it? That will tell you something about the university. Any questions about that?

So the odds ratio gives us a weight for how important a vocabulary word is as a feature in distinguishing the classes. The word "the" is probably not so important; the word "sale", in the case of spam, probably is. Under this scheme it's easy to take a document and reduce it to a vector of n dimensions, where each dimension tells me the frequency of that word in your document.
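The tabulation just described can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not the actual spam or Twitter data: the two tiny corpora and the epsilon guard for words unseen in one corpus are invented for the example.

```python
import math
from collections import Counter

def word_frequencies(docs):
    """Relative frequency of each word across a list of documents."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_odds(word, freq_a, freq_b, eps=1e-6):
    """log(freq in corpus A / freq in corpus B).

    eps keeps a word unseen in one corpus from producing log(0);
    the log makes the statistic symmetric: swapping corpora flips the sign."""
    return math.log((freq_a.get(word, 0.0) + eps) / (freq_b.get(word, 0.0) + eps))

# Made-up stand-ins for a spam corpus and a non-spam corpus.
spam = ["big sale today buy now", "sale sale limited offer"]
ham = ["meeting moved to tuesday", "notes from the seminar today"]

f_spam, f_ham = word_frequencies(spam), word_frequencies(ham)
# "sale" is over-represented in spam, so its log odds ratio is positive.
print(log_odds("sale", f_spam, f_ham))
```

Words with a log odds ratio far from zero in either direction are the ones that help distinguish the classes; words near zero, like "today" here, carry little signal.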

Segment 3 (10:00 - 15:00)

Okay. We'd now like to classify that word-count vector, or any other feature vector, into one of m classes. To do this we can use the idea that Bayes' theorem gives us. Recall that Bayes' theorem relates the probability of X given Y to the probability of Y given X. If we take the first term to be the class, then the probability that X, the vector describing our document, is in class i is, by Bayes' theorem: the probability of class i, which is the prior (if class i is a rare class that almost nothing belongs to, say Nobel-prize-winning literature, then the probability that any document we see is of that type is low), times the probability that we see this vector given that we're from class i, divided by the probability that this vector occurs at all. That's what Bayes' theorem says, and what makes it interesting is that it turns our question of the probability of the class given the input into the probability of the input given the class. Any questions about that?

So the idea behind using Bayes' theorem for classification is: the class of document X, or feature vector X, is the argmax over classes of the probability of the class, times the probability of the vector given the class, divided by the probability of that vector occurring. Now, here's the tricky thing. I understand what the probability of the class is; that tells us something about how popular different classes are. I understand what the probability of a feature vector might be given that it comes from a particular class; that can come from a compilation of statistics about the class. But the denominator seems weird: what is the probability, over all possibilities, that we see this particular feature vector? It basically conflates the probabilities of the classes with the probabilities of seeing the feature vector within each class, and it seems hard to figure out. But it turns out to be unnecessary, because the probability of X is constant over all classes: if we want to decide which class is most likely, we can ignore the denominator, take just the numerator, and find the class that maximizes it. Any questions?

So how do we figure out the probability of a feature vector? In this particular example we have data about which days are considered beach days. There's a question of whether it's sunny, rainy, or cloudy, what the temperature is, and what the humidity is; some of these days are beach days and some are not. By tabulation you can compute the marginal probabilities: what is the probability that we see sunny on a beach day? We have four beach days, and three of those beach days were sunny, so the probability of it being sunny given a beach day is

Segment 4 (15:00 - 20:00)

3/4. What's the probability of it being rainy given a beach day? Of all four beach days, none happened when it was rainy, so that's zero out of four. And on non-beach days, was it sunny? There were six non-beach days, and only one of them was sunny, so that's that marginal probability. So given data about what we see, it's easy to tabulate these probabilities and determine, for each component of the feature vector, the probability of that value occurring given the class. Any questions?

So we can agree that we can compute the probability that it's rainy on a beach day, and the probability that it's humid on a beach day. What would be the probability over all three variables together? That's where a joint probability comes in: we want the probability of the whole feature vector given a class. If the elements of the feature vector are independent, then we can just multiply these probabilities; if they're dependent in some complicated way, then that's tricky. The principle of naive Bayes is to ignore the potential trickiness, and that's where the "naive" comes in: we assume every component of the record is independent, and we score each document by the product of the probabilities of what we see in each field. Any questions about that? So if we have two vocabulary words, the probability of seeing a certain count of one and a certain count of the other is taken to be the probability of the count for A times the probability of the count for B.

So what is the final version of naive Bayes? We want the max over all classes of the prior times the probability of the feature vector given class i, and we say the probability of the feature vector given class i is the product of the probabilities of seeing each component of the feature vector given class i, which is justified because we're assuming independence and can just multiply. Now, multiplying a large number of probabilities is a bad thing numerically, so we take the log of this term: the log of a product is the sum of the logs, so this becomes the log of the prior plus the sum of the logs of all these terms, and that's the score we compute to figure out how likely each class is for that document. Any questions? So again, in this situation, the probability of a beach day given this particular feature vector is basically the prior probability of a beach day times the probability of each of these fields given a beach day, and on a nice day the beach score comes out three times the non-beach score, so it's three times more likely to be a beach day than a non-beach day. Any questions?

Okay, so what do I like about naive Bayes? I like that it is simple; it's conceptually quite simple. And it's somewhat self-explanatory: you can look at

Segment 5 (20:00 - 25:00)

each component of the record and, by comparing its odds ratio in the two classes, see whether it's an important or unimportant feature. That's why I think this is a good thing to know about. We live in a world where obviously there's a lot of text, and we have these large language models that you can use to classify text, but that's a very expensive thing. This is a very cheap thing, and in many cases it can do a decent job. Any questions?

Good. Now, has anybody ever been to the beach on a rainy day? Someone confess. It's supposed to be a romantic thing, to walk along the beach on a rainy day; I've heard songs about this. I have been at the beach on a rainy day. So what is bad about these marginal statistics? In our data we saw four beach days, none of which happened to be raining, so the probability of rain given that you're at the beach, as observed here, is zero. Now what if you happen to live next to the beach? What if the temperature is hot and the humidity is low, but somehow it's raining? The problem is what happens when you multiply a long chain of things by one zero: the whole product becomes zero. Notice that the vector can't be saved by anything else once you've multiplied it by zero. This is actually a big problem when you develop data sets by tabulation.

What is the probability that the next word I say is going to be "defenestrate"? Does anybody know what the word defenestrate means? This is one of my favorite words; can anyone tell from the picture here? To defenestrate is to throw somebody out a window. If you go to the city of Prague, there was a famous event called the Defenestration of Prague, where apparently they threw the bad guys out the window. So what is the point? "Defenestrate" is a rare word, and rare words are words you're unlikely to have seen, maybe never, in the whole corpus you've encountered so far. Small numbers of observations tend not to accurately capture the frequency of rare events, so there needs to be some technique, when you're coming up with tabulations of counts, to adjust for rare occurrences.

There was another case of this that was kind of funny. Laplace, you know, the man behind the transform, once asked: what's the probability the sun is going to rise tomorrow? Who here thinks the probability is one? I think it's almost one. The time will come when the sun will not rise; at some point it will become a supernova or something like that, or there'll be a change in political administration, I don't know what's going to happen. But the probability of the sun rising tomorrow is not one, just close to one, and the probability my next word is going to be "defenestrate" is not zero, just close to zero. You need some way of adjusting observed counts so that they properly account for this long tail, and this is the idea of discounting. There is a technique here that is very useful whenever you have a data set built from counts, something called

Segment 6 (25:00 - 30:00)

statistical discounting. There are different arguments for how this should be done, so it's somewhat of a philosophical thing, but the simplest technique is something called add-one discounting: before you collect any data, you add one to all possible outcomes, including the ones you haven't seen. Suppose you are drawing balls from an urn. You don't know what's in there; you pull out eight balls, five of them red, three of them green. What is the probability the next ball out of the urn is going to be something other than red or green? You could say we haven't seen it and we have no way of estimating that. Or, using discounting, you say there are three possible outcomes: red, green, and other. You add one to each of these, so the count for red is 5 + 1 = 6, green is 3 + 1 = 4, and other is 0 + 1 = 1, out of 11 total. So a better estimate for the probability of "other" is 1/11; red would be 6/11, and green 4/11. Any questions about that? The good thing is that you no longer get probabilities of zero.

What happens when n is very large? Suppose you have huge numbers of counts: what difference does adding one make? Very little. When you don't have enough data, when the counts are small, it does affect the probabilities a little, but most importantly it doesn't turn things into zeros. Any questions about that? So this is something to keep in mind, because it's a useful thing. Any questions about naive Bayes and discounting?

Okay. There is another technique, which I think we've mentioned before, for building classifiers, which in the pre-neural-network era was the hot thing in machine learning: support vector machines. What is the idea here? We have seen that we are often faced with building a classifier between two different classes of items, and that what we want is a decision boundary, some kind of cut through the feature space that separates one class from the other. Logistic regression gave us a linear classifier; with techniques like decision trees and nearest-neighbor methods, we saw that nonlinear classifiers could be more powerful. A support vector machine is going to build a linear classifier between two sets, sort of like logistic regression, but it uses a different criterion for what a desirable separator is. Logistic regression used as its criterion maximizing a score that had something to do with the probability of misclassification. Support vector machines are purely geometric: they

Segment 7 (30:00 - 35:00)

are going to seek the separator that has the largest margin. They want the linear classifier that finds the biggest gap between the two classes and puts the separating line right in the middle. We talked a few classes ago about where that separating line should be, and there are different philosophies; the support vector machine philosophy is that you put it in the geometric middle, because that makes it less likely you're going to misclassify something. Any questions?

So what is the optimization problem that gets solved in support vector machines? It has a lot to do with things like linear programming. I don't know how much you know about linear programming and the simplex method, but basically in linear programming you are faced with a bunch of constraints that have to be satisfied and an objective function to optimize. What is the constraint that we want our linear classifier to satisfy? The linear classifier is governed by a weight times each variable, plus an additive intercept term at the end; the w and the b define the separating line, or plane, between the two sides. We want to find weights such that, when we multiply them by the actual coordinates of each point and then by the class, where the class is represented by either +1 for the positive class or -1 for the negative class, the result lands on the right side. Multiplying by y_i, when y_i is -1, is the equivalent of flipping the inequality around: if the positive constraint is "greater than or equal to one", multiplying by -1 makes it "less than or equal to minus one". Does everybody get that idea?

So we're going to have a bunch of inequalities. The weights times the point tell us where we are relative to the separator, and the constraint y_i(w · x_i + b) ≥ 1 is going to ensure that, for any given x_i, the value we get differs from the separator by at least one on the correct side: if you're positive, you're above the plane by a distance of at least one, and if you're negative, you're below the plane by a distance of at least one. That is the constraint. What linear programming takes is a set of constraints and an objective function, and it finds the w and the b that satisfy all the constraints while optimizing the objective. There are some technical reasons why you want to minimize some norm on the weight

Segment 8 (35:00 - 40:00)

vector, but the important thing is that this says: I'm going to find a set of weights that classifies all the points correctly, and then also regularizes those weights to the extent that I want. Any questions about that?

So what's interesting about this support vector machine idea? It's looking for the separating line with the biggest channel you can draw between the points. One thing to notice is that the only points that matter are the ones on the boundary. In two dimensions, everybody should see that a point in the interior of the convex hull of all the dark points is never going to be one of the touching points of the channel. Think about what the channel is: there's a line between the classes, and you're expanding it as wide as you can; eventually you've got to bump into something. What stops you from expanding on one side? You bumped into a positive example. How far can you expand on the other side? Until you bump into a negative example. So one thing to note about support vector machines is that the interior points inside the convex hull of each class can all be discarded; those examples don't tell you anything. Only the extreme points of each class are involved in the optimization. Any questions about that?

Good. So we said that logistic regression constructs a regression line, and support vector machines construct a separating line. In this particular case both are separating lines, but they are different. The support vector machine's line sits directly between what amounts to this example on one side and probably these two examples on the other; it gets pinned by a total of three example points. So logistic regression values all the points; support vector machines value only the points at the boundary. Any questions?

Now, let me see if I can use the pen here. What would happen if there were a blue point over here, on the wrong side? What would logistic regression do? Logistic regression is trying to optimize the probabilities; it would say, oh my goodness, this point I'm misclassifying by a lot, I'd better shift my line over. What would a support vector machine do when it had a point in there? It would collapse and die. Why is that? Remember the formulation: a support vector machine was a linear programming kind of thing. You're given a bunch of constraints that you can't violate, and it says find the best possible weights subject to the

Segment 9 (40:00 - 45:00)

constraint that you can't violate any classification. If you give me points that look like this, there is no linear separator between the two classes, no line that would satisfy all of these constraints. So this is one difference: support vector machines as formulated assume there has to be a separator, and they find the best separator given that one exists. Any questions?

That doesn't sound so good, but what do we know? We know that if we have enough dimensions, there is always a separator. Suppose I give you two points in two dimensions, one red, one blue: is there a separator between them? Yes, because you've got two dimensions and only two points; there's room for that kind of thing. But over here it's clear that by the time you get to four points, it's no longer true that there has to be a separator in two dimensions. So what's the idea that support vector machines use? They take your data set in whatever number of dimensions you give it, d, and blow it up to n dimensions: if you have n dimensions and n points, there is going to be a linear separator, just as there is a linear separator for two points in two dimensions.

From when we talked about linear and logistic regression, we already know techniques for adding dimensions. If you have a set of points where one class is an inner circle and the other is an outer circle, and you add the features x^2 and y^2 to x and y, that blows the points up into a space where, as you can see, there is now a linear separator between the red and the blue points. The trick that makes support vector machines useful is that they come with formulaic ways of creating n dimensions and then finding that kind of separator. So how can you take points in d dimensions and convert them to points in n dimensions, where n is your number of training examples? One way to do it: say each point represents a city. You've seen signposts that say it's this far to Tokyo, this far to Shanghai, this far to Dhaka. What's the idea here? If I have n points, then for any input point I can build an n-dimensional vector whose i-th coordinate is the distance from me to city i. So even though the points live in only two or three dimensions, each of the n input points now has an n-dimensional feature vector based on distances.
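Both dimension-adding tricks above can be made concrete in a short sketch. The points and the threshold below are invented for illustration: `lift_distances` maps a point to the n-dimensional vector of its distances to the n training points, and the concentric-circles example becomes separable by a single threshold once x^2 + y^2 is added as a feature.

```python
import math

def lift_distances(query, training_points):
    """Map a d-dimensional point to an n-dimensional vector of its
    Euclidean distances to each of the n training points."""
    return [math.dist(query, p) for p in training_points]

# Toy concentric circles: inner class at radius 1, outer class at radius 3.
inner = [(math.cos(t), math.sin(t)) for t in (0.0, 1.5, 3.0, 4.5)]
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in (0.5, 2.0, 3.5, 5.0)]

def lifted_feature(p):
    """The added nonlinear feature x^2 + y^2 (squared distance to origin)."""
    x, y = p
    return x * x + y * y

# In the lifted space, one threshold (any value between 1 and 9 here)
# linearly separates the two circles.
threshold = 4.0
assert all(lifted_feature(p) < threshold for p in inner)
assert all(lifted_feature(p) > threshold for p in outer)

# The distance lift turns a 2-d point into an 8-dimensional feature vector.
features = lift_distances((0.5, 0.5), inner + outer)
print(len(features))  # 8
```

In a real SVM the distance-style lift is handled implicitly through a kernel function rather than by materializing the vectors, which is the "magic" discussed next.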

Segment 10 (45:00 - 50:00)

By definition, n points in n dimensions have to be separable, this linear-programming-like idea will find the best separator, and that is what support vector machines are. Any questions?

That all seems quite plausible, but there is one other area of magic that comes with support vector machines. It would seem like I'd have to actually do this: take my n points and explicitly build each n-dimensional feature vector. But by using the right functions and algebra appropriately, there are ways to do this without ever building up the whole feature vector; you build up what you need as you search for the separator. Any questions about that? So this is why support vector machines are good for classification: they are linear, but linear in a nonlinear feature space, and they work out quite well. Any questions about support vector machines?

The last classification method I'd like to talk about is, of course, neural networks, which today are the thing that everybody knows about and talks about. What is the idea of a neural network? We think of it generally as a machine represented by a directed acyclic graph, so information flows in one side and out the other. The input values are given at the input layer; the output value or values come out of the last layer, the output layer; and in between there are layers of computations that we call hidden layers. Between nodes on adjacent layers there are edges, and each edge is given a weight, a strength, which denotes how important the value in that particular slot is to the node it feeds. Learning a neural network is basically learning the parameters, the weights associated with each of these edges. Any questions?

So what are the vague principles associated with these neural networks? One: if the topology of your network has a large number of edges, and the weight of each edge is a parameter, then you have a large number of parameters. What is good about that? Neural networks can exploit large data sets, because they have large numbers of parameters. Now, note that nearest-neighbor methods also had large numbers of parameters: in a nearest-neighbor method you were given n points, each of d dimensions, so n times d numbers; if you had a large number of points, you had a large number of parameters, and in that sense there's some vague level of similarity. How did you find the nearest neighbor in a data set of n points, each of d dimensions? In general, you had to compare your query point against every one of the input points
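That brute-force search, comparing the query against every stored point, looks like this (the points here are made up; the thing to notice is that every query touches all n times d stored numbers, much as every weight in a network gets touched during inference):

```python
import math

def nearest_neighbor(query, points):
    """Brute-force nearest neighbor: touches all n points of d dimensions,
    so each query costs O(n * d), linear in the number of 'parameters'."""
    return min(points, key=lambda p: math.dist(query, p))

data = [(0, 0), (5, 5), (2, 1), (9, 3)]
print(nearest_neighbor((2, 2), data))  # (2, 1) is closest to (2, 2)
```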

Segment 11 (50:00 - 55:00)

and that was deemed slow. Now what are you doing when you use a big neural network? You have a large number of parameters, and when I give you an individual point to classify, you've got to touch every one of those parameters: you're computing dot products and so on. So just as inference in nearest-neighbor methods was expensive, inference in neural networks is in some sense the same cost: both are roughly linear in the number of parameters you have. The reason we don't worry about this is that people spend lots of money on GPUs, but the inference cost of neural networks is analogous to that of nearest-neighbor methods. In some sense the parameters in a neural network are storing the examples it has seen before, and it is trying to interpolate between those examples, sort of like a nearest-neighbor method. So there is a level on which they are not as wildly different as you might think. That's one way to think about it.

What else is good about neural networks? The depth of the networks turns out to be important: it means they can build up hierarchical representations of the objects they are working on, be it text or images, moving to progressively higher levels of representation as you move up the network. The other thing that's great these days is that there are enough pretrained models for vision and language that, for a lot of things in the world, you don't need to train your own models anymore. Training models is expensive: it requires data and some level of technical expertise. In homework three you did vision and NLP, and it didn't require any training expertise; it required calling an API. That is a great thing about modern models. Any questions?

As we've said, the parameters in a neural network are the edge weights, and to have a rich model we like having large numbers of parameters. When you connect n nodes on one layer completely to m nodes on another, you get n times m parameters, so it is very easy to define models with large numbers of parameters. If you had a network with a thousand nodes at every layer and ten layers, that would give you something like ten million parameters. It's easy to define such networks; what's amazing is that, with the proper magic, you can actually fit those parameters in an interesting way. Any questions?

What is the computation at each node in the network? Call a node on the current layer j. It has a set of edges coming into it from the layer before it, each going from some node i on the previous layer to j, and each edge (i, j) has a weight. Each node i on the previous layer has a

Segment 12 (55:00 - 60:00)

value, and we are basically computing the dot product between the weights and the values: the weight of every edge coming into me times the value of the node at its other end. On top of this dot product there is typically one extra parameter per node called the bias. The advantage of this constant is that it functions as a kind of threshold: you can change the behavior of what gets passed on to the next layer just by changing the bias parameter. It also matters for degenerate inputs. What happens if the input were all zeros? Anything times zero is zero, so if you give the network a record of essentially all nulls, very little information gets passed up to the higher levels. With a nonzero bias, something gets passed up even then. Any questions?

The other component that's important about neural networks is that they involve some level of nonlinearity. Dot products are linear things: multiplying values by weights and adding them together. What neural networks do is pass the output of every node through some kind of nonlinear transformation, perhaps a sigmoid curve or, more common these days, a rectified linear unit (ReLU). What does a ReLU do? For a negative input it outputs zero; for a positive input it passes the value through unchanged. So if we pass values through a ReLU, negative values get zeroed out and positive values get passed through.

Why do we need the nonlinearity? This was something that wasn't clear to me when I first heard about neural networks, so let me give you the example that convinced me it does something good: if you don't have the nonlinearity, the depth of a network doesn't matter. Let me kind of prove that to you. Here we've got a network of two levels with two inputs and six weights in all. My claim is that this two-level network with six weights can be replaced by a single-layer network with fewer weights. What is the combined weight going into the output? It is the sum w21·w11·x1 + w22·w12·x2: we can multiply out the constants that each input accumulates along its path. If there is no nonlinear term between the layers, we can keep reducing the number of layers while computing exactly the same thing, eventually getting down to one. So nonlinear functions are useful because they let us realize kinds of functions you cannot get from purely linear operations. Any questions?
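The node computation and the collapse argument can both be sketched in a few lines. This is my own illustration with invented weights, not the lecture's figure: each node computes a dot product plus a bias passed through ReLU, and without the ReLU, two stacked linear layers compute exactly what one layer with the multiplied-out weights computes.

```python
# The per-node computation just described: dot product + bias, then ReLU.
def relu(v):
    return v if v > 0 else 0.0

def node(weights, values, bias):
    return relu(sum(w * x for w, x in zip(weights, values)) + bias)

# One linear layer with NO nonlinearity: a plain matrix-vector product.
def linear(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W1 = [[1.0, 2.0], [3.0, 4.0]]   # first layer: 2 inputs -> 2 hidden nodes
W2 = [[0.5, -1.0]]              # second layer: 2 hidden nodes -> 1 output
x = [2.0, -1.0]

two_layers = linear(W2, linear(W1, x))

# Multiply out the constants along each input's path, as in the argument above;
# the combined single layer computes the exact same function.
W_combined = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
              for i in range(1)]
one_layer = linear(W_combined, x)
print(two_layers, one_layer)  # identical results
```

Inserting `relu` between the two layers breaks this reduction, which is exactly why the nonlinearity makes depth matter.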

Segment 13 (60:00 - 65:00)

Yes, you're asking: if you didn't have a nonlinear function, is there a proof of what you can and cannot do? So there's the question of what you can do with these nonlinear functions. My understanding is that there is a proof that, with enough nonlinear units, you can approximate any function, for some value of "approximate" and "enough." The important point is that you can create rich sets of functions, much richer than just linear functions. There are these representation theorems saying you can come arbitrarily close to any function given a sufficiently complicated network. What that means for your task, given a machine with a GPU of a certain capacity, is unclear, but it is clear that you can represent very rich functions this way. Any questions?

What I find instructive, to try to understand the process of fitting a network, is to look at some very simple networks and see what they are doing. These examples come from a book by Stephen Wolfram, the guy who did Wolfram Alpha (and, if you look at our Wikipedia pages, it turns out he and I invented the iPad together). More important, he wrote a nice little book on how ChatGPT works, and this example is from there. What is this network computing? It has two inputs, the input layer, and one output. In the drawing, the value of each weight is shown by its thickness and color, and you'll notice the different edges have different weights: it is a drawing of a trained network, showing what it computes if the input is 0.5 and 0.8.

Now how do we interpret this? The output node depends on three terms. The total value is: the first output weight times a nonlinear f of everything below it, plus the second weight times a nonlinear f of everything below it, plus the third weight times a nonlinear f of what's below it. When you work through it and ask what function this neural network computes, it is a linear-looking expression that is nonlinear only because of that function f, the nonlinear function we apply at each node. And f can be any of these nonlinear functions, the ReLU or others. Any questions about that? Training the neural network meant figuring out what all these weights w were. The inputs were x and y, and at the lowest level each node computes a bias term plus some weight times x plus some weight times y. Any questions about what the network looked like? Now, what happens when you try to fit a complicated function with a network? Here is the situation: we
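The three-term expression just described can be written out directly. This is my own sketch with invented weights (the drawing's actual trained weights aren't reproduced in this transcript); f is the nonlinearity applied at each hidden node.

```python
# The nonlinear function f applied at each hidden node (here a ReLU).
def f(v):
    return max(0.0, v)

def forward(x, y):
    # Each hidden node computes h_i = bias_i + wx_i * x + wy_i * y,
    # then the output is out_b + w1*f(h1) + w2*f(h2) + w3*f(h3).
    hidden = [(0.5, 1.0, -1.0),   # (bias, weight on x, weight on y)
              (-0.2, 2.0, 0.5),
              (0.1, -1.5, 1.0)]
    out_w = [1.0, -0.5, 2.0]      # output weights on the three hidden values
    out_b = 0.3
    h = [f(b + wx * x + wy * y) for b, wx, wy in hidden]
    return out_b + sum(w * hi for w, hi in zip(out_w, h))

print(forward(0.5, 0.8))
```

Training would mean adjusting the eleven numbers in `hidden`, `out_w`, and `out_b` so that `forward` matches the target function on the training data.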

Segment 14 (65:00 - 70:00)

have a plane here where we're carving up the x-y plane into three regions: a red region, a blue region with value minus one, and an orange region with value zero. Does everybody see that this colored picture defines a function on x and y? It's a nonlinear function: it has sharp transitions. When you train a very simple network on it, you get only a crude approximation, but once you use a network of sufficient complexity, it has enough capacity to fit this data quite well. If you don't have enough parameters in your network, or enough layers to create something sufficiently nonlinear, it doesn't quite capture the output. So the situation is: make the network bigger and deeper and it can recognize more and more complicated functions. This is why networks keep getting bigger as they try to model language better and better: the function you are trying to learn is complicated, and you need a certain level of complexity to represent it. Any questions?

Now, how do you implement a neural network? There's a book I strongly recommend: I read it last summer, and it is a wonderful book on deep learning. If you're interested in deep learning, which I guess many of you are, get your hands on Prince's book. He actually lets you download it for free from his website, so don't be afraid to do that. From it I took this implementation of how you would encode and train a two-level neural network using PyTorch, a typical framework. What's interesting to me, first of all, is that this isn't rocket science: it's one page describing how to train the thing. With sufficient tools, these networks can be specified fairly easily.

What does the implementation actually do? First we specify the topology of the network: the number of nodes on each layer. In this case the input layer has 10 nodes, the output layer 5, and the intermediate layer 40. We take a linear combination from the input layer of size 10 to an intermediate layer of size 40, then add a ReLU, the nonlinear term; then add a fully connected layer from those 40 nodes to 40 other nodes, add another ReLU, and finally compute the linear function from those 40 nodes to the 5 output nodes. That specifies the model. There is then the question of how to initialize the weights. Before you train on anything, these weights have to have some value, and part of the art of training a

Segment 15 (70:00 - 75:00)

neural network is setting the initial weights in a way that produces good gradients and lets training proceed; this is part of the magic involved. But in these libraries there are standard schemes: Kaiming (He) normal initialization is a popular and successful one, and this line specifies how the weights get initialized. You apply that initialization to the model, and now you have a model before any training.

Then what do I specify? My loss function: in this case I want to minimize the mean squared error between my training data and my current predictions. I'm going to use stochastic gradient descent, which, remember, was the idea that to find the best set of parameter weights given the input, we adjust the parameters according to the derivative, the gradient, to make sure we're going downhill. There are certain parameters to this optimizer. Momentum means making the decision of which way to go based not just on the derivative at the current point but on where we were going before: try to keep going in the same direction. The scheduler governs the learning rate of the optimizer: once the derivative gives me a direction to walk in, how far I walk is governed by the learning rate, and the scheduler changes the learning rate depending on how I'm making progress toward my goal.

In this particular example the code trains on random data; that's not normally a good thing, but this is just an example. It generates 100 random data points with x and y values as training examples. The data loader is what takes my input data and channels it out into batches in appropriate ways: there is a batch size, how many elements I take for each step of stochastic gradient descent, and a choice of whether to shuffle the examples each time or go through them in order.

Those are the basic steps for setting it up. What does the actual training of the network do? It goes through the data set a certain number of times; each pass through the data set is called an epoch. Within each epoch, for each batch, we get that batch of data, compute the gradient by comparing the current model's predictions on those examples to the targets, then apply the gradient to update the model, and keep going. That is what does the actual training, once everything is set up. Finally, at the end of each epoch, we step the scheduler, which may decide to change the learning rate, and we continue until done. Any
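The PyTorch listing itself isn't reproduced in this transcript, so here is a minimal pure-Python stand-in for the same skeleton: random training data, minibatches, an epoch loop, a gradient step with momentum, and a learning rate that decays each epoch. All names and constants are invented for illustration; the lecture's example used PyTorch.

```python
import random

random.seed(0)

# 100 random training points for a toy one-parameter model y = w * x,
# where the true weight is 3.0.
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(100))]

w = random.uniform(-1, 1)     # "initialize the weights" before any training
lr = 0.5                      # learning rate
momentum, velocity = 0.9, 0.0
batch_size = 10

for epoch in range(20):                       # each pass over the data = one epoch
    random.shuffle(data)                      # shuffle the examples each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the mean squared error (1/B) * sum (w*x - y)^2 w.r.t. w.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        # Momentum: keep some of the previous step's direction.
        velocity = momentum * velocity - lr * grad
        w += velocity
    lr *= 0.9                                 # scheduler: shrink the step each epoch

print(w)  # w should now be near the true weight 3.0
```

The real PyTorch version replaces the hand-written gradient with autograd, but the loop structure (batches inside epochs, an optimizer step, a scheduler step per epoch) is the same skeleton described above.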

Segment 16 (75:00 - 78:00)

questions about what this code does? Yes: why do you change the learning rate only after each epoch? Well, once you've gone through all the data you're closer to where you want to be, and you have some information about how much the error decreased over the course of that pass. It makes sense that when you're getting close to the target you want a smaller step size than when you're far from it, and what this says is: I'm going to reconsider that decision after every pass through the input, and then decide whether to increase or decrease my learning rate. Any other questions?

What's interesting to me is that this is amazingly little code to specify all of this. It's hiding a lot of smart things under library functions: all the calculus of defining gradients, the magic of how you initialize the weights, and so on. But with modern tools, specifying these networks is not that horrible a task. Any questions about this framework?

We talked in the past about stochastic gradient descent: we were looking for the set of derivatives that would point the way downhill. When we used it for linear regression and logistic regression, the landscape we were optimizing over was convex, which meant there was only one local minimum, so stochastic gradient descent would find the global optimum. With neural networks we have these nonlinear functions inside, and the net result is that the optimization problem we're trying to solve is no longer convex: there are many sets of parameters that are local minima, which we can get stuck in. That said, it seems that somehow, if you have enough parameters, the local optima these methods find are still pretty good, even if not global. Any questions about how neural networks work? In that case, I think I'll stop here.
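As a small aside illustrating the convexity point from the end of the lecture: on a nonconvex loss, plain gradient descent can land in different local minima depending on where it starts, unlike the convex losses of linear and logistic regression. This toy one-parameter example is my own, not from the lecture.

```python
# A nonconvex "double well" loss with two local minima, at w = -1 and w = +1.
def loss(w):
    return (w * w - 1.0) ** 2

def grad(w):
    return 4.0 * w * (w * w - 1.0)

# Plain gradient descent from a given starting point.
def descend(w, lr=0.05, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Different starting points reach different minima.
print(descend(0.5), descend(-0.5))  # near +1.0 and near -1.0
```

Both endpoints have zero loss here, which loosely mirrors the observation above: with enough capacity, the local optimum you land in can still be good.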

Other videos by this lecturer — Steven Skiena
