Deep Neural Network

Segment 1 (00:00 - 05:00)

Good morning. In the last two classes on neural networks, we looked at single layer and multilayer neural networks and the back propagation algorithm. Today we will talk briefly about deep neural networks. This is a fairly new topic that has attracted a lot of interest, and I will give you a brief exposure to some very basic deep neural architectures and tell you what it is all about. In recent years there has been a lot of excitement around deep learning and deep neural networks, with breakthrough results in various domains: first in image classification, then in speech recognition, and finally in natural language processing, for example machine translation, multi-modal learning, generating text from speech or from images automatically, and many others. And the excitement continues. So what is a deep neural network? Very simply speaking, it is a network which is deep, that is, which has many hidden layers. We have seen feed-forward networks with hidden layers, and we noted in the last class that a network with two hidden layers is enough to represent any function. But it may not be easy to learn all complex functions using a network with two layers. As you increase the number of layers in the network, it becomes easier to represent complex functions in terms of relatively simple functions. So you can have a network with many hidden layers and call it a deep network. Why was this not an attractive proposition to start with? People found that with a deep network, the error surface becomes very rough and has many local minima, so you cannot always converge; convergence is extremely slow, giving slow training.
And there was one other very important issue. Suppose you have the input, the hidden layers H1, H2, H3, H4, and then the output, with connections from each layer to the next. In back propagation, the error at the output is propagated backwards through the layers, and based on this error the weights are changed. But as you go further back towards the input, there are fewer and smaller updates of the weights in the initial layers, and one of the reasons is diffusion of gradient. When you do gradient descent, you compute the error gradient, but the gradient can get very diffused, that is, attenuated, as you go deeper from the output towards the input. If you use a function like the sigmoid, you will see that in most regions its gradient is almost 0; only in a small central region does the gradient have a decent value. So the gradient is a small value, and as you propagate it backwards it becomes attenuated and very close to 0, and very little change happens in the early layers. In certain cases the gradient can also grow very large, which is also a problem, though that one can be handled by simpler methods such as gradient clipping. But diffusion of gradient and slowness of convergence are real problems when you use a deep network.
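A small numerical illustration of why the sigmoid causes gradient diffusion (my own sketch, not from the lecture): the sigmoid's derivative is at most 0.25, and back propagation multiplies one such factor per layer, so even in the best case the gradient reaching the early layers shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, attained at x = 0

# Back propagation picks up one sigmoid-derivative factor per layer;
# even at the best case (0.25 per layer) the product shrinks
# geometrically with depth.
depth_grad = {d: 0.25 ** d for d in [1, 5, 10, 20]}
for d, g in depth_grad.items():
    print(f"{d:2d} layers: upper bound on gradient factor = {g:.2e}")
```

In practice the factors are usually far below 0.25 (most units sit in the flat regions of the sigmoid), so the real attenuation is even worse than this upper bound.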

Segment 2 (05:00 - 10:00)

And the breakthrough in using deep networks came because people came up with methods which could handle this problem in various ways, some of which we will discuss today. First of all, let us see why a deep representation is good. In many tasks, hierarchical representations help represent complex functions. For example, in natural language processing we have an input sentence; we can first identify the words, look at the part of speech or morphology of each word, then identify chunks or small phrases, then larger phrases or clauses, and then the sentence, processing in this order along the NLP pipeline. In image processing, from the pixels you can first find the edges in the image, then textons, then motifs, then parts, then objects; if you do it hierarchically the process may become simpler. So in deep learning we are basically learning a hierarchy of internal representations. We can think of the nodes at every hidden layer as representing certain features: the first hidden layer gives some low level features, the next layer's features are composed of those low level features, and so on, giving a hierarchy from low level to higher level features. The learned internal representations can be thought of as features, and the hidden layers act as feature extractors. So deep learning can be looked upon as feature learning: we have the input, then a stack of trainable feature extractors, and finally at the output a classifier, if we have a classification problem. Now, one of the ways in which deep networks can be trained is by using unsupervised pre-training. There are several approaches; one is to set up the units so that the gradient diffusion or gradient blow-up problem is reduced.
That is, one can use activation functions other than the sigmoid; for example, people use a function called ReLU, which, as we will see, avoids the problem of gradient diffusion. The other approach is unsupervised pre-training: instead of learning a deep network in one go, we learn the network one step at a time using networks that are not too deep, and by growing the layers one at a time we can avoid the problems which deep networks have. This slide shows the schematic of greedy layer-wise pre-training. This is the input. We first learn the first hidden layer by some mechanism; one mechanism is the autoencoder, which I will discuss briefly today, and another is the restricted Boltzmann machine. By several such approaches it is possible to first learn this layer. Once learned, you fix this layer; in terms of it you learn the next layer, then fix that, then learn the next layer, and so on. So you fix the parameters of the previous hidden layers, assume they give you features, and then try to find the next layer. This is the idea of greedy layer-wise pre-training. After you have learnt the layers individually, you can finally look at the classification problem, and based on the training examples you can set up the classifier at the last layer

Segment 3 (10:00 - 15:00)

of the output layer, and then tune the whole network with back propagation. Initially you do greedy layer-wise training; finally you set up the classification problem using the training examples and then tune the entire network. This is the scheme followed by the stacked restricted Boltzmann machine and the stacked autoencoder. So under deep neural networks, what we will look at is the feed-forward neural network and the stacked autoencoder, which I will talk about briefly; the stacked restricted Boltzmann machine we will not talk about in this class, but that is another approach; and very briefly we will talk about the convolutional neural network. Here is an example of a deep architecture, a multilayer perceptron: an input layer comprising the raw sensory input, three hidden layers which learn more abstract representations as you go upwards, and finally the output layer. Now, we have seen that in a neural network there is back propagation of error; we have already talked about it. When you train deep networks, there are some problems. The early layers of the network do not get trained well because of diffusion of gradient: the error attenuates as it propagates to earlier layers, training is very slow, and the error signal to the earlier layers drops quickly as the top layers mostly solve the task, so not much update happens in the beginning layers. And there is often not enough labeled data: a deep network has many parameters, and we have seen that too many parameters with too few training examples may lead to overfitting. However, you can take advantage of unlabeled data instead of using only labeled data, which may be limited in number, and this is what some of these approaches do. Also, deep networks tend to have more local minima problems than shallow networks. So these are some problems with deep networks.
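The lecture noted earlier that people use the ReLU to avoid gradient diffusion; a quick numerical look at the two gradients (my own sketch, not from the lecture) shows why:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # vanishes for |x| large, at most 0.25

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for every positive input

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid'(x):", np.round(sigmoid_grad(xs), 4))
print("relu'(x):   ", relu_grad(xs))
# For active (positive) units the ReLU passes the gradient through
# unattenuated, so stacking many layers does not shrink it geometrically.
```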
Now, as I said, we have talked about the sigmoid function; there are also the tanh function, the ReLU function, and the softplus function, among other activation functions. The blue line on the slide is the sigmoid function, which we have discussed; it lies between 0 and 1. The tanh function is also an S-shaped function similar to the sigmoid; it ranges from minus 1 to 1. The ReLU function, the red curve, is 0 when the input is less than 0, and after that it is a linear function. And the light blue curve is the softplus function, a smooth function similar to the ReLU. These last functions are unbounded: when the input is high they become very large. They have recently been used to overcome the diffusion of gradient problem in deep neural networks. The sigmoid we have already seen; the tanh function is given by its own equation; and the ReLU function is given by max(0, x). The ReLU simplifies back propagation and makes learning faster, and it is often the preferred option when you have many hidden layers in a neural network. Now let us very briefly talk about the autoencoder. Suppose you have unlabeled training examples x1, x2, x3, and so on, which are all real-valued vectors. In an autoencoder we will learn the initial hidden layers based on the unlabeled inputs only. How do we set it up? Suppose Xi is an input and H1 is the hidden layer; we set up a problem where Xi is both the input and the output. So we set up a network to learn the identity function, with connections from the input to the hidden layer and from the hidden layer to the output. You can see that this H1 will act as a representation of the input, because from

Segment 4 (15:00 - 20:00)

H1 the input can be reconstructed. Now, you might be tempted to think that the network can learn a very simple trivial function which just takes each unit and transfers it through, but we want to prevent that using some constraints. We want to put constraints so that H1 learns some interesting, nontrivial function at its nodes. This function will still be a representation of the input, but the constraints will make it nontrivial; we will discuss them shortly. Once we learn this autoencoder, we have basically a neural network with one hidden layer, which is easy to train; we then fix the weights W1. Now we remove the output part. Given the input, we have the value of H1, and we set up another learning problem with a hidden layer H2: with H1 and W1 fixed, we put H1 at the output. For every training example we compute the value of H1, put that same value at the output to generate a training example, and then learn the weights; let us call them W2. After that we fix this layer and these weights and remove the output. So we learn layer by layer; similarly we learn H3 and W3. Finally, after we have finished learning these unsupervised layers in a layer-wise fashion, for each training example we find the value of H3 and bring in the output. For the output we train a two-layer neural network or some other function such as an SVM or linear regression, and we get the weights W4. Up to this point we have used only unlabeled data; here we bring in the labeled data, and then we have a deep neural network. If we wish, we can fine-tune this network by feeding in the input-output pairs. So this is the idea of the autoencoder: we have unlabeled training examples and we set the target values equal to the inputs.
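The layer-by-layer procedure just described can be sketched with a bare-bones tanh autoencoder (a toy illustration; the sizes, learning rate, and the `train_autoencoder` helper are my own assumptions, not the lecture's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, steps=2000, lr=0.05):
    """Fit one autoencoder layer: learn to reconstruct X through a bottleneck."""
    n_in = X.shape[1]
    W = rng.normal(scale=0.1, size=(n_in, n_hidden))   # encoder weights (kept)
    V = rng.normal(scale=0.1, size=(n_hidden, n_in))   # decoder weights (discarded)
    for _ in range(steps):
        H = np.tanh(X @ W)                 # hidden code H
        err = H @ V - X                    # reconstruction error: target = input
        gV = (H.T @ err) / len(X)
        gW = (X.T @ ((err @ V.T) * (1 - H**2))) / len(X)  # tanh' = 1 - tanh^2
        V -= lr * gV
        W -= lr * gW
    return W, np.tanh(X @ W)

# Greedy layer-wise pre-training: each new layer learns to reconstruct
# the frozen codes produced by the layer below it.
X = rng.normal(size=(200, 16))             # toy unlabeled data
codes, weights = X, []
for n_hidden in [8, 4]:                    # H1 has 8 units, H2 has 4
    W, codes = train_autoencoder(codes, n_hidden)
    weights.append(W)                      # fix W, then go one layer deeper
print([W.shape for W in weights], codes.shape)
```

After this loop, a classifier trained on `codes` with labeled data would complete the network, and the whole stack could be fine-tuned with back propagation.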
So we have x1 as the input and x1 again as the target, and we train the first hidden layer so that it represents x. Now, as I said, the solution to this autoencoder may be trivial, and we want to impose constraints to prevent that. One constraint could be to set the number of nodes in the hidden layer to be much smaller than the number of inputs: if the input has 100 dimensions, we can put 20 dimensions in the hidden layer, and if the function can be represented in 20 dimensions it is a nontrivial function. Another very popular constraint is the sparsity constraint. First of all, with respect to one neuron, we say that the neuron is active if its output value is close to 1. Among all the training examples, the neuron will be active on some of them. We can put a constraint on the fraction of training examples on which the neuron can be active; that is, we constrain the neuron to be inactive most of the time and to be active only for some training examples, which have certain commonalities. So we can impose this sparsity constraint on the network and learn weights which satisfy it, that is, preventing the trivial mapping to

Segment 5 (20:00 - 25:00)

be learned. This is another way of imposing a constraint. As we saw, we are stacking different layers, and it has been found that these sparse autoencoders can be stacked easily, whereas the previous constraint, where you constrain the number of hidden units, is not so amenable to stacking. So we can add such constraints, put in the input, learn the first hidden layer, and fix the weights from the input to the first hidden layer. After fixing the weights, at the end we learn the classifier. Of course, we can have more than one hidden layer: by stacking autoencoders we get a first hidden layer, then a second hidden layer, and so on. So we learn layer by layer using unlabeled examples; finally we do supervised training on the output layer and then tune the entire network. This is the idea of the stacked autoencoder, and it is one way of learning a deep network which avoids the standard problems that deep networks have. As I said, there is also the stacked RBM, where we use energy minimization to learn layer by layer using a bidirectional neural network, but we will not discuss that in this class. Now we will briefly talk about another approach to deep neural networks: the convolutional neural network. Convolutional neural networks have been very popular for working with images, and they have also been applied to other areas including natural language processing. A convolutional neural network consists of a number of layers, and these layers have specific functions. Two types of layers are very popular: the convolutional layer and the subsampling or pooling layer. The input to a convolutional layer is, say, an image; it could be something else, but let us talk about an image.
Suppose there is an image of m by m pixels, and for each pixel we may have a number of channels; for example, in an RGB image there are three channels. So the number of inputs corresponding to the image is m by m by r, where r is the number of channels. In the convolutional layer we use k filters, that is, k functions, and each of these functions does a mapping. Let us explain how this convolution works. For simplicity, assume we have an m by m image with only one channel. In convolution we take a sub-rectangle of the image; suppose we take a 4 by 4 sub-rectangle, and we can take different 4 by 4 sub-rectangles. Corresponding to one such sub-rectangle we have 16 inputs, and these 16 inputs feed one unit of the feature map K1 through 16 weights w1 to w16, giving one output. Now, for the other sub-rectangles we also apply the same function K1, and they share the same weights w1, w2, ..., w16. So in convolution, for a particular kernel, for every sub-rectangle we will

Segment 6 (25:00 - 30:00)

compute the function, and the weights of this function are the same across all these sub-rectangles. That is the convolution: we take a mask and convolve it at every position of the image to get the convolution output. Because the weights are shared, the number of parameters is small. You can think of it as trying to find a particular pattern that could be in this place of the image, or that place, or another place. So this is the convolutional layer. The output of the convolutional layer is n by n by q, where n can be equal to m or less than m, and q is the number of output channels, one per kernel. Instead of looking at a sub-rectangle, we can look at a volume covering all input channels and convolve over it to get the output. And we can have several such filters, which we call kernels: one function K1, another function K2 with a different set of weights, and so on. In the picture, we have these sub-rectangles or sub-volumes, and for one convolutional layer we compute this function, sharing the weights between the different local fields. Next, we can do pooling: we want to know whether a particular pattern occurs anywhere in the image. For K1 we have outputs from different parts of the image, and some of these outputs can be combined using max or average; if you use max, it is called max pooling. Suppose a filter corresponds to an edge; that edge can be here in the image or there in the image, and if you take the max, it tells you whether the edge occurs anywhere in that region. So we do pooling to take the n by n by q output and reduce it further. Pooling or subsampling takes the outputs of a function and combines them into a smaller value.
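The convolution-plus-pooling operations described above can be sketched in a few lines (a minimal illustration; the 6 by 6 image and the edge-like kernel are made up, and, as in CNN practice, the kernel is applied without flipping):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution with a single shared-weight kernel."""
    m, k = image.shape[0], kernel.shape[0]
    n = m - k + 1                      # output is n x n, with n < m
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # the same weights are applied to every k x k sub-rectangle
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

def max_pool(fmap, p=2):
    """Non-overlapping p x p max pooling."""
    n = fmap.shape[0] // p
    return fmap[:n*p, :n*p].reshape(n, p, n, p).max(axis=(1, 3))

image = np.arange(36.0).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0],   # a hypothetical vertical-edge filter
                        [1.0, -1.0]])
fmap = conv2d(image, edge_kernel)      # 6x6 image -> 5x5 feature map
pooled = max_pool(fmap, 2)             # 5x5 feature map -> 2x2 after pooling
print(fmap.shape, pooled.shape)
```

Note how the kernel has only 4 parameters no matter how large the image is; a fully connected layer between the same input and output would need thousands.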
So a convolutional layer consists of a rectangular grid of neurons; each neuron takes input from a rectangular section of the previous layer, and the weights of this rectangular section are the same for each neuron in the convolutional layer. In pooling, we use the features obtained after convolution: the pooling layer takes small rectangular blocks from the convolutional layer and subsamples each to produce a single output from that block. For example, after convolution we get a grid of values; we divide it into small sub-rectangles and pool each one. From one sub-rectangle max pooling gives us 6, from the others 8, 3, and 4; after pooling we have reduced the image size. As an example, consider the task of recognising handwritten digits. We have an input image of 32 by 32; we do a convolution with five features and get five 28 by 28 feature maps. Then we do 2 by 2 subsampling, taking each 2 by 2 neighbouring rectangle and max pooling it, reducing the maps to 14 by 14. Then we again do a 5 by 5 convolution and get 10 by 10 feature maps, and again do subsampling. So convolution and subsampling layers can be stacked together. We can also optionally have some layers which are fully connected feed-forward layers, as we have seen earlier, and finally we have the classification. So this is one of the architectures, one configuration, of a convolutional neural network, consisting of convolutional layers, pooling layers, and fully

Segment 7 (30:00 - 35:00)

connected layers. Of course, the convolutional layers will always be there, stacked together; pooling may or may not be there, and the fully connected layers may or may not be there. Let us look at some properties of convolutional neural networks. They take advantage of the sub-structure of the input, which is achieved with local connections and tied weights. The convolution is over a local rectangle: it looks for some feature in a local region, and it tries to find the same feature in different regions, which is achieved by tying the weights, that is, by using the same weights for one rectangle as for another. This results in translation-invariant features. Convolutional neural networks are easier to train and have many fewer parameters than the corresponding fully connected network, because the weights are shared. Then there are many other deep neural network architectures; of course, we will not be able to cover everything. I will just mention recurrent neural networks, which are very useful for representing temporal sequences, as in speech, sequences of words in a sentence, videos, and so on. In a recurrent neural network we have an input, an output, and a hidden layer, and we want to capture the input from the previous time steps. One way to do this is to take the inputs at x(t-1), x(t-2), ..., x(t-k): we pick a fixed-size window, treat it as the input, feed it to a hidden layer, and produce the output. But this constrains us to taking input from a sliding window. Alternatively, we could take input, in principle, from any time in the past, infinitely far back, by having a connection from H to H.
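The unfolded recurrent computation with shared weights can be sketched as a forward pass (a minimal illustration with made-up sizes; training would use back propagation through time, which is not shown):

```python
import numpy as np

rng = np.random.default_rng(2)

n_in, n_hidden, n_out, T = 3, 5, 2, 4   # hypothetical sizes and sequence length
U = rng.normal(scale=0.1, size=(n_in, n_hidden))      # input  -> hidden
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden (recurrent)
V = rng.normal(scale=0.1, size=(n_hidden, n_out))     # hidden -> output

xs = rng.normal(size=(T, n_in))         # a length-T input sequence
h = np.zeros(n_hidden)                  # initial hidden state
outputs = []
for t in range(T):
    # the same U, W, V are shared at every time step:
    h = np.tanh(xs[t] @ U + h @ W)      # h_t depends on x_t and h_{t-1}
    outputs.append(h @ V)               # y_t
print(np.array(outputs).shape)
```

Because `h` is fed back through `W`, the output at time t can in principle depend on inputs arbitrarily far in the past, which is exactly what the fixed sliding window cannot do.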
So this is x_t, this is h_t, and we have a connection from h to itself. If we look at the picture, we can unfold it: this is x(t-1), x(t), x(t+1), and these are the hidden states at t-1, t, t+1. The weights are shared: there is the input weight U, the output weight V, and the recurrent weight W. This is the recurrent neural network. You can unfold it indefinitely to get a very deep network, and you can use back propagation, the particular term is back propagation through time, to find the values of the weights. Now, recurrent neural networks have similar problems with back propagation as deep neural networks have. One of the very nice ideas people have come up with is to use, instead of simple perceptron units, certain gated units which can act as a memory, where one can store some information so that it stays until you want to use it. So information from long back in the past can be stored using these gated units, and various such units have been used: the LSTM, which stands for long short-term memory, and the GRU, which stands for gated recurrent unit, are some of the units used in recurrent neural networks to enable them to work effectively. Unfortunately, today we do not have much scope in this class to talk about these; this is a somewhat more complex topic which we will

Segment 8 (35:00 - 35:00)

not have time to go into in detail, but I just wanted to give you a glimpse, so that in future you can read about it. Today's entire topic is a little advanced, but we wanted to tell you a bit about it. There are many other very nice deep neural network architectures, like encoder-decoder architectures, and there are models which use external memory. These are very exciting and have been used for solving extremely interesting tasks. I hope that you will take some interest and be able to study them later. Thank you.

More videos by this author — Machine Learning - Sudeshna Sarkar