# Python Exercise on Neural Network

## Metadata

- **Channel:** Machine Learning - Sudeshna Sarkar
- **YouTube:** https://www.youtube.com/watch?v=kTbY20xlrbA

## Contents

### [0:00](https://www.youtube.com/watch?v=kTbY20xlrbA) Segment 1 (00:00 - 05:00)

Hello friends, welcome to the tutorial session of the 6th week of this course. I am Anirban. Today's topic is artificial neural networks. As you may already know, artificial neural networks form the foundation of a class of machine learning algorithms called deep learning, and these are currently doing wonders in the field of artificial intelligence. Today I am going to teach you how to build a simple neural network for a simple task, and show you how the performance changes when different changes are made to the architecture and the learning algorithm.

The library that we are going to use for deep learning is called TensorFlow. TensorFlow is an open-source deep learning library; it was open-sourced by Google, and it is state of the art and popular in both academia and industry. It has a large number of resources. A set of wrappers has been developed around TensorFlow, which go by the name of TFLearn. It gives the same scikit-learn style API that I have been using in my previous lectures: you declare the model, you call model.fit to train the architecture, and then you call model.predict to make predictions. That same nice scikit-learn API is present in the TFLearn library. So TFLearn is a set of Python wrappers on top of the generic TensorFlow library, and we will be using it in this session. It will be pretty interesting.

The task that we are going to solve in this session is the recognition of handwritten digits. The dataset is the MNIST database; the name stands for Modified NIST database.
It is a database of handwritten digits. These digits were extracted from postal codes (the PIN codes or ZIP codes that people write on letters): the letters were scanned and the individual handwritten digits were extracted. The task is to automatically identify which digit was written in each particular instance. Here are some examples from the dataset. There are some pretty nasty-looking characters, like this 8, which looks really bad, and this one, which may be a 7 or something, so identifying handwritten digits is not a trivial task. The MNIST database is one of the first datasets people reach for when they want to check the performance of a new learning algorithm that has come to their mind. So this is pretty basic stuff, and I think this exercise will give you a good head start in deep learning algorithms.

Without further ado, let us go ahead and start writing our code. I am writing the code from scratch in an IPython notebook (Jupyter notebook), and I will share this notebook on GitHub and put the link in the description so that you can use it and refer to it later on. The first step is to load the dependencies: import numpy as np, import tflearn, and from tflearn.datasets import mnist. We just go ahead and execute this. Let me zoom in a little bit so that you can see. These three imports are the necessary dependencies, and the execution is complete.

Let us go ahead and load the data. Let us declare mnist_data equal to mnist.read_data_sets. Let me just make sure, yes. So read_data_sets is a function which is there

### [5:00](https://www.youtube.com/watch?v=kTbY20xlrbA&t=300s) Segment 2 (05:00 - 10:00)

within the mnist module that we are invoking; you can look up the structure of tflearn.datasets. It has a number of really good utilities, and this read_data_sets function will read the dataset. If the dataset is not there on your computer, it is going to download it first. Then it does some preprocessing on top of that, brings everything into a format that is easy for machine learning, and returns the dataset in a nice format. We will see the format of the data in a minute. I will just write one_hot equal to True. What does this do? Let us execute and then discuss. It is already done. As you can see, the datasets were already downloaded on my system, so it just read the files, and you have the data here. I will add a new cell and explain.

One-hot targets are targets represented as a vector in which just one of the entries is 1 and all the rest are 0. If you have 10 different digits to identify, then each handwritten digit may be represented by a vector of 10 elements. Say it is an image of a 4: the one-hot vector corresponding to this 4 will be 0 0 0 0 1 0 0 0 0 0, that is, all zeros except for a one at the position for the digit 4 (counting positions from 0). This one-hot representation of targets is useful together with different kinds of loss functions, like the cross-entropy loss, and it is a very popular way of describing categorical targets in the form of a binary vector. It is called a one-hot vector because just one of the elements in the vector is 1: the element corresponding to that particular class. So I set one_hot equal to True over here, and it is going to return the targets in one-hot format. Now let us have a look at the data.
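To make the one-hot idea concrete, here is a minimal sketch in plain Python. The helper names `one_hot` and `decode` are hypothetical and not part of TFLearn; in the session the encoding is done for you by `one_hot=True`.

```python
def one_hot(label, num_classes=10):
    """Return a list with a 1 at index `label` and 0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    vector = [0] * num_classes
    vector[label] = 1
    return vector


def decode(vector):
    """Recover the class index: the position of the single 1."""
    return vector.index(1)


four = one_hot(4)
print(four)          # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(decode(four))  # 4
```

Positions are counted from 0, so the digit 4 lights up the fifth entry of the vector.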
Let us first divide the dataset into training, validation, and test splits. It is organized this way: data_train equal to mnist_data.train, data_validation equal to mnist_data.validation, and data_test equal to mnist_data.test. So we just did the training, validation, and test split. Let us add another cell and have a look at the training data. Let us load it into the variables x and y. x is going to be equal to data_train._images; _images is a field within a class called DataSet. You can look up the structure; I will give the links in the description of the video. This _images attribute gives the inputs, and data_train._labels gives the target labels. So now you have the inputs and the target values in x and y respectively. Let us go ahead and see what the shapes of these entries are: we can do x.shape and see. The MNIST training set has 60,000 images; 55,000 of them are used for training here and the remaining 5,000 for validation, and there is a separate test set of 10,000 images. As for the second number you see in x.shape: each of these images is a 28 by 28 black-and-white image. So, 28 times 28 makes the number 784,

### [10:00](https://www.youtube.com/watch?v=kTbY20xlrbA&t=600s) Segment 3 (10:00 - 15:00)

so that is why you can see that the length of each training input vector is 784, and there are 55,000 of them. They have been organized in the form of this matrix. Let us see what y looks like. For every single one of the 55,000 training images, we have the corresponding label encoded as a one-hot vector. Let us see what y[0] looks like. It is a vector with 10 entries, and this happens to be the image of an eight, sorry, a seven: count the positions as 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. This example was the image of a seven, and hence the position for seven is 1 and the rest are 0. This was just to show you what things look like. Now let me delete this cell; it is no longer needed.

Once we have the training data ready, let us go ahead and make the model. TensorFlow organizes the entire architecture of the neural network in the form of a graph, called the TensorFlow graph. So the first thing we have to do when making a neural network is to initialize a graph by invoking this function: tflearn.init_graph. Here you specify the system resources that will be necessary for this particular exercise. As a matter of fact, deep learning typically relies on a special kind of computing infrastructure known as the graphics processing unit (GPU). The same graphics cards that are used for playing games and for heavy-duty visual rendering are capable of doing matrix multiplications very efficiently, and that is a great help in deep learning applications.
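The loading and inspection steps described so far can be sketched roughly as follows. This is a sketch under assumptions: it needs TFLearn and TensorFlow installed, the function names `load_mnist_splits` and `describe` are hypothetical, and the TFLearn import is placed inside the function so that the file can be read without the library.

```python
IMAGE_SIDE = 28
INPUT_DIM = IMAGE_SIDE * IMAGE_SIDE  # 784 pixels per flattened image


def load_mnist_splits():
    """Load MNIST through TFLearn and return (train, validation, test).

    Calling this function requires TFLearn and TensorFlow.
    """
    from tflearn.datasets import mnist

    # Downloads the files on first use, then reads them from disk.
    mnist_data = mnist.read_data_sets(one_hot=True)
    return mnist_data.train, mnist_data.validation, mnist_data.test


def describe(split):
    """Return the shapes of the inputs and one-hot targets of a split."""
    # The session reads the `_images` and `_labels` attributes.
    x, y = split._images, split._labels
    return x.shape, y.shape
```

Usage would look like `data_train, data_validation, data_test = load_mnist_splits()` followed by `describe(data_train)`; going by the session, the input shape should come out as 55,000 rows of length 784.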
TensorFlow is optimized for GPUs, and you can specify over here how much GPU memory you would like to use in a particular experiment. But my PC does not have a GPU right now, so we will just specify that we are going to use all four cores of my PC. So the compute requirement is number of cores equal to 4.

Next, we go ahead and start making our network. We store our network in a variable called net. First we add an input data layer. It is called tflearn.input_data, and you say what the shape of the input data should be. In this case the shape is a list whose first element is None (I will tell you what it means), followed by 784. The first entry in this shape, None, corresponds to the number of samples that are presented in a particular batch. In the tutorial session I have already talked about batch learning, mini-batch learning, and stochastic learning: three different kinds of gradient descent that you can use.

### [15:00](https://www.youtube.com/watch?v=kTbY20xlrbA&t=900s) Segment 4 (15:00 - 20:00)

So the first term gives the number of elements that you want to take in each batch, and that is variable in this case. That is why you do not specify any value over here; you keep it None, so that the graph can be adapted accordingly. And 784 is the dimensionality of the image that you are going to feed in. In this case we have flattened the 28 by 28 image into a single vector of length 784, so that is why you have this entry. This is your input size.

Next, we are going to add the next layer. This is the first hidden layer: tflearn.fully_connected. Let me check this, perfect. The input will be net. The manner in which you build a network in TFLearn is by adding the layers one by one. There is one network object, called net. You first put an input layer in, then you add the next hidden layer, then another, as deep as you want to go. Finally, you have an output layer, and then you go ahead and declare your loss function and the optimization algorithm and set it to train. So we pass this net object as an input to the function, and it is going to add a particular layer of the given size. The number of units in the layer is going to be 100, and the activation function of the nodes of the layer is going to be sigmoid; sorry, let us make it ReLU first, the rectified linear unit. It has already been covered in class what rectified linear units are. It is one of the activations best suited to deep neural networks. It works pretty well, and there is a lot of theory about why the rectified linear unit works so well. Now let us go ahead and add the output layer. So this is a single-hidden-layer neural network, and we will play with it and see.
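What a fully connected hidden layer with ReLU computes can be illustrated in plain Python. This is only an illustrative sketch, not how TFLearn implements it: the function names, the random weights, and the random "pixels" are all made up for the example.

```python
import random


def relu(values):
    """Rectified linear unit: negative inputs become 0, others pass through."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per unit."""
    return [
        sum(w * x for w, x in zip(unit_weights, inputs)) + b
        for unit_weights, b in zip(weights, biases)
    ]


random.seed(0)
n_inputs, n_hidden = 784, 100
pixels = [random.random() for _ in range(n_inputs)]  # stand-in for one image
weights = [
    [random.gauss(0.0, 0.05) for _ in range(n_inputs)] for _ in range(n_hidden)
]
biases = [0.0] * n_hidden

hidden = relu(dense(pixels, weights, biases))
print(len(hidden))  # 100
```

Each of the 100 hidden units sees all 784 inputs, which is why the layer is called fully connected.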
The number of units in the output layer is 10, because you have 10 output classes and the output labels come in the form of one-hot vectors. The activation here is going to be softmax. The softmax layer is a kind of logistic regression: multiclass logistic regression is called softmax. You can look it up on the web and figure out what it looks like. With this, the network is complete.

Now we have to define what kind of loss function we want to use to optimize the parameters of the neural network, and what kind of update algorithm we want to use. So we add another layer. This is not an actual layer of the neural network; rather, it is a specification of the learning algorithm options. It is called regression. The regression layer does either a linear regression or a logistic regression. The first input, as always, is net, and then you have to specify the loss function. The loss is equal to categorical cross-entropy; you can look up on the web what categorical cross-entropy actually is. In a few words, categorical cross-entropy is a loss function that tries to match the probability distribution modeled by the neural network to the probability distribution of the actual observed data samples. The categorical cross-entropy loss is closely related to the KL divergence, a

### [20:00](https://www.youtube.com/watch?v=kTbY20xlrbA&t=1200s) Segment 5 (20:00 - 25:00)

statistical distance measure, and it directly tries to match the data distribution and the model distribution in the course of training. You can look it up and read more about this, but it is outside the purview of this course. We will also specify an optimizer, and let it be stochastic gradient descent. Now we have defined what kind of learning algorithm and what kind of loss function will be used to train the neural network.

Now we define our model. Here the scikit-learn style API comes in again, so see how beautiful it is: we are just going to write tflearn.DNN. You have already defined what kind of neural network you want and how it should be trained. Now you initialize this tflearn.DNN; it is like a scikit-learn LinearRegression or something like that, a machine learning model. And then you train this model. I will leave this part in this cell and make a new cell for training the model. So it is all compiled.

What did we do here? We first declared what kind of compute resources we are going to use. Then we initialized the structure of the neural network and said what the different layer sizes and the different activations should be. Then we said what kind of loss function and what kind of update rule should be used to learn the weights, and then we declared the model. Now it is time to train; I already made another cell for this. We are now going to do model.fit, as we always do. The model is already declared, and we train it on x and y, the inputs and targets of the training data. And now you specify a couple more options.
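The softmax output and the categorical cross-entropy loss mentioned above can be written out in a few lines of plain Python. This is a minimal sketch with hypothetical function names, not TFLearn's implementation. For a one-hot target the entropy of the target is zero, so the cross-entropy equals the KL divergence between target and prediction.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


target = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]              # one-hot label for a 7
confident = softmax([0, 0, 0, 0, 0, 0, 0, 5, 0, 0])  # most mass on class 7
uniform = softmax([0] * 10)                          # no preference at all

print(categorical_cross_entropy(target, confident))  # small loss
print(categorical_cross_entropy(target, uniform))    # about log(10)
```

The loss is small when the network puts most of its probability on the correct class and grows as the prediction drifts away from the one-hot target.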
You specify that the number of epochs of training, n_epoch, is going to be 1. What is an epoch? It is a very common concept in machine learning. When you do gradient descent, you have a bunch of parameters whose optimum values you want to find, and what you do is show the learning algorithm the same training examples time and again. Say you have one million training examples. In the first epoch, you show all the training examples to the neural network, and it does some updates. In the second epoch, you shuffle all the training examples and push them through again. It has been shown that this helps to break the sequence: when you present the training examples in a particular sequence to a neural network, the network might actually memorize that sequence, and the order in which the training examples appear may be reflected in the values that the weights take. To break that sequence, we randomize the samples and pass the same training set over and over again. The more you train, the better the model's fit to the training data. I am going to keep n_epoch equal to 1 just to show you that everything works, and it also makes the learning process take a little less time.

Next is the batch size. As you are doing stochastic gradient descent, the batch size is going to matter. It is the number of examples that you want to show per

### [25:00](https://www.youtube.com/watch?v=kTbY20xlrbA&t=1500s) Segment 6 (25:00 - 30:00)

update, that is, after seeing how many examples you want to do one update. Let us keep batch_size equal to 10. Another option is show_metric: as training proceeds, it is going to show us the value of the accuracy. Let us go ahead and see what happens. Notice here that the training accuracy is increasing and the loss is decreasing. It has divided the entire training set into batches of size 10, and it is pushing every single batch through and then updating, so one update happens after every batch has been processed. You can see that the number of examples that have been shown to the neural network is being counted over here. When all the 55,000 examples have been shown, one epoch is complete and the training is complete, and it reports the final accuracy over here.

Let us go ahead and see how things change when we increase or decrease the number of nodes in the hidden layer. Previously, for a neural network with one hidden layer of 100 nodes, the accuracy was just 69.5 percent. We are now decreasing the number of nodes, and we will see how things change. I will restart and run all. Now you can see that the accuracy is increasing slowly, because the model's learning capacity has been reduced: the number of parameters is reduced, so the capacity to learn is also reduced. However, the training is progressing much faster. You can see that 30,000 examples are processed in a flash, because as the number of parameters has reduced, the computation has become faster. What happened? Oops, let me restart and run all again; it does not take much time.
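How epochs, shuffling, and batch size fit together can be sketched without any deep learning library. The generator name `minibatches` is hypothetical, and the loop only counts updates rather than training anything; the numbers (55,000 examples, batch size 10, one epoch) are the ones used in the session.

```python
import random


def minibatches(num_examples, batch_size, rng):
    """Yield lists of example indices covering one shuffled epoch."""
    order = list(range(num_examples))
    rng.shuffle(order)  # a fresh random order for every epoch
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]


rng = random.Random(0)
num_examples, batch_size, n_epoch = 55000, 10, 1

updates = 0
for epoch in range(n_epoch):
    for batch in minibatches(num_examples, batch_size, rng):
        # A real trainer would compute the loss on `batch` and
        # apply one weight update here.
        updates += 1

print(updates)  # 5500 updates in one epoch
```

With a batch size of 10, one pass over the 55,000 training images therefore amounts to 5,500 weight updates.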
We can expect that the accuracy will be lower this time. Let us see. You see, for a while the accuracy would not increase beyond some 26 percent, and this is just the training accuracy, because the model's learning capacity has been reduced. Now it has gone to about 35 percent; see, 44 percent, where previously it was 69 percent. As the learning capacity was reduced, the neural network could not learn well, and hence the accuracy is lower.

Let us go ahead and restore the number of hidden units to the previous value of 100, and let us add another hidden layer and see how things change. Previously it was 69 percent. Let us restart and clear the output, and now run all. Let us see how the accuracy changes this time. Now the training is a bit slower, because the size of the network has increased. We can hope that this time the accuracy will be a bit higher, but it is not certain at all; it need not necessarily be higher, and that is where

### [30:00](https://www.youtube.com/watch?v=kTbY20xlrbA&t=1800s) Segment 7 (30:00 - 35:00)

the entire trick comes in. It was really unfortunate that on increasing the number of hidden layers, the accuracy actually fell. This is actually the tricky part of deep learning: it was seen that deep neural networks with many hidden layers often failed to arrive at a good solution. You can see that the accuracy is 12 percent, so it just got stuck in a bad local minimum and could not come out.

So let us go ahead and make a small change to the optimizer over here. There is an optimizer called Adam. I will put a link in the description of the video on what it actually does, but it is a sophisticated optimization algorithm, and we can hope that it does a better job and gives us a better accuracy. Let us see. Look at this: it has started with an accuracy of 89 percent. What the Adam optimizer does is use a different learning rate for every single parameter, since every parameter may need to be treated differently in the course of the optimization. And you can see the accuracy: it has touched 95 percent, 97 percent. So it has a different learning rate for each parameter, and the learning rate also changes with time. That is called annealing, and it has a separate annealing scheme for every single parameter. I will put a link to the paper in the description. And you see there, the accuracy is 96.44 percent.

So what is your takeaway from this exercise? The takeaway is that when we have a single-hidden-layer neural network, the more units there are in the hidden layer, the more complicated the model's hypothesis function is, and hence the greater its capacity to learn. Given enough data, it can learn better, and that is why the 100-unit hidden layer gave an accuracy of around 69 percent.
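The per-parameter learning rate that Adam uses can be illustrated with a small plain-Python sketch of the standard Adam update rule. The toy objective and the function name `adam_step` are assumptions chosen for illustration; TFLearn's own optimizer is not shown here.

```python
import math


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place and return the new step count."""
    t += 1
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g      # running mean of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m[i] / (1 - beta1 ** t)            # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return t


# Toy problem: minimize f(x, y) = x^2 + 100 * y^2.
# The two gradients differ in scale by a factor of 100.
params = [1.0, 1.0]
m, v, t = [0.0, 0.0], [0.0, 0.0], 0
for _ in range(2000):
    grads = [2 * params[0], 200 * params[1]]
    t = adam_step(params, grads, m, v, t, lr=0.01)

print(params)  # both coordinates end up close to 0
```

Because each gradient is divided by its own running magnitude, the steep and the shallow coordinate move at a similar pace, which plain stochastic gradient descent with a single learning rate does not do.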
When the number of units was reduced to, say, 10, by contrast, the accuracy dropped to 29 or something; it was in the twenties. Now, the complexity of the model can be increased in two ways. Either you increase the number of nodes in a single hidden layer, so you have just one hidden layer and you put many nodes in it, or you stack many hidden layers one after the other. What we saw here is that when we increased the number of hidden layers, simple stochastic gradient descent failed to converge to a good local minimum. The training accuracy was just 12 percent, so it is fair to say that it did not learn at all; it just got lost somewhere. When you have two hidden layers of 100 units each, you have a lot of parameters, and that is why you need to choose a very good optimization algorithm: an optimization algorithm that gives special care to every single parameter and makes sure that these parameters are tuned properly in the course of training, so that we finally arrive at a good optimum. It may not be the global optimum, which may never be reached at all, but it should be a good local optimum. That is why we used the Adam optimizer. I have told you in a few words what it does, and I will link to the paper so that you can read it and get the full details. When we used Adam, the performance improved a lot, and we got 96 percent accuracy. The current state of the art on the MNIST dataset is something like 0.227 percent error, and it is beyond human abilities. So machines today can recognize handwritten characters better than human beings.
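The whole experiment can be summarized in one hedged sketch. `count_parameters` is a hypothetical helper that makes the "capacity" argument concrete by counting weights and biases; `build_model` is a hypothetical wrapper around the TFLearn calls named in the session and needs TFLearn and TensorFlow installed to run, so the import is kept inside the function.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected net, e.g. [784, 100, 10]."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def build_model(hidden_layers=(100,), optimizer="sgd"):
    """Build the TFLearn network from the session (requires TFLearn)."""
    import tflearn  # imported here so count_parameters works without it

    tflearn.init_graph(num_cores=4)
    net = tflearn.input_data(shape=[None, 784])
    for units in hidden_layers:
        net = tflearn.fully_connected(net, units, activation="relu")
    net = tflearn.fully_connected(net, 10, activation="softmax")
    net = tflearn.regression(
        net, loss="categorical_crossentropy", optimizer=optimizer
    )
    return tflearn.DNN(net)


# The three architectures tried in the session.
print(count_parameters([784, 100, 10]))       # 79510
print(count_parameters([784, 10, 10]))        # 7960
print(count_parameters([784, 100, 100, 10]))  # 89610
```

Assuming `x` and `y` hold the training inputs and one-hot targets, the last run of the session would correspond to `model = build_model((100, 100), optimizer="adam")` followed by `model.fit(x, y, n_epoch=1, batch_size=10, show_metric=True)`, run in a freshly restarted kernel as in the video.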
You can go to the MNIST Wikipedia page, which gives a list of algorithms that are the state of the art in this task, but they use much

### [35:00](https://www.youtube.com/watch?v=kTbY20xlrbA&t=2100s) Segment 8 (35:00 - 36:00)

more advanced concepts like convolutional neural networks and others, which you can explore yourself. The motivation behind introducing you to TFLearn and its parent, TensorFlow, is just to give you a head start in that direction. If you are enthusiastic about deep learning, if you really want to check out what the magic is all about, how machines are becoming as intelligent as human beings and beating human beings at their own tasks, and how deep learning actually works, then this is a very good tool which you can use for implementing your own algorithms. There are a lot of already-implemented algorithms and a lot of resources available for this library, and Google is actively developing it. So you can use it and explore the possibilities of deep learning in your own field. I hope you enjoyed this video, and I recommend that you take this further. Bye, see you in the next video.

---
*Source: https://ekstraktznaniy.ru/video/29418*