# Easiest Reinforcement Learning Explanation You'll Ever See! 🤖

## Metadata

- **Channel:** Python Simplified
- **YouTube:** https://www.youtube.com/watch?v=sgsQZmlJbSY

## Contents

### [0:00](https://www.youtube.com/watch?v=sgsQZmlJbSY) Segment 1 (00:00 - 05:00)

Imagine you wake up in a maze. You have no idea how you got there, and no one tells you where to go. You can only move forward, backward, left, and right. What would you do? Welcome to reinforcement learning, where we force AI to learn from trial and error with no teachers, no rules, and zero instructions. But how on earth does it even work? Well, that's exactly what we will learn today. I will show you simple real-life examples of agents, penalties, and rewards. And we will also explore the deep Q-learning algorithm using very basic Python code. So, no scary formulas and no complex math. This video is very much beginner-friendly, and its main purpose is to help you understand the big picture before you move on to all the hands-on exercises. And finally, this video is brought to you by HubSpot, a powerful platform for marketing, sales, and customer relationships. We will talk about them more shortly. So, if you're ready, let's roll.

Imagine a giant maze full of diamonds and gold. It has strict security with electric walls and only one way out. And in this maze, our AI model comes to life. It has no map, no instructions, and no idea what to do next. The only thing it has is a tiny little window called a frame, through which it sees a very limited part of the world. And by world, I'm not talking about the planet or the galaxy, but only the maze where our model lives. Outside of it, there is nothing. So, what would the model do? Well, at first, it will pick a random move. Let's say that it goes to the left. In our case, it will hit a wall and get electrocuted, which is not pleasant at all. So, it will say, "That was a bad move." And then it will try something else. It will go backward and hit another wall. It will go forward and hit a third wall. So imagine the relief when it finally goes right and finds a giant chunk of gold. At this point, our model has made four moves. 
The first three ended up with penalties, while the fourth one brought a very nice reward. Next, since going to the right worked really well last time, our model will probably try to go right once again. But oh no, another wall. So let's try another direction. And let's repeat this process time and time again, so that eventually, after a very long day of exploration, the model will reach the end of the maze. And then we can calculate how many moves, penalties, and rewards it got in total. So let's say that in the first attempt, the model made 150 moves. It collected five diamonds and 10 chunks of gold, and it got electrocuted 55 times. We call this type of start-to-finish attempt an episode. And as you may guess, there are many, many episodes before our model truly understands the maze. So, let's say that in the second episode, our model finished the maze in only 75 moves, but it only collected two diamonds and only five chunks of gold. Not good. We need more treasures. So, let's do it again, and a million times more, until it finally learns how to collect the most rewards in the minimum number of moves.

Okay, but if the model is doing all the work, then what are we, the developers, supposed to do? Our job is to build the world where the model lives, officially known as the environment. The environment is basically a simulation, and it includes a set of actions, or possible moves that the model can take: jump, kick, move in different directions, and so on. We also have a set of states, or different situations that the model finds itself in. Basically, everything the model knows about the world at this exact moment. So, for example, the state includes things like the position of the model in the maze, how many diamonds it has collected so far, or even the last few frames that it saw. And of course, the most important part of our simulation is the agent: our AI model that takes actions in the environment and jumps

### [5:00](https://www.youtube.com/watch?v=sgsQZmlJbSY&t=300s) Segment 2 (05:00 - 10:00)

from one state to the other.

And speaking of agents, have you ever wondered how some people run entire companies completely on their own? Marketing, sales, accounting, and all of that while being a one-person show. Well, one person and a few AI agents. For this, check out the AI Agents Unleashed Playbook for 2025 from HubSpot. It is a free 43-page guide that helps you delegate all the things that we developers aren't exactly famous for, like generating leads, marketing what we have built, or figuring out when it is best to let AI take the wheel. This guide breaks down business operations into simple components, and it shows you how AI agents can enhance them, helping you turn a side project into a profitable startup. My favorite part is the list of common AI agent pitfalls. It covers seven major mistakes, like over-automation or poor data integration, and it shows you exactly what not to do when setting up agents. So download the AI Agents Unleashed Playbook through the link in the description, and a huge thanks to HubSpot for providing it free of charge and for partnering on this video. Now let's go back to the tutorial.

Another responsibility is setting up something called hyperparameters. They define how the agent learns and behaves. So first we have the learning rate, or alpha. It defines how quickly our agent adjusts to new information. So, it's not about how fast it learns, but how strongly it updates itself based on new conclusions. So, for example, you just learned that coffee is bad for your sleep. So, do you cut it out completely? Do you reduce the quantity to maybe one cup a day? Or do you wait for more information before changing your entire routine? Drastic changes mean a high learning rate, and minimal changes mean a low learning rate, somewhere close to zero. We also have epsilon, which defines how curious our agent is, or how often it takes random actions. 
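Going back to the learning rate for a moment, the coffee example can be sketched in a few lines of Python. This is only an illustration and not code from the video: the function name `update_estimate` and the starting habit of four cups a day are assumptions made up for the example.

```python
def update_estimate(old_estimate, new_evidence, alpha):
    """Move the old estimate toward the new evidence by a fraction alpha."""
    return old_estimate + alpha * (new_evidence - old_estimate)


cups_per_day = 4.0    # current habit (hypothetical number)
suggested_cups = 0.0  # new information: coffee is bad for sleep

# Low learning rate: change the routine only a little.
cautious = update_estimate(cups_per_day, suggested_cups, alpha=0.1)

# High learning rate: adopt the new conclusion completely.
drastic = update_estimate(cups_per_day, suggested_cups, alpha=1.0)

print(cautious, drastic)
```

With an alpha close to zero the old habit barely moves, and with an alpha of one the old habit is replaced entirely by the new information.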
So at the very beginning, we start with a high epsilon, close to one, meaning the agent is very curious and it shoots in all directions. But as training goes on, epsilon slowly decreases towards zero. So the agent becomes more confident in its choices, and it relies less on randomness.

Another important hyperparameter is the discount factor. It balances the weight of immediate rewards against future rewards. So for example, if the agent goes right, it gets a cookie right now, immediately. But if it takes 10 more steps in other directions, then it will get five cookies. So the discount factor helps decide if those extra steps are actually worth it.

Great. So if we play with these settings, we can really tune our agent to the simulation, and to our needs of course. But how exactly does it work? We usually start by choosing a learning process, or a Q function, as in quality function. One of the most popular choices for our maze explorer is deep Q-learning. And this is roughly how it works.

So step one, our agent sees a small window of the maze, getting several frames at a time. So if a single frame is like a photo, then combining several frames is more like a tiny video. It helps the agent capture motion and direction and not just static snapshots. And then, together with a few other details, these frames form the state. Then our agent picks a move, sometimes random and sometimes intuitive. Once it has figured out the move, the agent takes action. It updates its point of view, and it receives feedback from the environment. Sometimes the feedback is a reward, other times it's a penalty, but in many cases nothing happens. So no reward, no penalty, just a neutral outcome. The experience of what it saw, what it did, and what it got is then saved in a collection of memories. Now, once every few moves, the agent pauses to evaluate itself, and it asks, "Given my memories, and given my current position and state, how well am I doing?" It compares what it thought would happen with what actually happened. And if the two are too far apart, then it learns from the mistakes and adjusts its thinking. That difference is what we call the error. And minimizing

### [10:00](https://www.youtube.com/watch?v=sgsQZmlJbSY&t=600s) Segment 3 (10:00 - 15:00)

it is how our agent learns and becomes smarter. Now, to calculate the error at any moment in time, our agent has two brains: one that keeps learning and changing, and another one that stays calm and steady. You can think of it as comparing who you are today with who you were yesterday. So, if your yesterday's self is handling the maze better, then what you've learned today didn't make you smarter, and you should probably revert to your older self, your better self. Next, we repeat the process time and time again, looping through all the previous steps of looking around, making a move, getting feedback, storing memories, and of course, evaluating progress. While that happens, our epsilon goes down slowly, meaning our agent is prioritizing experience over random exploration. And then finally, over many, many episodes, the agent's sense of what to do next gets sharper. It finds shorter, safer routes to exit the maze and collects more treasures with fewer shocks.

So now that we know the process, what does it look like in terms of code? Well, the full implementation is quite advanced, but we can imagine it roughly by dividing deep Q-learning into several functions. A classic divide-and-conquer situation. So let's say that we are starting from scratch with a position of zero, no rewards, an empty collection of actions, and only a single frame that we multiply by three, because at this point that's the only frame we have. The next two frames didn't happen yet, so we just duplicate the only frame we've got. We also need an agent with a high epsilon to start with, ensuring that its actions are random. And then finally, we need an empty list where we will store our agent's memories. Then we feed all of it into a function that looks a lot like this. And it repeats for every single frame. Now, after running this function on the first frame, we get a brand new state, and our environment returns feedback, which is basically the result of the action we just took. 
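The on-screen code is not part of this transcript, so here is a minimal sketch of what the setup and the per-frame function described above might look like. Everything in it is an assumption for illustration: the names `initial_state`, `choose_action`, and `step`, the dictionary layout of the state, the table of action scores standing in for a neural network, and the position that simply counts moves.

```python
import random

ACTIONS = ["forward", "backward", "left", "right"]


def initial_state(first_frame):
    """Start from scratch: the only frame we have is duplicated three times."""
    return {
        "position": 0,
        "collected_reward": 0,
        "actions": [],
        "frames": [first_frame] * 3,
    }


def choose_action(action_scores, epsilon, rng=random):
    """Epsilon-greedy choice: random with probability epsilon, otherwise best known."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda action: action_scores.get(action, 0.0))


def step(state, action, new_frame, feedback, memory):
    """Build the new state from the feedback and store the experience."""
    new_state = {
        # Hypothetical stand-in: a real environment would report the position.
        "position": state["position"] + 1,
        "collected_reward": state["collected_reward"] + feedback,
        "actions": state["actions"] + [action],
        # Drop the oldest frame and append the newest one.
        "frames": state["frames"][1:] + [new_frame],
    }
    memory.append((state, action, feedback, new_state))
    return new_state


memory = []    # empty list of memories
epsilon = 1.0  # high epsilon: every action is random at first
state = initial_state("frame_1")
action = choose_action({}, epsilon)
state = step(state, action, "frame_2", feedback=50, memory=memory)
print(state["frames"], state["collected_reward"])
```

Keeping the frames in a sliding window of three means the duplicates of the first frame are pushed out naturally as new frames arrive.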
So if our previous action got us a diamond and it is worth 50 points, then our feedback will be 50, and we will add it to our new state. Next, we move on to frame two and we call our function once again, but this time we pass it our new state. But let's say that this time we got a penalty, and it is worth minus 10 points. Then our feedback will be minus 10, and our collected reward will drop from 50 to 40. Then we move on to frame three, and we finally have enough frames to work with without duplicating or making copies. And let's say that this time our action didn't produce rewards or penalties. So our feedback is zero, and the collected reward remains unchanged.

So now that our agent has collected a few experiences, it is time to learn something from them. So we will add another function responsible for training the agent, or updating its decision process. So now we have an agent and a new agent that we will compare in the next function. Whichever version has the higher score is the one we will keep, and we will let it explore and keep training it further and further. By the end of the process, the agent will find all the possible routes and evaluate which is the most effective.

Now, the best part is that if we train it on all kinds of different mazes, we can then drop the most experienced agent into a brand new maze that it has never seen before. But instead of getting stuck and just wandering around aimlessly, picking some random moves, it will recognize familiar patterns. It will adapt quickly, and it will start finding rewards almost immediately. This is the true power of reinforcement learning. And if you're curious to learn more, please leave me a comment below right now, and I will show you how to do it in the real world. So, not in pseudocode or some fancy slides, but in an actual real-life simulation.

And thank you so much for watching. If you found this video helpful, please share it with the world. 
And don't forget to leave it a huge thumbs up and all kinds of comments. Now, if you'd like to see more videos of this kind, you can always subscribe to my channel and turn on the notification bell. I'll see you soon in an awesome

### [15:00](https://www.youtube.com/watch?v=sgsQZmlJbSY&t=900s) Segment 4 (15:00 - 15:00)

tutorial. So, in the meanwhile, bye-bye.
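As a recap of the learning step from earlier in the video, where the agent compares what it thought would happen with what actually happened, here is a minimal sketch of that error. It is not the code from the video. In the standard deep Q-learning setup, the "calm and steady" brain is a periodically refreshed copy usually called the target network, and this sketch follows that setup. Plain dictionaries stand in for the two neural networks, and the names `td_error` and `learn`, the state names, and all the numbers are assumptions for illustration.

```python
def td_error(online_q, target_q, experience, gamma):
    """Difference between what actually happened and what was predicted."""
    state, action, reward, next_state, done = experience
    predicted = online_q[state][action]
    # The steady brain estimates the best future reward from the next state.
    best_future = 0.0 if done else max(target_q[next_state].values())
    actual = reward + gamma * best_future  # gamma is the discount factor
    return actual - predicted


def learn(online_q, target_q, experience, gamma, alpha):
    """Nudge the learning brain toward the observed outcome by alpha."""
    state, action, *_ = experience
    error = td_error(online_q, target_q, experience, gamma)
    online_q[state][action] += alpha * error
    return error


# Learning brain: current guesses about how good each move is (made-up values).
online_q = {
    "start": {"left": 0.0, "right": 3.0},
    "hall": {"left": 2.0, "right": 5.0},
}
# Steady brain: a frozen copy that is only refreshed once in a while.
target_q = {state: dict(scores) for state, scores in online_q.items()}

# One memory: went right from "start", got 1 point, ended up in "hall".
experience = ("start", "right", 1.0, "hall", False)
error = learn(online_q, target_q, experience, gamma=0.9, alpha=0.5)
print(error, online_q["start"]["right"])
```

Only the learning brain changes during this step. The steady copy keeps its old values, which gives the error calculation a stable reference point.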

---
*Source: https://ekstraktznaniy.ru/video/44541*