Reinforcement Learning: Policy Optimization Introduction. Reinforce to PPO to RLHF #datascience

1:27:43

The Machine Learning Engineer 12.05.2026 66 views

Video description

In this video, we'll explore RL Policy Optimization: REINFORCE from scratch, with math, code, and the connection to RLHF. We'll build from the ground up how REINFORCE works, the policy gradient algorithm that forms the basis of PPO and LLM fine-tuning with RLHF. No prior RL knowledge is required. We'll start with MDP and work our way up to functional code in PyTorch.

**What will you learn?**
▸ What a policy is and why optimize it directly (vs. learning Q-values)
▸ How to model a policy gradient: state, action, reward, path, and discounted return
▸ Why we use stochastic policies and how exploration arises intrinsically
▸ The log-derivative trick: the mathematical insight that makes policy gradients possible without an environment model
▸ The policy gradient theorem and what each term in the central equation means
▸ Value functions: V(s), Q(s,a), and the advantage function A(s,a)
▸ The four variants of the policy gradient and how each reduces variance without changing the estimator
▸ Monte Carlo in REINFORCE: why it is unbiased but noisy, and how to normalize it
▸ The REINFORCE algorithm step by step, with direct mapping to PyTorch code
▸ How the neural network learns the policy episode by episode (weight evolution)
▸ Implicit exploration vs. ε-greedy: why REINFORCE doesn't need manual scheduling
▸ Why policy gradients are the foundation of RLHF and how PPO extends this

**The Repository**
A benchmarking suite for policy optimization algorithms on CartPole-v1. Includes standalone implementations of REINFORCE, A2C, A3C, PPO, and TRPO, a unified orchestrator for comparing all methods, and result aggregation scripts.

📌 Code: [link to repo]

Key files for this session:
- policy_gradient.py: REINFORCE implementation
- policy_gradient_benchmark.py: standalone runner
- run_all_comparison.py: multi-algorithm comparison
- aggregate_results.py: run aggregation
- 03_policy_gradient.md: algorithm reference document

#ReinforcementLearning #DeepLearning #MachineLearning #DistributionalRL #PyTorch #Python

Code: https://github.com/olonok69/Reinforcement_Learning_Policy_Optimization/blob/main/README.md

Table of contents (18 segments)

Segment 1 (00:00 - 05:00)

Hi, welcome to my channel, The Machine Learning Engineer. Today we have another video in the reinforcement learning series, and today's topic is policy optimization. We will do an introduction and see the theory of how these algorithms work in reinforcement learning, and we will also do a demo with the foundational algorithm, REINFORCE, which is based on Monte Carlo. I will explain how these models are the basis of what we know as reinforcement learning from human feedback, although they are used there in some of their variants, because REINFORCE is the foundational algorithm but not the newest one, and it has some weaknesses that I will also explain during the video. These policy optimization algorithms have also been used to fine-tune very well-known models that you probably use, like the GPT models from OpenAI and even [unclear]; I will cover that during this video. I hope you like it. My name is Juan. I'm an assistant engineer with more than 30 years of experience in the IT sector, and for the past 10 years, well, almost 11, I've been working in the AI and machine learning space. OK, let's start with a little recap of what we have seen up to now. This is probably the fifth video of the series. We did a presentation and introduction of what reinforcement learning is, we covered what model-based and model-free mean, and we covered the whole branch of Q-learning algorithms. We will not talk about those anymore, only to make some comparisons with policy optimization.
So, to recap again: in model-free methods we are not given the dynamics of the environment and we don't try to learn them. We are not given a model of the environment and we don't try to learn one; we just try to learn either a policy or Q-values. Imagine you are in an environment which is a city, and you want to go from point A to point B. With policy optimization, the algorithm tries to learn the optimal policy to go from point A to point B, from the start to the destination. This policy is a sequence of states and actions which takes me from point A to point B. In Q-learning, instead of learning the complete policy in a single go, we just try to understand how good it is to take an action in a specific state. Remember that when you go from point A to point B you are moving from state A to state A prime, A double prime, and so on, or A, B, C, D, E, and so on, until you reach the final destination, and you move between states every time you take an action. We will see this now when we recall what an MDP is. So in Q-learning we just try to learn which action is the best one, the one that maximizes the reward, in a specific state of our policy. In both cases we don't try to learn the whole dynamics. Back to the example: we are in a city, and the dynamics of this environment are how the traffic lights behave, when the shops open, and so on. We don't try to learn all these dynamics; we just try to

Segment 2 (05:00 - 10:00)

learn either, with policy optimization, the best or fastest way (depending on your goal) to go from A to B, or, in Q-learning, for each of the specific, sequential steps of our policy, say state number three, which of the actions available in that state is the one that maximizes the reward. We covered these algorithms in the introduction in a previous video. Today we will focus on policy optimization. These algorithms here are purely Q-learning, this branch here is purely policy optimization, and the ones here are mixed algorithms which try to get the best of both worlds, especially DDPG, which is a quite well-known algorithm. On the policy optimization side, you may not be aware of it, but PPO is very popular again today: it has been used as a foundation of reinforcement learning from human feedback and to fine-tune models like the GPT models. TRPO [unclear] is the algorithm which was used to train [unclear]. So let's recap a little what an MDP is. The MDP is, well, the core of how reinforcement learning works.
It has an agent and an environment. The agent takes an action in the environment; the agent is in a state at time t, and through this action it moves from state t to state t+1 and gets a reward, positive or negative. The whole reinforcement learning approach is about the agent learning how to behave in the environment, again either using a model, in the model-based case, or learning a policy: how to go from point A to point B, how to stay alive in a game, how to make more money when you use an agent to trade on the stock exchange, how to [unclear], or how to optimize electric vehicle charging, depending on your objectives. In a trading bot, for example, the model tries to learn a sequence of buys, sells, or positions which leads me to get the maximum amount of money, or to not lose too much money, which could be the same thing or the opposite, but in the end that is the goal I have. In policy optimization we will learn a sequence of buys, holds, and sells on different days, and my objective, my goal, is to get as much money as possible. In Q-learning, on the other hand, we learn by interacting with the environment what is better on day one or on day 10, looking at the variables of the market, the observation. The observation is the input that the agent receives from the environment. We learn what is best to do: buy, sell, or hold. In the

Segment 3 (10:00 - 15:00)

initial example, when an agent wants to go from point A to point B and is in position number 10, it has four actions available: go left, go right, go back, or go forward. Which one is the best? This is what Q-learning tries to learn. And again, all reinforcement learning problems are modeled using the Markov decision process, where we have an agent, an environment, states, actions, and rewards, and there are some dynamics. The agent is in state t, takes one of the actions available in that state, action t, and moves to state t+1. The objective of the agent is to maximize the expected discounted return. The intuition behind the discounted return is that future rewards have less weight than immediate rewards. A reward in 100 days is the expectation that I will win in 100 days, while an immediate reward is the expectation that I will win tomorrow, and in 100 days many things can happen. So the MDP, the Markov decision process, is modeled to give more weight to immediate rewards, while long-term rewards are weighted with a smaller factor. This is the discount factor. We have already covered it, but we will see how it applies to policy optimization algorithms.
So again, the components are state, action, reward, and the discount factor, gamma (sorry, I said lambda, I mixed up the Greek alphabet), which is how much we value future versus immediate rewards. You can see here how the discounted return is applied: gamma is raised to the power of two, three, four, and so on, and multiplied by the rewards you get at the different steps. A trajectory is a complete episode. An episode is the execution of a policy, how I go from point A to point B. State zero is point A, action zero is go forward, and I get reward zero and move to state one; in state one I take action one and get reward one. An episode is a sequence of states, actions, and rewards, each step taking me from state t to state t+1, and this sequence is what we know as a trajectory. If you want, you can draw an equivalence with how we train a neural network, as we also saw in Q-learning: a step is like one forward pass in your neural network, and you can see it is completely equivalent. So a step in reinforcement learning is a step in your training, and an episode or trajectory in reinforcement learning is equivalent to

Segment 4 (15:00 - 20:00)

an epoch. This equivalence makes things quite easy to understand, because we usually approximate these values with a neural network. Then there is the discounted return. As you can see, when we work with reinforcement learning the total reward is calculated with this formula: every step in the future is penalized with a discount factor raised to a power, with consecutive exponents starting from zero. Here the first reward is multiplied by gamma to the power of zero, which is one; the next reward is multiplied by gamma; the one after that by gamma squared, which is something smaller. The discount factor is usually a number between zero and one, lower than one, so every time we raise it to a higher power it becomes smaller. This way future rewards are weighted with a lower value than immediate rewards. Here, for example, gamma is 0.9: the return is the reward of the first step, plus 0.9 multiplied by the reward of the next step, plus 0.9 to the power of two multiplied by the reward of the step after that, plus gamma to the power of three multiplied by the reward I get in step three, and so on. So the intuition of the discounted return formula is, again, that immediate rewards matter more when gamma is lower than one. The return is the training signal, which indicates the quality of the trajectory, and we have two different kinds of return: finite-horizon undiscounted, which is the plain sum and requires fixed-length episodes, and infinite-horizon discounted, which uses gamma for convergence. It depends on how your environment works. You may have a continuing episode, for example you are driving, or you are playing a game where, let's imagine, you just try to shoot multiple enemies or survive and move across the screen.
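The discounted return described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the video's repository: the function names `discounted_return` and `returns_to_go` are hypothetical, and the reward list is made up for the example.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def returns_to_go(rewards, gamma=0.9):
    """Discounted return from every step t onward, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-step episode with a reward of 1 at every step.
episode_rewards = [1.0, 1.0, 1.0]
total = discounted_return(episode_rewards, gamma=0.9)   # 1 + 0.9 + 0.81
per_step = returns_to_go(episode_rewards, gamma=0.9)
```

Setting `gamma=1.0` gives the plain undiscounted sum, which only makes sense for fixed-length episodes.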
So there is no fixed horizon when you calculate this formula, and in this case we apply the discounted version. You can apply the undiscounted version, which is not recommended, when you have a finite horizon, when you know how many steps there are in your environment. This usually doesn't happen, so in practice what we usually do is cut the episode at a number of steps: you didn't complete this environment in a reasonable number of steps, so we simply send a truncation signal, because if you didn't do it in this number of steps you are not going to do it, in simple

Segment 5 (20:00 - 25:00)

words. Then, why do we use stochastic policies? Policy optimization uses stochastic policies. A stochastic policy is essentially a set of probabilities of actions in each specific state. Take the policy for going from point A to point B: we are going to learn, in state number three, say, which of the actions available in that state has the highest probability, the one that is most likely to give me the maximum reward. So the policy is a sequence of probabilities that indicate the action which is most likely to give me the maximum reward over the whole trajectory. Doing it this way, we include exploration during training. Remember that when we were working with Q-learning we had to introduce the epsilon-greedy algorithm, which allowed us to explore. Exploration is trying new things. Exploitation is using what we have already learned. When we train an agent in reinforcement learning, at the beginning the agent doesn't know anything; it starts to interact with the environment. Remember as well that reinforcement learning is a kind of trial and error. At the beginning the agent needs to try a lot of things, and many of them will get a negative reward or be unsuccessful.
Some other things start to give small positive outcomes, and this is how the agent learns. How did we do it in Q-learning? In Q-learning we first set up an approach which is external to the algorithm, a heuristic implemented via random numbers, which decides when the agent explores and when it exploits. Again, exploration is doing something new, so it means sampling at random one of the actions available in that particular state. When the agent already knows something, when it has some information in the table, in the policy, or wherever, it can use this knowledge to choose an action based on the information in my table or in the weights of the network; this is exploitation. In Q-learning we did it in two different ways. One was implementing the epsilon-greedy algorithm: at the beginning of training we explore a lot, and at the end of training we exploit a lot. We also introduced noisy layers, so that the neural network automatically learns when to explore and when to exploit. Here, in policy optimization, we create stochastic policies. Why? Because this way we also learn when to explore and when to exploit, similar to the noisy layers that we used in Q-learning. It also provides smooth gradients for optimization, and it learns preferences, not decisions. For example, the best way to go from A to B is going

Segment 6 (25:00 - 30:00)

through all these states, all these streets, and taking these actions; this is a whole, end-to-end learning. The preference, the best option, is going along this path, and this is what we learn in policy optimization. In Q-learning we just learn what is best to do at a specific point of my trajectory, of my path: turn right, turn left, go forward, go back, that's it. The policy in the case of Q-learning is the aggregation of all these best actions that we learn one after the other; we create the policy by looking for the best individual action at each particular step. In policy optimization we discover the whole trajectory in a single go. Obviously we don't do just a single exploration; we sample multiple trajectories, which is how Monte Carlo works at the end of the day. Some of you may be familiar with Monte Carlo simulation, a very well-known method, particularly in portfolio optimization in finance, where you try multiple combinations in a portfolio and measure which one is the best according to the combination of shares or assets that you hold. How do you do that? Let's imagine that you have Tesla, Microsoft, Google, and Amazon, and you put 25% of each in your portfolio. You measure: I had a sum of dollars divided among these four shares at the beginning; how much more or less money do I have after two years if I simply hold all the assets in the portfolio? Now let's change the weights, for example 50% Tesla, 20 or 30% Microsoft, 15% Google, and the remainder in Amazon. This gives me another number: you won, you lost, and so on. This is how you can build a simple Monte Carlo simulation, just changing the weights of the assets in your portfolio and looking at the value of your portfolio after two years.
That is without buying or selling. You could later introduce a strategy with buys and sells and optimize that too, using the weights of the different stocks in the portfolio, but that is a different topic. Let's keep it simple: you hold everything in your portfolio, you just change the weight of every single asset, and you observe the value after two years without selling or buying, whether you had the same weight for each of the four shares or changed the weights. Say you try 10,000 different weightings; obviously you can go as far as having everything in one single stock, everything in Microsoft and zero in the rest. That is one trajectory; another trajectory is 25% each; another is 50, 30, 15, and so on. You try them all; all of them are trajectories. The concept here is exactly the same. What we are going to do is sample trajectories of our agent, which tries to go from point A to point B. We will measure, then we will try another one and measure again, and this

Segment 7 (30:00 - 35:00)

signal, which is the reward we calculate at the end, is what changes the probabilities of every individual action in each particular state; this is how we are going to train. So we ran this trajectory and got that reward, and in this trajectory the probability of action A was 30%, of action B 20%, of action D 40%, then 40, 60, 90, and 10%. At the end of the trajectory, these probabilities will be modified in backpropagation by an amount that depends on how good or bad the reward was in that specific trajectory or episode. OK. Now, here we have a comparison; I described it a little before. In Q-learning we learn the Q-value: what is the best state-action pair in a specific state? The policy, as you can see, is implicit: it is derived from the value function. We just take the best action using argmax in every particular state, and the policy is derived from all these best Q-values for every state in a trajectory. It works well with discrete actions; in continuous spaces it is quite complicated to use.
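The two ways of choosing an action contrasted here can be sketched side by side. This is a simplified illustration with hypothetical function names and made-up numbers, not code from the video: the first function is the Q-learning style (argmax over Q-values, with epsilon-greedy exploration added from outside), and the second draws the action from the policy's own probabilities, so exploration comes from the policy itself.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """Q-learning style: explore at random with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sample_policy_action(action_probs, rng):
    """Policy-gradient style: draw the action from the policy's probabilities."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]


rng = random.Random(0)  # seeded only to make the sketch repeatable

# Hypothetical Q-values for three actions in one state.
greedy = epsilon_greedy_action([0.1, 0.7, 0.2], epsilon=0.0, rng=rng)

# Hypothetical policy output for the same state: probabilities over actions.
samples = [sample_policy_action([0.3, 0.2, 0.5], rng) for _ in range(1000)]
```

With `epsilon=0.0` the first function always exploits, whereas the sampled actions in `samples` include every action that the policy gives a nonzero probability.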
We need to discretize the action space, which is quite complicated. You can do it; in fact we used these Q-learning algorithms, DQN and so on, in the previous videos with continuous spaces, using C51, quantile regression DQN, or even Rainbow. It usually gives a deterministic policy, and everything in Q-learning is based on the Bellman equation; we already saw that as well. In policy-based methods, as you can see here, we directly optimize the policy to maximize the expected return. The policy is explicit: the neural network outputs action probabilities (remember this for the next slides). It handles continuous and discrete spaces naturally, so it can work with any kind of environment, and the policies are probabilistic. And, as I mentioned before, it is the foundation of reinforcement learning from human feedback: PPO is the reinforcement learning algorithm behind ChatGPT. At the end of the course, my plan is to move on to how to use reinforcement learning from human feedback, and also PPO, for training or fine-tuning models. This is the plan; we'll see how it goes. OK, the policy optimization objective. These are the fundamentals of how this works. We want to maximize the expected trajectory return; this is the formula. We have an expectation over trajectories, and we want to find the parameters that maximize it: the average total reward we get when sampling trajectories from a policy. What does expected mean here? Our policy is stochastic, which means that the same state can lead to different actions. Let's see this example: you are in this state, and you can take different

Segment 8 (35:00 - 40:00)

paths. The environment is also stochastic, so we average over all possible trajectories. The function we want to optimize is the sum, over trajectories, of the probability of each trajectory multiplied by its reward: the probability of this trajectory under these parameters, multiplied by the reward of this trajectory, summed over all of them. So we sum over trajectories the probability of the trajectory multiplied by the return. How are we going to maximize this? Using gradient ascent, which is simply the opposite of gradient descent. In fact, we are going to use the same optimizers that we used in Q-learning, the Adam optimizer on our neural network, and we will just negate the loss in order to maximize. This is the opposite of what we usually do when we train a neural network, where we try to minimize the loss; in this case we try to maximize the objective. Gradient ascent is simply the opposite of gradient descent. We compute the gradients with respect to the parameters, multiply them by a learning rate, and add them to the parameters of our neural network (instead of subtracting them), and we get new parameters. During the forward pass we calculate the action probabilities for each state in the trajectory, and in backpropagation we increase or decrease these probabilities by modifying the weights of our neural network. The analogy you have here is this: you can start from any of these points, and the objective is to get up here. When you train a network for a classification problem, for example, you want to go down to the valley; here we want to go to the peak, the top of the mountain. Our objective is to reach the maximum reward.
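Written out, the objective and the update rule described in words here look roughly as follows. The notation is assumed for illustration (the slides are not visible in the transcript): $\tau$ is a trajectory, $P(\tau;\theta)$ its probability under the policy parameters $\theta$, $R(\tau)$ its return, and $\alpha$ the learning rate.

```latex
% Objective: expected return over trajectories sampled from the policy
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
          = \sum_{\tau} P(\tau;\theta)\, R(\tau)

% Gradient ascent: move the parameters in the direction that increases J
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)

% Equivalent form used with a minimizing optimizer: descend on the negated objective
\theta \leftarrow \theta - \alpha \, \nabla_\theta \big( -J(\theta) \big)
```

The last line is why negating the loss lets a standard minimizing optimizer perform the maximization.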
That's why we are going to use the same approach when we train an agent with policy optimization. Remember that here we don't know the action probabilities; we are going to learn them. At the same time, the dynamics of the environment are also stochastic, and moreover they are unknown. That's why this formula is quite hard, or what we know as an intractable problem: we cannot solve it directly. So how are we going to solve it? Here you can see how this looks for a single state. We have probabilities; for example, after some training steps, in state A action one has a probability of 25% and action two a probability of 75%. That comes from the policy. Then you have the dynamics of the environment: after taking action one, 30% of the time you get a reward of plus one and 70% of the time plus two. After taking action two, which has 75% probability, 80% of the time you get a reward of plus two and 20% of the time minus two. So the expected reward from that state, applying our formula (this is a finite-horizon calculation, which is why we are not applying the discount factor), is 0.25 times the quantity 0.3 times 1

Segment 9 (40:00 - 45:00)

reward plus 0.7 times 2, which is 1.7; plus, for the other branch, 0.75 times the quantity 0.8 times 2 plus 0.2 times minus 2, which is 1.2. In total that is 0.425 plus 0.9, which is 1.325. This expected value is, at this particular step, the reward I should expect under these dynamics. Obviously every state has its own values. During training we compute a sequence of states and probabilities in the forward pass, we compute the rewards, and we increase or decrease our gradients in the backpropagation phase. Remember that this problem is quite hard to solve, because computing the gradient involves the probability of the trajectory, which depends on both the policy and the unknown environment dynamics. The problem itself is what we know as intractable, but the good news is that there is a way to solve it if we look at calculus. The gradient of log f(x) is equal to the gradient of f(x) divided by f(x) itself. We can rearrange this, and once we rearrange it we replace f(x) by its equivalent in our formula.
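For reference, here is the identity and the rearrangement it allows, as a sketch of the standard derivation. The symbols are assumed notation, since the slide formulas are not in the transcript: $P(\tau;\theta)$ is the trajectory probability, $p(s_0)$ the initial-state probability, $p(s_{t+1} \mid s_t, a_t)$ the environment dynamics, and $R(\tau)$ the return.

```latex
% Log-derivative identity and its rearrangement
\nabla_\theta \log f(\theta) = \frac{\nabla_\theta f(\theta)}{f(\theta)}
\quad\Longrightarrow\quad
\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)

% Trajectory probability: initial state, policy, and environment dynamics
P(\tau;\theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)

% Taking the log turns the product into a sum; only the policy depends on theta
\nabla_\theta \log P(\tau;\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)

% Gradient of the objective, free of the unknown dynamics
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
```

The dynamics terms drop out because their gradient with respect to $\theta$ is zero, which is what makes the estimate computable without a model of the environment.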
So the gradient of the probability of this particular trajectory under these parameters is equal to the probability of the trajectory multiplied by the gradient of the log of the probability of that trajectory under those parameters. The trajectory probability is a product: the probability of the initial state, multiplied, for every step, by the policy value for taking action a in state s and by the environment's probability of moving to the next state given that state and action. When we take the log of this product it becomes a sum, and when we apply the gradient to these terms, only the policy depends on the parameters theta. The rest, the initial state probability and the transition probabilities, do not depend on theta, so their gradients are zero and we are left only with the policy term. This means that in the end the gradient of the log of the trajectory probability is the formula that you have here, and the gradient of the expectation is going to be the expectation of the sum of the gradients of the log of the policy, so this is the log-probability of the action

Segment 10 (45:00 - 50:00)

a in state s, multiplied by the return. Both of these are things that we can compute. The gradient is the one we calculate in our neural network during training, when we backpropagate through the network, and the returns we calculate using the formula that we saw before. And here is the key: this is model-free. We don't need a model of the environment, because this is model-free reinforcement learning. So mathematically, using a trick, the log-derivative trick, we can move from an intractable problem to something that we can actually calculate. OK, give me one second. Next we are going to see the foundation of REINFORCE and of all these policy optimization algorithms, which is the policy gradient theorem. It is exactly what we saw here, how we are going to calculate the gradient: 1 divided by N, where N is the number of trajectories that we sample, multiplied by the sum over trajectories of the sum over steps of the gradient of the log-probability of the action given the state at that step of that trajectory, multiplied by the individual return. You can check every component here. This component is the direction that increases the probability of the action. This one is the return, and multiplying by it reinforces these actions: if the return was very high, this gradient is going to be bigger, and vice versa, if it is negative or very small, the update is going to be smaller or in the opposite direction. So depending on which return we get from that specific action in that particular state, we increase or decrease the gradients, and these gradients are backpropagated to the weights of our neural network. And here we simply average over the number of samples to estimate the expectation. So the intuition is that high-return actions become more probable. It's obvious: if we backpropagate bigger gradients for actions that get bigger returns, this increases the probability of that action in that particular state.
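The estimator just described can be checked by hand on a tiny case. The sketch below is a deliberately simplified illustration, not the PyTorch code from the video: it assumes a single state, a softmax policy whose parameters are the logits themselves, and two made-up one-step trajectories. All function names are hypothetical. For a softmax policy, the gradient of the log-probability of the chosen action with respect to the logits is the one-hot vector of that action minus the probabilities.

```python
import math


def softmax(logits):
    """Convert logits into action probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: one_hot - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_estimate(logits, trajectories):
    """(1/N) * sum over trajectories and steps of grad log pi(a_t) * G_t.

    Each trajectory is a list of (action, return) pairs.
    """
    grad = [0.0] * len(logits)
    for trajectory in trajectories:
        for action, ret in trajectory:
            step_grad = grad_log_prob(logits, action)
            for i, g in enumerate(step_grad):
                grad[i] += g * ret
    return [g / len(trajectories) for g in grad]


# Assumed toy data: action 0 earned a return of +1, action 1 a return of -1.
logits = [0.0, 0.0]
trajectories = [[(0, 1.0)], [(1, -1.0)]]
grad = policy_gradient_estimate(logits, trajectories)

# One gradient ascent step with an assumed learning rate of 0.5.
new_logits = [x + 0.5 * g for x, g in zip(logits, grad)]
new_probs = softmax(new_logits)
```

After the step, the action that received the higher return has a probability above 0.5, and the other one has dropped below it.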
We will see this in a second. Low-return actions become less probable, and the gradient points towards policies that make good trajectories more likely. This is exactly what we have discussed. In Python, as we will see later in the code, it is in the end the log-probabilities of every single action in the trajectory multiplied by the individual returns. We have two vectors, one with log-probabilities and the other with returns, which we multiply and sum, and, as I mentioned before, we negate the result because we want to maximize: the optimizer minimizes the loss, so minimizing the negated value maximizes the objective. After calculating it we call backward, so autograd, in our case, computes the gradients, the optimizer takes a step, and the gradients are propagated to our neural network. And you can see here that this is the

Segment 11 (50:00 - 55:00)

foundation of all of these, including PPO, reinforcement learning from human feedback, and the other variants as well: the one that we are going to see today, REINFORCE, which is based on Monte Carlo; A2C, which is actor-critic; PPO, which is proximal policy optimization; and TRPO, which is a variant with a trust region. OK, now we are going to see another key concept which is also central to all of these algorithms: value and advantage functions. Remember, we covered this in previous videos as well. The MDP, the Markov decision process, Q-learning, and policy optimization are all based on different functions which we try to optimize. One is the Q-function: given a policy, which in this case is what we are learning, how good is this action in that particular state? This is exactly what we were calculating in the Q-learning algorithms: the value of that action in that particular state, which we want to maximize. This value that we get from an action in a particular state can be divided into two components. One component is the state value. The state value tells me how good it is to be in that particular state. When you are going from point A to point B, being in the state just before point B, just before the end, obviously has more value than being in the initial state. Why? It's pretty obvious: in one you have a long way to go, and in the other you are going to reach the goal in the next state. The value of being in a state depends on how close you are to the final goal.
And you have another component in the Q-value, which is the advantage function. Essentially the concept here is: how good is it to take this action in this particular state in comparison with the average? The average means how good the actions available in that particular state are on average, and then we ask how good this particular action is in comparison with that average. So this is the advantage function. You can divide the Q-value into the value function and the advantage function, and to calculate the advantage function you just take the Q-value and subtract the state value. The Q-value is what we calculate in Q-learning. If you remember, or if you watched the videos on Q-learning, one of the improvements we introduced for DQN was the dueling DQN. I don't know if I have it here. In dueling DQN we essentially calculate the two components of the Q-value. Why? To understand: OK, this action in that particular state gives me this value, but I want the two components. How good was the state? So give me a number which tells

Segment 12 (55:00 - 60:00)

me how good or bad it was to be in that state, and also give me the other component, which tells me how good this action was in comparison with the average. We did it. Remember, we simply split the output of our network, when we were calculating the Q-value, into two different outputs, and then we aggregated them. So the dueling DQN essentially has two heads: one is the state value, which is a single number, and the other is the value of every single action in comparison with the average. So we have two outputs in the dueling network, and this is exactly what we are going to do here: calculate the Q-value, and the advantage will essentially be the Q-value minus the state value. Why does the advantage matter? Because, as we will see in the next slide, REINFORCE, which is the algorithm that we will cover, uses raw returns. The total return is a signal, and this signal is the one that we propagate to our neural network. It works, but it is too noisy: we do not know whether the action was good in comparison with the rest, we just know a number, which can be big or small. So if, instead of sending the raw return like we do in REINFORCE, we send the advantage, we reinforce our gradients with a cleaner signal, because we remove the unnecessary noise that we include when we train a neural network with the REINFORCE algorithm. We will cover this in a second. The advanced algorithms in policy optimization don't use the same calculation to create the gradients because, again, REINFORCE uses raw returns, which are very noisy and produce a lot of variance. In order to reduce the variance and stabilize the training, all of these algorithms use the advantage function, which gives a clearer signal. In any case, this is just an introduction; we will cover it in detail in the next video, when we cover the variants of policy optimization.
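To make "Q-value minus state value" concrete, here is a minimal sketch in plain Python. The numbers are made up for one two-action state, and the function names are hypothetical; the only assumption beyond the talk is that the "average" is weighted by the policy's action probabilities.

```python
def state_value(q_values, action_probs):
    """V(s): the policy-weighted average of the Q-values in one state."""
    return sum(p * q for p, q in zip(action_probs, q_values))


def advantages(q_values, action_probs):
    """A(s, a) = Q(s, a) - V(s) for every action in one state."""
    v = state_value(q_values, action_probs)
    return [q - v for q in q_values]


# Made-up numbers for one state with two actions (left, right).
q = [10.0, 14.0]
pi = [0.5, 0.5]
adv = advantages(q, pi)
print(state_value(q, pi))  # 12.0
print(adv)                 # [-2.0, 2.0]
```

A positive advantage means the action is better than what the policy does on average in that state, and a negative one means it is worse; the policy-weighted sum of the advantages is zero.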
So how can we reduce the variance? Essentially with these different variants of how we calculate the gradients. This is what REINFORCE is doing: the total trajectory return. This is essentially how the REINFORCE algorithm is formulated in the papers. So the log-probability is multiplied by the total reward that we got from our trajectory, from beginning to end. But what happens here? Let's imagine that you are in a state in the middle of the trajectory. We take this sum, the whole reward of the trajectory, and we use it to multiply the log-probability of every single

Segment 13 (60:00 - 65:00)

action, at each step of that trajectory. Again, that happens before we calculate the gradient. So what exactly is this saying? The past rewards, from before the action was taken, add noise without useful signal. Let's recall a trajectory: you have state one, state two, state three, state four, and so on, up to state 100. You have 100 states; this is your trajectory. In REINFORCE we calculate this R, which is the total reward of the 100 states. When we calculate the gradient for state one, we use the total reward; for state 50, the total reward too. But in state 50, the rewards produced by the 49 previous steps, well, we shouldn't use them. They are not something that this action produced. I don't know if I am explaining myself: I should care only about how the action has an impact from now forward, from this point forward. So why should I be affected by what happened in the previous steps? This is the big problem of REINFORCE: we are overestimating, because we use the total reward of the trajectory. But this is how the algorithm was formulated. As I mentioned at the beginning, this is the foundational algorithm that was designed to solve the policy gradient problem, and it has some weaknesses. One of the ways to reduce this problem is to use the technique called reward-to-go. Reward-to-go is simply this: instead of using the total reward of the trajectory, use only the rewards that happen from this state to the end of the episode. This way the gradient is only affected by what happens from this state to the end, and by nothing that happened before. This way we address this problem of REINFORCE, and this is what we are going to see in our demo. Then there is another improvement, which is baseline subtraction. Maybe you are familiar with time series modeling. In a signal you have a baseline, you have a trend, and you have a seasonal component.
Maybe there is one, maybe there is not, and you also have noise. What baseline subtraction does is simply remove the baseline component. As with a signal in a time series, you remove the common component and are left with the noise plus the seasonal component plus the trend component. So let's imagine that we have a baseline because we took a path; it is already embedded in this path. Then, from the reward obtained from this state to the end of the trajectory, we remove what is common, the baseline, of these remaining steps of the trajectory. It is another way to isolate exactly what was produced by this particular action in that particular state. We continue using only the rewards of the trajectory that matter from that particular state onward, and we also remove what is common to every state that comes after the one we are evaluating. And the latest

Segment 14 (65:00 - 70:00)

improvement, or variant, is what most of the modern algorithms, such as PPO or GRPO, are using: the advantage function. Instead of using the total return, we use the advantage function for that particular state. Instead of using the reward as the multiplier, we use the advantage of that action in that particular state. Remember that the advantage function is how good this action is in comparison with the average of all the actions in that particular state. These four variants are how we reduce the variance of the gradients when we are training a neural network. As you can see, this component does not change; it is exactly the same. The only thing that changes is how we weight this log-probability when we backpropagate the gradients through our neural network. We will see this in the next video. There are more advanced ways to calculate the advantage function, and they are also used in the most modern algorithms, PPO for example. As I mentioned, PPO is the one that GPT uses to train the model. So now we are going to see our example with REINFORCE. In our particular example we are going to use returns with reward-to-go.
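The four variants just described can be summarized in one expression. This is a sketch in standard notation, not a copy of the slide: the log-probability term stays the same and only the weight Ψ_t changes. The symbols Ψ_t and b(s_t), a generic baseline, are labels assumed here for illustration.

```latex
\nabla_{\theta} J(\theta) \approx \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \Psi_t,
\qquad
\Psi_t \in \left\{
  \underbrace{\textstyle\sum_{k=0}^{T-1} \gamma^{k} r_k}_{\text{total return}},\;
  \underbrace{\textstyle\sum_{k=t}^{T-1} \gamma^{k-t} r_k}_{\text{reward-to-go } G_t},\;
  \underbrace{G_t - b(s_t)}_{\text{baseline subtraction}},\;
  \underbrace{A^{\pi}(s_t, a_t)}_{\text{advantage}}
\right\}
```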
So this is the initial step: there we use everything in the trajectory. In G1 we just use what matters after me, and in the final state we just use what is produced by me. Why do we use this? Well, again, Monte Carlo REINFORCE was the foundational algorithm in policy optimization. So how does it work? We run the entire episode to completion and compute the actual returns from the real rewards. There is no bootstrapping and there are no value estimates: pure rewards, no approximations. It has very high variance, so two similar states can produce very different returns due to randomness. It is pure Monte Carlo. The training, as you will see later, is quite unstable. In order to reduce the variance we are going to use normalization: we will use reward-to-go and also normalization, in order to keep the returns normalized and not take very large steps during the training. Here we have the pseudocode of our algorithm: sample an episode; store the log-probability of each action taken; store the rewards; work backwards to compute the returns using the reward-to-go trick; normalize the returns; calculate the loss; backpropagate; and repeat. So the loss is calculated once we finish the episode. Later I will give you all this information in the repository so you can read it. Just before we start: how the policy learns. As I explained at the beginning, this is how the neural network balances exploration and exploitation. So at the

Segment 15 (70:00 - 75:00)

beginning, let's imagine our neural network. In this case we are going to use CartPole. The observation is a vector of four values. Let me remember what these four values are in the CartPole space; it is documented somewhere, one second. I think it is the position of the cart, its speed, the angle of the pole, and its angular speed. I don't remember the four exactly; we will find them. And there are only two actions, left or right: move to the right or to the left. So what is our neural network going to generate? The output of our neural network is a set of logits, one per action. Here is our policy network: the input is the observation, four values; we pass it through a hidden layer of 120; and the output has the number of actions, which in this case is two, because CartPole has only two actions, left or right. So the output after the forward pass is two numbers, the weights, if you like, of the right and the left movement. Then we transform these logits into probabilities via softmax, we create a distribution using these probabilities, and from this distribution we sample an action. With this action, we append its log-probability to the log-probs list, which is where we save the log-probabilities of all the individual actions that we take in this episode. The loss is then calculated from two things: this list of log-probabilities of every single action in our episode, and the individual returns that we got for every individual action in this list. They are two vectors of the same size, one with log-probabilities and the other with returns. We do the multiplication, we negate it, we calculate
the gradient, and we update the network. That is essentially how this works. At the beginning these numbers will be practically random. Then, after some episodes and after backpropagation, we keep modifying these probabilities until we have the final ones. That is how this network implements exploration: through the stochasticity of the training itself, using a stochastic policy and modifying its probabilities. Obviously, at the end one of the numbers is predominant: in the end it is practically 0.99 for one action and 0.01 for the other one

Segment 16 (75:00 - 80:00)

and then, when we sample from that distribution, the action taken is almost always the one with the high probability. You practically never get exactly zero for the other action, but if you have a probability of 99% for one of the actions and only 1% for the other, obviously when you sample from that distribution you will get the same action 99% of the time. That is how we are training our neural network. You can see here a little of the progression: at the beginning they are equal, and then one of the actions predominates. This is REINFORCE, and here is the equivalent of what we did in Q-learning. In Q-learning we used epsilon-greedy. We use a random number and an epsilon variable which decays during the training. At the beginning this epsilon is practically one, and since the random number is between zero and one, obviously there is going to be a lot of exploration. Then, as the training goes through different episodes, epsilon decays to some value below 0.5, and every episode epsilon gets smaller, so there is less exploration and more exploitation. Here, in REINFORCE and in the other algorithms, we calculate these two probabilities, create a categorical distribution from them, and sample from it. At the beginning, in episode one, both actions are practically equally likely. By episode 100 one action is more dominant than the other. By episode 300 it is practically 85/15, and then 95/5, near greedy. Here it is near-random exploration, and here it is exploitation. So during training we are modifying the probabilities from which we sample the action in that particular state, and this is how we are training our neural network. So you can see that here, and also how we will bridge everything to the next episode. Well, let's move to the code. Here, as in the others, we are following the same approach that we used for Q-learning.
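The drift from near-random sampling to near-greedy sampling can be reproduced with a toy example in plain Python. This is only a sketch under loud assumptions: a hypothetical one-state problem where action 0 always pays reward 1 and action 1 pays 0, two bare logits instead of a neural network, and a hand-written softmax gradient instead of autograd. It is not the CartPole code from the repository.

```python
import math
import random


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_policy(steps=500, lr=0.1, seed=0):
    """REINFORCE-style updates on a toy one-state, two-action problem.

    Hypothetical setup: action 0 always pays reward 1, action 1 pays 0.
    Returns the action probabilities recorded at a few checkpoints.
    """
    rng = random.Random(seed)
    logits = [0.0, 0.0]              # start at 50/50, i.e. pure exploration
    history = {0: softmax(logits)}
    for step in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices([0, 1], weights=probs)[0]   # sample, no epsilon
        reward = 1.0 if action == 0 else 0.0
        # For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi_i.
        for i in range(2):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob     # gradient ascent
        if step in (100, 300, steps):
            history[step] = softmax(logits)
    return history


history = train_toy_policy()
for step, probs in history.items():
    print(step, [round(p, 3) for p in probs])
```

There is no epsilon schedule anywhere in this loop: exploration shrinks only because the sampled distribution itself becomes more peaked as the logits move apart.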
We are going to create a benchmark for every individual algorithm, and we will record videos so that later on we can compare the performance of the different algorithms. We have our policy network class, which derives from the Module class in PyTorch. Here we have the input, which is the observation space, in this case four; the action space, which is two, right or left; and the hidden size. In forward we just get the logits from our output. In our case, let me remember, let me find where I have this demo. I can't find it... I think in the benchmark... no, here, OK. So here is the return: this is how we are going to calculate the returns. Remember, in this case we are using the reward-to-go trick, so we calculate only the rewards from that state up to the end of the trajectory. We have this function to calculate this. Here we run the policy gradient; here we just create the configuration,

Segment 17 (80:00 - 85:00)

we instantiate our environment, and we get the observation space shape and the number of actions. Here we instantiate our policy network and an optimizer; remember, we are using the same setup as in any conventional training. Here is our list of episode rewards. Then we enter a loop from the first episode to the number of episodes we defined. We reset the environment and get the state, which is the initial state, and here we initialize the log-probabilities and the rewards. Remember, these are the two components of our loss function. Then we roll out a complete episode. We pass the state through the network to get the logits; remember that the state is a vector of four values. We convert the logits into probabilities here: the probability of left and the probability of right. We convert these probabilities into a categorical distribution and we sample an action from it. Remember that at the beginning the probabilities are practically equal, and during the training we modify them via gradient ascent. Here we take this action and run it in our environment, and we get the next state, the reward, terminated, and truncated. Truncated means we reached the maximum number of steps, which in CartPole is 500. Terminated means the environment itself ended the episode, which in CartPole means the pole fell or the cart left the track. So we continue in this environment until the episode either terminates or reaches the maximum number of steps. We store our log-probability in the log-probabilities list, we store the reward in our rewards list, and then we move to the next step. Here we complete an episode, here we calculate the returns with the reward-to-go trick, and then we convert the returns into a tensor.
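The reward-to-go computation mentioned above can be sketched in a few lines of plain Python. The function name and the default gamma are assumptions for illustration; the repository's own helper may differ.

```python
def reward_to_go(rewards, gamma=0.99):
    """Discounted return from each step to the end of the episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):      # walk the episode backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# CartPole gives a reward of 1 per step; gamma=0.5 keeps the numbers readable.
print(reward_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Walking backwards means each return reuses the one after it, so the whole episode is processed in a single pass.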
We normalize the returns, and here we calculate the loss. As you can see here, it is the negative log-probabilities multiplied by the normalized returns; we sum it all up, and we calculate the gradient and propagate it. Here we just append the episode reward, and here we just print it out. Here, if you want, we record the video that we will use later, and we close the environment. Let's try this in code: python, the policy gradient benchmark, with the option to record the video, the number of episodes to record, the number of episodes, our gamma for the discount to apply, and the learning rate of our neural network. You will see it start to train. I will stop here, and when it finishes I will come back. OK.
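The normalization and the loss described above can be written out in plain Python to show the arithmetic. This is a sketch, not the repository code: the function names are hypothetical, the episode numbers are made up, and the standard deviation here is the population one (dividing by n), which is an assumption; a tensor library may default to dividing by n - 1.

```python
import math


def normalize(returns, eps=1e-8):
    """Standardize returns to zero mean and (roughly) unit variance."""
    mean = sum(returns) / len(returns)
    var = sum((g - mean) ** 2 for g in returns) / len(returns)
    std = math.sqrt(var)
    return [(g - mean) / (std + eps) for g in returns]


def reinforce_loss(log_probs, returns):
    """Negative sum of log-probability times normalized return."""
    weights = normalize(returns)
    return -sum(lp * w for lp, w in zip(log_probs, weights))


# Made-up episode of three steps.
log_probs = [math.log(0.5), math.log(0.6), math.log(0.4)]
returns = [3.0, 2.0, 1.0]
print(reinforce_loss(log_probs, returns))
```

In the real training loop the log-probabilities are tensors attached to the network, so minimizing this loss with an optimizer raises the probability of actions with above-average returns and lowers the rest.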

Segment 18 (85:00 - 87:00)

That's done. You can see here that even with reward-to-go and with the standardization, the training is quite unstable. Nothing special. You probably remember, if you watched the Rainbow video, that we got rewards of 500 with Rainbow. Well, not 500 but 490; the maximum reward in this environment is 500. With Rainbow we practically touched the top, and with REINFORCE you can see it is very poor. Let me see where I have the videos. So we have the videos here, and you can see what we produced. Nothing spectacular; the policy that we learned is quite poor. But it is trying to keep the pole vertical. OK, that's pretty much all. In the next video we will cover the other algorithms, pure policy optimization, these three variants. I will wait for you in the next video. Thank you very much for watching the video to this point. I hope that you found it interesting and useful for your job, for your training, or for whatever reason you are watching this video. So give me a like, subscribe to the channel if you are not already subscribed, and consider becoming a member of the channel; this way you help me continue with this project. You can also help me with Super Thanks, and with your comments, as long as they are polite and not against channel policy. You can also share the video with friends and colleagues; this way you help the YouTube algorithm rank my content in comparison with other content on YouTube. That's all. Thank you very much and see you in the next
