Practical Analysis of Policy Gradient and PPO
This article mainly summarizes the policy-iteration side of reinforcement learning, focusing on the underlying principles. Li Hongyi's (Hung-yi Lee's) lecture videos can be found online.
It finishes with PPO (Proximal Policy Optimization), OpenAI's default reinforcement learning algorithm.
Blue markings indicate the specific code to consult. Unlike value iteration in reinforcement learning, which is relatively easy to understand and express, policy iteration takes more patience, care, and thought.
Policy optimization
The policy π can be represented by a neural network with trainable parameters θ. The policy network takes the current observation (state) as input and outputs the probability of each action.
For the network's output actions: if the action space is continuous, each action is represented by one neuron whose output is a continuous value. If the actions are discrete, each neuron outputs a probability, representing how likely that action is to be taken; during actual interaction, an action is sampled from this probability distribution.
The figure below shows the probabilities of the three discrete actions output when the input state s is given.
In reinforcement learning, the goal of the agent is to maximize the cumulative reward R in the MDP (Markov decision process) model: over the course of each episode, maximize the cumulative reward R, which is the sum of all the per-step rewards r.
In the Markov chain, each trajectory is written τ = {s_1, a_1, s_2, a_2, ...}, and p_θ(τ) is the probability of that trajectory occurring: p(s_1) is the probability of starting in s_1, p_θ(a_1|s_1) is the probability of the policy choosing a_1 in s_1, and p(s_2|s_1, a_1) and so on are the environment's transition probabilities of reaching s_2 from (s_1, a_1).
Along each trajectory τ, every (s, a) pair produces a reward r, and the whole trajectory yields a cumulative reward R.
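Written out in conventional notation (a reconstruction of the formula shown in the figure, not copied from it), the trajectory probability and the cumulative reward are:

```latex
p_\theta(\tau) = p(s_1)\,\prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),
\qquad
R(\tau) = \sum_{t=1}^{T} r_t
```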
For the same policy model π_θ, the environment and the sampled actions are stochastic at every interaction, so different trajectories result in the end.
All of these trajectories are produced by the policy model π_θ and yield different cumulative rewards R. The optimization goal of reinforcement learning is to tune the model so as to maximize the expected cumulative reward.
In the expression below, the expected reward is the sum, over all trajectories, of the cumulative reward of each trajectory multiplied by the probability p_θ(τ) of that trajectory occurring.
In expectation form, trajectories τ are sampled from the distribution p_θ(τ), and the reward is averaged over the sampled trajectories.
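In the same notation, the expected reward being maximized is:

```latex
\bar{R}_\theta
  = \sum_{\tau} R(\tau)\, p_\theta(\tau)
  = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[ R(\tau) \big]
```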
Policy Gradient
The optimization goal is to maximize the expected reward, so we directly compute its gradient and use gradient ascent to drive the expected reward toward its maximum.
In essence, we take the gradient of the expected reward R with respect to θ, where θ is the parameter vector of the policy model (the weights of the neural network, i.e., the trainable parameters); training θ this way eventually yields the maximum expected reward R.
Applying the identity in the blue box (∇f = f ∇log f) gives the second line, and rewriting it as an expectation gives the third line on the left.
After running a sufficiently large number N of trajectories (which are still generated by the policy π_θ), their empirical distribution essentially follows the distribution τ ~ p_θ(τ) appearing under the expectation, so the expectation can be approximated by the sample average.
The fourth line expands the third line on the right by unrolling log p_θ(τ) over the time steps of the trajectory. But isn't the transition probability missing?
It is not: the transition probabilities along the trajectory are determined by the environment, so when taking the gradient with respect to θ they are constants and do not affect the final gradient direction.
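Putting these steps together, the derivation sketched in the figure reads, in conventional notation:

```latex
\nabla \bar{R}_\theta
  = \sum_{\tau} R(\tau)\, \nabla p_\theta(\tau)
  = \sum_{\tau} R(\tau)\, p_\theta(\tau)\, \nabla \log p_\theta(\tau)
  = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[ R(\tau)\, \nabla \log p_\theta(\tau) \big]
  \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)
```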
Therefore, policy gradient training proceeds as follows (the gradient itself has already been derived above).
In an actual implementation, for example when the gradient is computed with TensorFlow, the computation can be expressed through a cross-entropy term.
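As an illustration of that connection (a minimal sketch, not code from the original post; the tensor shapes are assumptions), the per-batch policy-gradient loss can be written as a return-weighted cross-entropy in TensorFlow:

```python
import tensorflow as tf

def policy_gradient_loss(logits, actions, returns):
    """Return-weighted cross-entropy; minimizing it performs gradient
    ascent on E[R * log p_theta(a|s)].

    logits:  [batch, num_actions] raw outputs of the policy network
    actions: [batch] integer actions that were actually sampled
    returns: [batch] cumulative reward assigned to each step
    """
    # Cross-entropy of the sampled action equals -log p_theta(a_t|s_t).
    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    # Weighting by the return gives the policy-gradient loss.
    return tf.reduce_mean(neg_log_prob * returns)
```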
Improvements and corrections to the policy gradient
The policy gradient training procedure:
1. The agent interacts with the environment using the policy model π_θ to obtain a trajectory; once the trajectory ends, the cumulative reward R is computed.
2. Gradient ascent is used to maximize R. In the gradient computation, p_θ(a_t|s_t) is the probability output by the policy network; gradient ascent updates θ, which in turn maximizes R.
Repeat this process so that R grows larger and larger (a code sketch of the loop follows).
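A minimal sketch of this loop, assuming a gym-style environment `env` with discrete actions and a Keras `policy` network that maps an observation to action logits (these names and API details are assumptions, not from the original post):

```python
import numpy as np
import tensorflow as tf

def train_one_iteration(env, policy, optimizer, episodes=10):
    obs_buf, act_buf, ret_buf = [], [], []
    for _ in range(episodes):
        # 1. Interact with the environment using the current policy.
        obs, _ = env.reset()
        ep_obs, ep_act, ep_rew = [], [], []
        done = False
        while not done:
            logits = policy(np.asarray(obs, dtype=np.float32)[None, :])
            action = int(tf.random.categorical(logits, 1)[0, 0])
            next_obs, reward, terminated, truncated, _ = env.step(action)
            ep_obs.append(obs); ep_act.append(action); ep_rew.append(reward)
            obs, done = next_obs, terminated or truncated
        R = float(sum(ep_rew))                 # cumulative reward of the trajectory
        obs_buf += ep_obs
        act_buf += ep_act
        ret_buf += [R] * len(ep_obs)           # every step weighted by the trajectory's R
    # 2. Gradient ascent on E[R * log p_theta(a|s)], written as a minimized loss.
    with tf.GradientTape() as tape:
        logits = policy(np.asarray(obs_buf, dtype=np.float32))
        neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=np.asarray(act_buf, dtype=np.int32), logits=logits)
        loss = tf.reduce_mean(neg_log_prob * np.asarray(ret_buf, dtype=np.float32))
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
```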
Two problems come up when applying the policy gradient in practice:
1. In some reinforcement learning environments, every action along the trajectory receives a positive reward, so according to the formula each update is a gradient increase.
Then, while maximizing R, i.e., while the policy network is being trained, the randomness of sampling means that for the same observation (state), the output action distribution shifts toward whatever was sampled more often.
For the same state, the probabilities of the three actions a, b, c sum to 1, and ideally action b is the best. But if a is sampled more often and b less often, then under gradient ascent with R > 0 the policy network strengthens action a, which lowers the relative strength of action b. Likewise, in the second row of the figure, if a is not sampled enough times, a ends up weaker than it would be in the ideal case.
Therefore, in practice a baseline is subtracted from R so that R can take negative values, and only trajectories whose reward is above the average are reinforced. A negative value means the trajectory is not good enough and its actions should be weakened; the more often they are sampled, the more they are weakened.
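With the baseline b, the gradient estimate becomes:

```latex
\nabla \bar{R}_\theta
  \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n}
      \big( R(\tau^n) - b \big)\, \nabla \log p_\theta(a_t^n \mid s_t^n),
\qquad b \approx \mathbb{E}\big[ R(\tau) \big]
```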
2. The problem of how R is measured. Under the policy network π_θ, the cumulative reward R of the whole trajectory is too coarse to judge the quality of each individual action probability output by π_θ. Each action should be weighted by its own reward, not by the reward R of the entire trajectory.
Within a trajectory, the current action a partly determines the subsequent state-action pairs, and the later a state-action pair occurs, the smaller that influence is. Therefore the reward credited to the current action a is the sum of its own reward and all subsequent rewards, with a decay (discount) factor modelling the fact that the further into the future a reward occurs, the less it should be attributed to the current action.
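A small helper illustrating the discounted reward-to-go described above (the function name and the example values are illustrative, not from the original post):

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of the current reward and all subsequent rewards."""
    out = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # r_t + gamma * G_{t+1}
        out[t] = running
    return out

# Later rewards contribute less to earlier actions:
print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.9))  # [2.71 1.9  1.  ]
```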
Importance Sampling
Importance sampling means that if we cannot sample from the distribution p to estimate the expectation of f(x) under p, we can use another distribution q to obtain that expectation indirectly.
The concrete manipulation is in the second line: in the end, the expectation of f(x) under distribution p is turned into the expectation of f(x) * p(x)/q(x) under distribution q.
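In formulas, the identity behind importance sampling is:

```latex
\mathbb{E}_{x \sim p}\big[ f(x) \big]
  = \int f(x)\, p(x)\, dx
  = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx
  = \mathbb{E}_{x \sim q}\Big[ f(x)\, \frac{p(x)}{q(x)} \Big]
```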
Although the expectations are equal, the variances are not. In the figure below, the first line is importance sampling, the third line expands the variance into expectation form, and the fourth line applies importance sampling again to the first term and substitutes directly into the second.
The conclusion is that if the ratio of the unknown distribution p to the known distribution q used in importance sampling equals 1 everywhere (the two are identical), then the variances are also identical; the further the ratio deviates from 1, the more the variances diverge.
The problem caused by a larger variance is that when the number of samples is insufficient, the estimates on the two sides of the importance sampling identity are more likely to deviate from each other.
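Written out, the two variances being compared are (the only difference is the extra p(x)/q(x) factor in the first term):

```latex
\mathrm{Var}_{x \sim p}\big[ f(x) \big]
  = \mathbb{E}_{x \sim p}\big[ f(x)^2 \big] - \big( \mathbb{E}_{x \sim p}[ f(x) ] \big)^2

\mathrm{Var}_{x \sim q}\Big[ f(x)\, \tfrac{p(x)}{q(x)} \Big]
  = \mathbb{E}_{x \sim p}\Big[ f(x)^2\, \tfrac{p(x)}{q(x)} \Big]
    - \big( \mathbb{E}_{x \sim p}[ f(x) ] \big)^2
```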
An illustration: on the left, the expectation of f(x) under the p distribution, i.e., the sum of f(x)·p(x), is clearly negative. On the right, when too few points are sampled, the samples fall where q(x) is concentrated, which is where f(x) is positive; there p(x)·f(x)/q(x) is small but positive, so the estimate comes out positive. Only when a sample finally lands in the other region, where p(x)·f(x)/q(x) takes a large negative value, is the estimate pulled back to negative, restoring consistency between the two sides of the importance sampling identity.
On-policy to off-policy
On-policy: the process is the same as the policy gradient above. The policy model π_θ interacts with the environment to obtain a trajectory, then trains on that trajectory, and the process repeats.
Off-policy: another policy model π_θ' interacts with the environment to obtain trajectories, and the policy model π_θ learns from those trajectories without interacting with the environment itself.
In the on-policy setting, a trajectory is discarded immediately after being learned from and cannot be reused. This is because once the model has learned from its own trajectory, the model has changed, and the old trajectory is no longer one that the updated model would generate.
In the off-policy setting, the other model π_θ' is used to collect multiple trajectories for training the policy model π_θ. After π_θ learns from them, the trajectories remain usable because π_θ' has not changed.
(The target of learning should ideally be a fixed distribution, not a shifting one.) Compared with on-policy learning, off-policy learning is more efficient because trajectory generation and the learning process are decoupled.
Here θ' corresponds to the model π_θ', and the importance sampling theorem is used to convert the on-policy form into the off-policy form:
The first line is the on-policy form (a rough restatement of the earlier policy gradient, with trajectories sampled from π_θ), and the second line samples trajectories from π_θ' instead and computes the gradient from them.
With enough samples, the two gradients should be essentially the same.
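Applied to the policy gradient, the transformation is:

```latex
\nabla \bar{R}_\theta
  = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[ R(\tau)\, \nabla \log p_\theta(\tau) \big]
  = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[ \frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla \log p_\theta(\tau) \Big]
```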
Below is the concrete expression of the policy gradient, transformed with the importance sampling theorem. The first line is the on-policy form and the second line is the off-policy form. Since the reward term A in the off-policy form is generated by π_θ', it becomes A^θ'. After this change, can the formula still be treated the same as before? Strictly speaking no, but if we accept the approximation, the derivation continues.
In the third line, p_θ(s_t, a_t) denotes the joint probability of s_t and a_t occurring. Following the probability chain rule, it is split into two factors. The second factor's numerator and denominator are assumed to be equal, with a ratio of 1. In reality they are not equal, but since they are hard to compute, they are assumed to be equal.
In the last line, using the blue-box formula in the figure in reverse recovers f(x), which is the objective function J optimized in the off-policy setting (it is essentially analogous to the original objective R).
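In conventional notation, the result of this derivation (using ∇f(x) = f(x) ∇log f(x) in reverse) is:

```latex
\nabla \bar{R}_\theta
  = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\Big[
      \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\,
      A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t) \Big]
\quad\Longrightarrow\quad
J^{\theta'}(\theta)
  = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\Big[
      \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\,
      A^{\theta'}(s_t, a_t) \Big]
```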
PPO vs TRPO
The TRPO (Trust Region Policy Optimization) algorithm was proposed before the Proximal Policy Optimization (PPO) algorithm.
In TRPO, the method actually used is the same on-policy-to-off-policy transformation, but with an additional constraint that θ and θ' remain similar (measured by the KL divergence between the action distributions of the two policies).
In PPO, the similarity term is moved into the optimization objective: the objective should be as large as possible while the KL divergence stays small.
If the KL divergence is too small, θ' and θ are nearly identical and perhaps little can be learned; if the similarity is too low, then because of the importance-sampling variance issue and an insufficient number of samples, the equality between the two sides no longer holds.
Here θ^k indicates that trajectories are generated by successive snapshots θ^k of the policy.
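The PPO objective described here is, roughly:

```latex
J_{\mathrm{PPO}}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, \mathrm{KL}\big(\theta, \theta^k\big)
```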
PPO2, in essence, still controls the similarity of θ and θ'. Instead of KL divergence, however, it uses clipping to keep the ratio of the two within a range: when A > 0 the action is reinforced and when A < 0 it is suppressed, but in both cases only within that range.
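A minimal sketch of the PPO2 clipped objective (the function and argument names are assumptions; the per-step log-probabilities and advantages are presumed to be given):

```python
import tensorflow as tf

def ppo2_clip_objective(log_prob, old_log_prob, advantage, epsilon=0.2):
    """Clipped surrogate objective (to be maximized)."""
    ratio = tf.exp(log_prob - old_log_prob)          # p_theta / p_theta'
    clipped = tf.clip_by_value(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # The min keeps the update inside the clipped range whether A > 0
    # (reinforce the action) or A < 0 (suppress it).
    return tf.reduce_mean(tf.minimum(ratio * advantage, clipped * advantage))
```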
Le vent se lève! . . . il faut tenter de vivre!
This article is reproduced from: https://blog.csdn.net/dafengit/article/details/106073709 Please indicate the source when reprinting