Chapter 2 of "Deep Reinforcement Learning in Action": Greedy Strategy and Softmax Selection Strategy
(Last week I struggled to read, write, and post, and recently I seem to have been infected again. My nose and throat are uncomfortable, and I feel cold and dizzy as soon as I sit down at the desk; I go through at least half a wastebasket of tissues every day. I haven't been able to sleep through the night lately, and my hair has been falling out in clumps. My efficiency has dropped and I really feel powerless. There are no nucleic acid tests or test kits these days, so I don't know whether this is a leftover of the past three years or one of the respiratory illnesses going around recently. I felt a bit better this afternoon and cleaned the house; looking at the dustpan, I'm afraid a fifth of my hair has fallen out over the past two weeks. Only those who have been through it know how it feels. I sincerely hope my colleagues don't take it lightly. / Miserable)
In what follows, I will usually refer to an "action" as a "choice", and to a "value" as an "expected value"/"anticipated value", "return value"/"reward value", and so on.
A "neural network" is a machine learning model composed of multiple "layers" that perform matrix-vector multiplications and then apply a nonlinear "activation" function, ReLU, or Rectified Linear Unit. The matrices of a neural network are the learnable parameters of the model, often referred to as the "weights" of the neural network.
Anything that can be called a line graph must be viewable in detail at any level of abstraction and must remain type-compatible, meaning that the data types entering and leaving each process must be compatible and sensible: a process that produces an ordered list should not be connected to another process that expects an integer as input.
2.1 Greedy Strategy
An expected value (threshold) is set in advance; whenever the reward of the currently selected action falls below this expected value, one of the options is chosen at random instead. After each selection, the running average reward of the chosen option is updated:

u_new = ((k - 1) * u_old + x) / k = u_old + (x - u_old) / k

k: the number of times the option has been selected
u: the average return value of the option
x: the return value of the latest selection
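In code, this incremental mean update is a one-liner (my own sketch, using the symbols k, u, x from above):

```python
def update_mean(u, x, k):
    # Incremental running mean: u is the old average return,
    # x the latest return, k the number of selections so far.
    return u + (x - u) / k

# e.g. after returns 1.0, 0.0, 2.0 the running mean is 1.0
u = 0.0
for k, x in enumerate([1.0, 0.0, 2.0], start=1):
    u = update_mean(u, x, k)
print(u)  # 1.0
```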
By selecting repeatedly, we eventually obtain the average reward of every option and can finally pick the option with the highest average reward. The drawback is slow convergence (low efficiency): during random exploration every option is equally likely to be chosen, so the strategy keeps re-selecting options whose average returns are already low, which wastes selections and slows down learning.
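Below is a minimal simulation of this kind of greedy strategy on a toy 10-option bandit (my own sketch, not the book's code). The number of options, the reward distributions, the threshold value, and the assumption that the best-average option is kept whenever the latest return is not below the threshold are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n_arms, n_steps = 10, 2000
threshold = 0.5                            # the pre-set expected value
true_means = rng.normal(size=n_arms)       # hidden expected return of each option
u = np.zeros(n_arms)                       # running average return per option
k = np.zeros(n_arms, dtype=int)            # how often each option was selected

a = int(rng.integers(n_arms))              # start from a random option
for _ in range(n_steps):
    x = rng.normal(true_means[a], 1.0)     # observed return of the chosen option
    k[a] += 1
    u[a] += (x - u[a]) / k[a]              # incremental mean update from above
    if x < threshold:                      # return fell below the expected value:
        a = int(rng.integers(n_arms))      # explore a randomly chosen option
    else:
        a = int(np.argmax(u))              # otherwise keep the best average so far

print("option with the highest average return:", int(np.argmax(u)))
```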
2.2 Softmax Selection Strategy
Compared with the greedy strategy, the Softmax selection strategy introduces a weighting coefficient (the temperature in the formula below). By adjusting this coefficient, later choices can be steered, as far as possible, toward the options whose average rewards have been relatively high so far, because those options are assigned a higher selection probability (the Softmax probability) during learning, which greatly improves the convergence speed.
However, the selection probabilities are computed according to:

Pr(A) = exp(Q_k(A) / τ) / Σ_i exp(Q_k(i) / τ)

Pr: the vector of selection probabilities over the choices
Q_k: the choice-expected value function (the current average returns)
τ: the temperature parameter that scales how sharply the distribution is peaked

Because of τ, the Softmax selection strategy has to be adjusted through human intervention to obtain useful differences in Pr, which limits a given Softmax selection strategy to a specific model or scenario.
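As an illustration (again my own sketch, not the book's code), softmax selection over a set of current average returns can be written as follows; the values in q and the temperature tau = 0.7 are arbitrary assumptions:

```python
import numpy as np

def softmax_probs(q, tau):
    # Pr(i) = exp(q_i / tau) / sum_j exp(q_j / tau)
    z = q / tau
    z -= z.max()                          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
q = np.array([0.1, 0.5, 0.2, 0.9])        # current average returns Q_k of four options
probs = softmax_probs(q, tau=0.7)         # higher-valued options get higher probability
a = int(rng.choice(len(q), p=probs))      # sample the next choice from Pr
print(probs, "->", a)
```

A smaller tau makes the distribution more peaked around the best option (closer to pure greedy selection), while a larger tau makes the choices closer to uniform; that is exactly the parameter that has to be tuned by hand.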