"Deep Reinforcement Learning in Action" Reading Deep Q Network [Copy link]

This post was last edited by dirty on 2023-11-18 23:24

In this article, we will learn about deep Q-networks. My notes are recorded below.

To introduce the deep Q-network, we start with the example of the Gridworld grid game. The following concepts are defined.

State: the information the agent receives and uses to decide what action to take.

Policy: the strategy the agent follows for choosing an action when it receives a state.

Reward: the feedback the agent gets after taking an action; taking the action also produces a new state.
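As a rough illustration of how these three concepts interact, here is a minimal sketch of the agent-environment loop in Python. The env interface (reset/step) and the action names are hypothetical placeholders, not the book's exact API.

import random

ACTIONS = ["up", "down", "left", "right"]

def run_episode(env, policy):
    # The environment hands the agent a state; the policy picks an action;
    # the environment returns a reward, the new state, and a done flag.
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

# A trivial policy that ignores the state and moves randomly.
random_policy = lambda state: random.choice(ACTIONS)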

The weighted sum of the rewards obtained by following a certain policy starting from the starting state s1 is called the state value. It is represented by the value function below, which receives an initial state and returns the expected total reward:

V_π(s) = w_1·R_1 + w_2·R_2 + w_3·R_3 + ... + w_t·R_t

where the coefficients w_1, w_2, etc. are the weights given to the rewards before summing.
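As a small worked example, assuming the common choice of exponentially decaying weights w_t = γ^t (which matches the discount factor γ introduced below):

def state_value(rewards, gamma=0.9):
    # Weighted sum of the rewards along one trajectory, with weights gamma**t.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(state_value([0, 0, 0, 10]))   # 10 * 0.9**3 = 7.29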

The main idea of Q-learning is that the algorithm predicts the value of a state-action pair, compares that prediction with the cumulative reward observed later, and updates its parameters so that it makes better predictions next time.

The Q-learning update rule is as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

The parameters γ and α are called hyperparameters: they affect how the algorithm learns but are not themselves learned.

The parameter α is the learning rate, a hyperparameter used in training many machine learning algorithms. It controls how fast we want the algorithm to learn from each step: a small value means that the algorithm will only make small updates at each step, while a large value means that the algorithm may make larger updates.

The parameter γ is called the discount factor, which is a variable between 0 and 1 that controls the degree to which the agent discounts future reward values when making decisions.
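A minimal tabular sketch of this update, with α and γ playing the roles described above (the dict-based Q table is an illustrative simplification, not the deep-network version used later):

from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> current value prediction

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    # Target: observed reward plus the discounted best predicted value of the next state.
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    # Move the current prediction a small step (alpha) toward that target.
    Q[(state, action)] += alpha * (target - Q[(state, action)])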

The main problem we encounter when training the model in random mode has a name: catastrophic forgetting. The idea is that when two game states are very similar but lead to very different outcomes, the Q-function gets "confused" and fails to learn what to do.

In random mode, we need to deal with catastrophic forgetting, which is why we implement experience replay. Experience replay essentially allows batch updates in an online learning scheme.
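A minimal experience-replay sketch (the buffer and batch sizes here are illustrative choices, not the book's exact values):

import random
from collections import deque

replay = deque(maxlen=1000)   # oldest experiences are dropped automatically

def remember(state, action, reward, next_state, done):
    replay.append((state, action, reward, next_state, done))

def sample_batch(batch_size=64):
    # Training on a random mini-batch of past transitions breaks the correlation
    # between consecutive states, which is what counters catastrophic forgetting.
    return random.sample(replay, batch_size)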

The other problem is learning instability: since rewards may be sparse, updating the Q-network at every step can cause the algorithm to start behaving erratically. The idea behind the fix is to compute the training targets from a duplicate of the main Q-network that is updated only periodically, which reduces the influence of the most recent updates on action selection and improves stability.

The target network is just a lagged copy of the main DQN and is used to stabilize the update rule when training the main DQN.
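A sketch of keeping and periodically refreshing that lagged copy in PyTorch (the layer sizes and sync interval are illustrative, not the book's exact values):

import copy
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(64, 150), nn.ReLU(), nn.Linear(150, 4))
target_net = copy.deepcopy(q_net)        # lagged copy of the main DQN

SYNC_EVERY = 500
for step in range(10_000):
    # ... compute training targets with target_net and update q_net here ...
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())   # refresh the lagged copy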

Summary:

This chapter mainly explains the components of the Q-learning formula and their meanings, as well as the problems encountered during model training and how to resolve them. The code involved is commented, so you can read along with it to understand the algorithm.

Latest reply (2023-11-19 07:58):
Thank you for your hard work. Thank you for sharing such good technology.
