Reinforcement learning algorithms can explore autonomously and collect their own samples, but because the policy is largely random early in training and most samples are invalid, the contribution of successful samples to the network weight updates is easily drowned out, leading to low sample efficiency and even failure to converge. For fixed target positions and starting positions with different heading angles, one approach pre-trains the agent with manually recorded parking control sequences (in this article, the terms vehicle, agent, and algorithm model are used interchangeably), so that the agent obtains high-return samples without exploration in the early stage. Another approach stores failed and successful exploration experiences separately and varies the sampling ratio with the number of training episodes, so that the agent always learns from successful samples. The Monte Carlo tree search method used in AlphaGo has also been applied to generate parking data, with the reward function used to evaluate data quality and select the best data for training, avoiding the influence of the low-quality data produced by random exploration. Prioritized experience replay uses the TD error as the sample priority and stores samples in a SumTree data structure, so that samples contributing more to the gradient are more likely to be drawn. In research on decision-making and control for high-speed intelligent driving, the exploration strategy has been divided into a lane-keeping strategy and an overtaking-and-obstacle-avoidance strategy, and a correction term based on the improved policy is added to the original action to reduce invalid exploration.
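As an illustration of the separate success/failure experience pools with an annealed sampling ratio mentioned above, the following is a minimal Python sketch; the class name DualReplayBuffer, the capacity, and the ratio schedule are assumptions for illustration, not the implementation from any of the cited works.

```python
import random
from collections import deque

class DualReplayBuffer:
    """Separate pools for transitions from successful and failed episodes."""

    def __init__(self, capacity=100_000):
        self.success = deque(maxlen=capacity)   # transitions from successful parking episodes
        self.failure = deque(maxlen=capacity)   # transitions from failed episodes

    def store(self, transition, episode_succeeded):
        (self.success if episode_succeeded else self.failure).append(transition)

    def sample(self, batch_size, episode_idx,
               start_ratio=0.8, end_ratio=0.2, anneal_episodes=2000):
        # Linearly anneal the fraction of the batch drawn from successful samples,
        # so early updates are dominated by high-return experience.
        frac = min(episode_idx / anneal_episodes, 1.0)
        ratio = start_ratio + (end_ratio - start_ratio) * frac
        n_succ = min(int(batch_size * ratio), len(self.success))
        n_fail = min(batch_size - n_succ, len(self.failure))
        batch = random.sample(self.success, n_succ) + random.sample(self.failure, n_fail)
        random.shuffle(batch)
        return batch
```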
In reinforcement learning, the design of the reward function directly determines whether the model can converge. For robot path planning, a reward function containing only collision penalties and a terminal reward for reaching the goal is a sparse-reward problem. When agents are trained with sparse and dense rewards respectively, the results show that the agent trained with dense rewards achieves a higher parking success rate. For the path planning of indoor mobile robots, a reward of -0.05 is given when the robot exceeds the specified time, preventing it from staying in place out of excessive caution. The reward can also be set to the negative of the distance between the vehicle's current position and the target position, which guides the vehicle toward the target while urging the agent to arrive as soon as possible. In addition, some researchers have improved the convergence of deep reinforcement learning algorithms from the perspective of the training procedure. Based on curriculum learning, convergence is accelerated by gradually adding obstacles during training. Curriculum learning, proposed by leading researchers in machine learning, essentially sets a series of lessons from easy to difficult for the model based on prior knowledge in order to accelerate convergence. A fixed-heading-angle discretized training method trains the condition with a heading angle of 30° first and then, after convergence, gradually expands to initial heading angles of 0°~90°, which coincides with the easy-to-difficult idea of curriculum learning.
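The dense-reward ideas above can be sketched as follows; the collision and success magnitudes, and the way the time penalty is applied, are illustrative assumptions rather than values taken from the cited studies.

```python
import math

def dense_reward(x, y, target, collided, reached, over_time):
    """Illustrative dense parking reward combining the shaping terms described above."""
    dist = math.hypot(x - target[0], y - target[1])
    r = -dist            # negative distance to the target guides the vehicle toward it
    if over_time:
        r -= 0.05        # small penalty once the specified time is exceeded
    if collided:
        r -= 100.0       # collision penalty (assumed magnitude)
    if reached:
        r += 100.0       # terminal bonus for successful parking (assumed magnitude)
    return r
```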
In summary, automatic parking path planning algorithms based on deep reinforcement learning still have shortcomings. During training, the agent's learning efficiency is low and convergence is slow. Reinforcement learning requires the agent to interact with the environment under its current policy to obtain the samples needed for learning, and the quality of those samples in turn affects the policy update; the two are interdependent, so the algorithm easily falls into local optima. Compared with robots, cars are nonholonomic systems with coupled lateral and longitudinal motion, and the parking space is small, so for given initial conditions the feasible parking paths and control sequences are very sparse. To reduce the learning difficulty, a common practice is to fix the starting pose during training and to relax the parking-space constraints, but the resulting agent has weaker planning ability than traditional planning methods and cannot meet the practical requirements of automatic parking. Effectively addressing these shortcomings would help advance automatic parking methods based on deep reinforcement learning. In the following, the automatic parking motion planning method based on deep reinforcement learning is first introduced and then improved with convergence and stability in mind. The agent is trained on a purpose-built simulation platform, and its performance is then analyzed and evaluated from multiple angles, including robustness, planning ability, and safety.
1 Model establishment
1.1 Deep reinforcement learning algorithm model
Reinforcement learning is modeled as a Markov decision process: based on the current state s, the agent selects an action a, and the environment returns a reward r and the next state s′; through repeated trials the agent learns the optimal policy. The deep deterministic policy gradient (DDPG) algorithm, built on the Actor-Critic framework, extends the deterministic policy gradient (DPG) method with techniques from the DQN (deep Q-network) algorithm, including the dual-network (target network) structure and the experience replay buffer, and has achieved good results on many problems.
In traditional reinforcement learning, value-based methods use a table to record all action values, but for continuous state spaces the number of states is enormous and a tabular method leads to the curse of dimensionality. Therefore, a neural network is used to approximate the action value Qπ(s, a):

Q(s, a; w) ≈ Qπ(s, a)    (1)

where w represents the weights of the neural network.
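For concreteness, a critic network Q(s, a; w) of the kind described by equation (1) might be implemented in PyTorch as below; the layer widths and the state/action dimensions are assumptions.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Approximates the action value Q(s, a; w) with a small fully connected network."""

    def __init__(self, state_dim=5, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, s, a):
        # Concatenate state and action, output a scalar Q value per sample.
        return self.net(torch.cat([s, a], dim=-1))
```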
Similarly, parking demands very high control accuracy, which discrete actions cannot provide. Therefore, the policy is also approximated by a neural network (the actor network), as shown in equation (2):

a = μ(s; θ)    (2)

where θ represents the weights of the actor network. This equation describes a mapping from the state space to the action space and outputs the best action for a given state. Ornstein-Uhlenbeck noise (OU noise) is added to the network output to increase exploration in the early stage of training.
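A corresponding sketch of the actor network in equation (2) and of the OU exploration noise follows; the hidden-layer sizes and the OU noise parameters are common defaults assumed here, not values from the paper.

```python
import numpy as np
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the state to a target steering wheel angle in [-540, 540] degrees."""

    def __init__(self, state_dim=5, action_dim=1, max_angle=540.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),   # output in [-1, 1]
        )
        self.max_angle = max_angle

    def forward(self, s):
        return self.net(s) * self.max_angle           # scale to steering-wheel-angle range

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise for smooth exploration."""

    def __init__(self, dim=1, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.ones(dim) * mu

    def sample(self):
        self.x += self.theta * (self.mu - self.x) + self.sigma * np.random.randn(*self.x.shape)
        return self.x
```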
Following DQN, the loss function of the critic network is defined as follows:

L(w) = E[(r + γQ(s′, μ(s′; θ); w) − Q(s, a; w))²]    (3)

From formula (3) it can be seen that the gradient update of the critic network depends on the action computed by the actor network and on the target Q value computed by the critic network itself, while the gradient update of the actor network depends on the Q value computed by the critic network. The correlation between the two networks, and between the target value and the current value, is too strong, which makes the algorithm unstable. To reduce this correlation, a copy is created for each network, namely the target critic network Q′ (weights w′) and the target actor network μ′ (weights θ′), which are used to compute the target action and the target value. The improved loss functions of the current critic network and the current actor network are shown in formula (4), where the gradient update of the current critic network now depends on the action computed by the target actor network and the target Q value computed by the target critic network:

L(w) = E[(r + γQ′(s′, μ′(s′; θ′); w′) − Q(s, a; w))²]
J(θ) = −E[Q(s, μ(s; θ); w)]    (4)
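A hedged PyTorch sketch of the updates implied by formula (4) is given below: the critic's TD target is built entirely from the target networks, and the actor is updated to maximize the current critic's value. The batch layout, shapes, and optimizer objects are assumptions.

```python
import torch

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    # batch: tensors s (B, state_dim), a (B, action_dim), r (B, 1), s_next, done (B, 1)
    s, a, r, s_next, done = batch

    # Critic loss: TD target computed entirely from the target networks (formula (4)).
    with torch.no_grad():
        a_next = target_actor(s_next)
        y = r + gamma * (1 - done) * target_critic(s_next, a_next)
    critic_loss = torch.nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor loss: ascend the current critic's Q value (descend its negative).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```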
The target networks are updated by slowly tracking the current networks (soft update), as shown in equation (5):

α′ ← τα + (1 − τ)α′    (5)

where α denotes the current network weights, α′ the target network weights, and τ ≪ 1 the soft-update rate. In the short term the target value can be regarded as constant, similar to a sample label in supervised learning, which greatly improves the stability of learning.
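The soft update of equation (5) is a short loop over the network parameters; the value of tau below is an assumed typical choice.

```python
def soft_update(current_net, target_net, tau=0.005):
    # Target weights slowly track the current weights: w' <- tau*w + (1 - tau)*w'
    for p, p_targ in zip(current_net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```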
1.2 Vehicle kinematic model
Under low-speed parking conditions, tire sideslip is neglected, and the nonlinear state space model of the vehicle is given by equation (6):

ẋ = v cos θ
ẏ = v sin θ
θ̇ = (v / L) tan δ    (6)

where x and y are the coordinates of the midpoint of the vehicle's rear axle, the heading angle θ is the angle between the vehicle's longitudinal axis and the x-axis, (x, y, θ) is the vehicle's pose, v is the linear velocity of the midpoint of the rear axle, L is the wheelbase, and δ is the front wheel steering angle.
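For simulation, equation (6) can be integrated with a simple forward-Euler step; the time step dt below is an assumption.

```python
import math

def kinematic_step(x, y, theta, v, delta, L, dt=0.05):
    """One Euler integration step of the kinematic bicycle model in equation (6)."""
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += (v / L) * math.tan(delta) * dt
    return x, y, theta
```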
1.3 Definition of reinforcement learning elements
The state of the vehicle at a given moment must be clearly distinguishable, able to characterize the relationship between the vehicle and the environment, and preferably related to the control quantity. The vehicle's x and y coordinates, heading angle θ, steering wheel angle sw, and the minimum distance d between the vehicle body and surrounding obstacles are selected as the state s = [x, y, θ, sw, d].
When parking, lateral control is more important than longitudinal control, so the longitudinal speed is set to a constant value. The action a is defined as the target steering wheel angle sw_target at the next moment, with a range of [-540°, 540°]. At the same time, to ensure ride comfort and avoid excessive changes in the steering wheel angle, the change in steering wheel angle is limited to 20 (°)/Δt when it is input into the vehicle kinematic model.
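The angle and rate limits described above might be applied as follows; the function name and the use of simple clipping are assumptions for illustration.

```python
def apply_steering_action(sw_current, sw_target, max_angle=540.0, max_delta=20.0):
    """Clip the commanded angle to [-540, 540] deg and its per-step change to 20 deg."""
    sw_target = max(-max_angle, min(max_angle, sw_target))
    change = max(-max_delta, min(max_delta, sw_target - sw_current))
    return sw_current + change
```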
Based on the PyTorch framework, the algorithm block diagram of the automatic parking simulation platform was established, as shown in Figure 1. The framework consists of three parts. The first is the interaction between the agent and the environment: the agent (i.e., the vehicle) determines the target steering wheel angle a from the current vehicle state s, OU noise is added, the result is input into the vehicle kinematic model, and the next state s′ is calculated and returned to the agent; this repeats until the vehicle collides with an obstacle or completes parking. The second is the storage of samples in the experience pool: after each interaction, the reward function calculates the reward r from the next state s′, and the tuple (s, a, r, s′) is stored in the experience pool. The third is the training of the agent: a batch of data is randomly sampled from the experience pool, the loss functions of the current critic network and the current actor network are calculated and minimized by stochastic gradient descent, and the parameters of the target critic network and target actor network are updated by soft update.
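Putting the pieces together, the three parts of Figure 1 could be organized as in the loop below, reusing the Actor, OUNoise, DualReplayBuffer, ddpg_update, and soft_update sketches given earlier. The env object stands in for the parking simulation (kinematic model plus reward function), and its step() return signature is an assumption, so this is an outline rather than the authors' implementation.

```python
import numpy as np
import torch

def train(env, actor, critic, target_actor, target_critic,
          actor_opt, critic_opt, buffer, noise, episodes=3000, batch_size=128):
    for ep in range(episodes):
        s, done, transitions, success = env.reset(), False, [], False
        while not done:
            # Part 1: interaction -- actor output plus OU noise drives the kinematic model.
            with torch.no_grad():
                a = actor(torch.as_tensor(s, dtype=torch.float32)).numpy()
            a = a + noise.sample()
            s_next, r, done, success = env.step(a)   # episode ends on collision or successful park
            transitions.append((s, a, r, s_next, float(done)))
            s = s_next
        # Part 2: storage -- transitions go to the success or failure pool by episode outcome.
        for t in transitions:
            buffer.store(t, episode_succeeded=success)
        # Part 3: training -- sample a batch, update networks, soft-update the targets.
        if len(buffer.success) + len(buffer.failure) >= batch_size:
            batch = buffer.sample(batch_size, ep)
            s_b, a_b, r_b, s2_b, d_b = (torch.as_tensor(np.array(x), dtype=torch.float32)
                                        for x in zip(*batch))
            ddpg_update((s_b, a_b, r_b.unsqueeze(1), s2_b, d_b.unsqueeze(1)),
                        actor, critic, target_actor, target_critic, actor_opt, critic_opt)
            soft_update(actor, target_actor)
            soft_update(critic, target_critic)
```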