How autonomous driving can move forward by learning from ChatGPT
Self-driving cars are expected to revolutionize multiple industries, including the transportation of people and goods.
However, developing L4+ autonomous vehicle systems remains a major challenge.
Today, the main bottleneck is the vehicle's ability to safely handle the "long tail" of driving events, that is, the unsafe behaviors that can arise in the many rare situations encountered on the road. Even the world's leading autonomous driving programs have not yet fully solved this problem.
The R&D approach behind ChatGPT, released on November 30, 2022, offers a useful blueprint for how such advanced development could proceed.
ChatGPT's core technology is the generative AI architecture that emerged in 2017, the Transformer, together with the large-scale pre-training subsequently built on it and some newly introduced reinforcement learning algorithms. The Transformer handles sequential data well (it captures contextual and grammatical relationships), and ChatGPT can be described as a "language model optimized for dialogue" that interacts conversationally. It is the most comprehensive generative conversational AI released anywhere to date: its basic logic is to generate dialogue that is logically coherent, produce large amounts of high-quality text in a short time, give relatively accurate answers, and self-correct. Its distinguishing features are effective optimization through dialogue and a good memory that allows it to sustain continuous conversation.
The technical idea behind ChatGPT is consistent with the idea behind cognitive decision-making in autonomous driving: reinforcement learning from human feedback is used to improve how efficiently the algorithm converges on, and stably outputs, the optimal solution. The ultimate goal is for the autonomous driving system to learn the driving behavior of skilled, experienced drivers. This requires continuously feeding human feedback into a large cognitive model, while the autonomous driving system learns to select and discriminate among candidates and stably output the optimal solution.
This article therefore uses the ideas behind ChatGPT to explain how intelligent driving methods and strategies can evolve more efficiently.
Differences in strategy between ChatGPT and intelligent driving system development
The preceding discussion can be summarized as the process by which an autonomous driving system develops its "behavior prediction capability": the core task of autonomous driving is to keep the vehicle safe in complex traffic scenes by efficiently and accurately predicting the uncertain behavior of surrounding objects.
Simply put, ChatGPT = Transformer model + large-scale pre-training + reinforcement learning from human feedback (RLHF). Note that because ChatGPT is purely a learning-based chatbot, the answers it learns can be all over the map, and some may contain incorrect information; during learning and updating it can therefore produce less-than-ideal answers. In autonomous driving development, this kind of learned feedback and output must be extremely accurate, because a wrong control strategy can introduce large errors into vehicle control and ultimately cause driving safety problems. In other words, an autonomous driving system demands a near-zero error rate in the "answers" it learns, and its tolerance for mistakes is very low.
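To make this three-stage recipe concrete, here is a minimal Python sketch of the RLHF pipeline (supervised fine-tuning, reward modeling from rankings, and policy optimization against the reward model). Every class, function, and data structure below is a toy placeholder, not OpenAI's implementation.

```python
# Minimal, illustrative sketch of the three RLHF stages; everything here is a toy placeholder.
import random

class TinyPolicy:
    """Stand-in for a pretrained model: maps a prompt to one of a few candidate answers."""
    def __init__(self):
        self.choice = {}  # prompt -> index of the currently favoured candidate answer

    def answer(self, prompt, candidates):
        idx = self.choice.get(prompt, random.randrange(len(candidates)))
        return candidates[idx]

def train_supervised(policy, demonstrations):
    # Stage 1: supervised fine-tuning on human-written (prompt, candidates, answer) triples.
    for prompt, candidates, human_answer in demonstrations:
        policy.choice[prompt] = candidates.index(human_answer)

def train_reward_model(rankings):
    # Stage 2: fit a reward model from human rankings of candidate answers
    # (here simply a lookup table: higher rank -> higher reward).
    scores = {}
    for prompt, ranked in rankings:
        for rank, candidate in enumerate(ranked):
            scores[(prompt, candidate)] = float(len(ranked) - rank)
    return lambda prompt, candidate: scores.get((prompt, candidate), 0.0)

def train_reinforcement(policy, reward_model, prompts_with_candidates):
    # Stage 3: optimise the policy against the learned reward model (a greedy stand-in for PPO).
    for prompt, candidates in prompts_with_candidates:
        rewards = [reward_model(prompt, c) for c in candidates]
        policy.choice[prompt] = max(range(len(candidates)), key=lambda i: rewards[i])
```

In the autonomous driving analogue, prompts and answers become driving scenes and candidate trajectories, and the human rankings come from driver demonstrations or takeover data.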
Here we first list and compare the differences between the implementation principles of ChatGPT and the learning approach used in autonomous driving, and then explain the corresponding differences from the perspectives of perception and planning/control.
| | ChatGPT basic algorithm: RLHF | Autonomous driving machine learning algorithm: reinforcement learning |
| --- | --- | --- |
| Framework | (framework diagram) The reward model (RM) and the policy model are updated iteratively, so that the reward model describes output quality ever more accurately while the policy model's output moves further away from the initial model, making the generated text increasingly consistent with human cognition. | (framework diagram) The neural network architecture of the ML planning model is inspired by VectorNet: the vectorized information of each agent and map element is encoded by a PointNet network, the local information is combined into a global embedding by a Transformer, and the embeddings are then converted into actions by a kinematic decoder. |
| Training the supervised model | Mass annotation: human trainers sample random prompts and write the expected responses. Supervised training: fine-tune a pre-trained model (such as GPT-3.5), prompt -> human-written answer. Main goal: acquire basic conversational ability. | Model architecture: the training model is built on a hierarchical graph network, consisting of a PointNet-based subgraph that processes local information from the vectorized inputs and a Transformer encoder used as a global graph to reason about interactions between agents and map features. |
| Training the preference model | Mass annotation: random prompts are sampled and human trainers rank the quality of the multiple candidate answers generated. Partial-order training: fine-tune the pre-trained model (such as GPT-3.5), prompt + answer -> score (floating-point number). Preference distillation: human feedback. | Training framework: the intelligent driving system trains a driving policy by imitation learning, mimicking professional driving behavior by minimizing a loss between the model-generated pose and the ground-truth pose. The distribution of states seen during training is widened by adding synthetic perturbations, which reduces the effect of covariate shift. If a pre-solver is used to smooth the target trajectory after the perturbation is applied, the kinematic decoder can be skipped; instead, large jerk and curvature values are simply penalized to reduce jolting and improve ride comfort. |
| Training the reinforcement model | The PPO reinforcement learning algorithm is initialized from the supervised model and trained to maximize the reward fed back by the preference model. | Fallback layer: after a machine learning trajectory is generated, the intelligent driving system evaluates its dynamic feasibility, legality, and collision probability across multiple dimensions and labels the trajectory accordingly. The checks mainly cover dynamic feasibility, legality, and the generation of fallback trajectories. This selection process is, in effect, a preference setting based on the driver's feedback. |

Comparison summary

Same or similar points:
1. Data processing mode: ChatGPT's base training introduces reinforcement learning from human feedback (RLHF); in the reinforcement learning stage, a large amount of manually labeled preference data is fitted so that the large language model aligns with human preferences and gives satisfying, reliable, harmless answers. This mirrors the data closed loop of an autonomous driving system, in which large volumes of human driving data are collected by test vehicles during the development phase and fed into the machine learning models.
2. Learning mode: both ChatGPT and autonomous driving systems (whether for perception or planning and control) go through a supervised learning stage and a reinforcement learning stage, and both stages involve training a supervised model, a preference model, and a reinforcement model; they differ only in how the training is done and what it must satisfy. The training approach for autonomous driving systems is described in detail below.
3. Annotation mode: the common practice in autonomous driving today is to train on manually annotated data. Because huge volumes of training data are involved, manual annotation is clearly cumbersome and inefficient. ChatGPT's current RLHF still relies heavily on this conventional annotation workflow as well, so both involve a large workload. The difference is that RLHF can later evolve toward RLAIF, removing much of the manual labeling by using a "constitution" plus AI to generate labels automatically.

Differences:
The tolerance for training errors in autonomous driving is almost nothing like ChatGPT's. Autonomous driving demands high functional safety and information security, because a single learned error can have unforeseeable consequences.
The application of the Transformer model in ChatGPT implies that the machine vision used for autonomous driving perception needs an algorithmic model that can fully understand contextual relationships, trained effectively in the same way a general language model is trained on large-scale unlabeled data. This places high demands on the variety, quantity, and classification of the collected samples: only when the sample data is sufficiently comprehensive and diverse can the raw-data training bring the model closer to expectations. Finally, manual annotation (or machine annotation, once it reaches a sufficient quality level) is used to filter out the model's optimal solutions.
Within an autonomous driving system, the process above can be understood as a necessary part of the closed-loop data pipeline running on the vehicle or in the cloud. Environment perception in intelligent driving is oriented toward ever larger volumes of data, and this approach, applied through autonomous driving and Internet of Vehicles (IoV) technologies, gives traditionally human-controlled vehicles intelligent processing capabilities, including but not limited to intelligent data collection, analysis, and decision-making. The learning mechanisms behind ChatGPT's efficient data processing (supervised learning, reinforcement learning, and large-scale model training) allow intelligent big-data processing to handle and analyze massive, multi-dimensional data about the vehicle itself, the external environment, and interactive control.
ChatGPT's training process suggests the following. The autonomous driving field also has machine-learned planning, and thanks to the recent success of deep learning, planning based on learned models has attracted attention. The advantage of this approach is that it avoids hand-crafted rules and scales well with data: as more data is collected for training, performance keeps improving.
This approach therefore has great potential to handle a wide variety of driving situations. The methods described below are the classic learning paradigms for autonomous driving.
1. Imitation learning (IL)
IL is a supervised learning approach in which a model is trained to imitate expert behavior. Its first application to autonomous driving was the pioneering ALVINN system in 1989, which mapped sensor data to steering and performed rural road following. More recently, some autonomous driving R&D organizations have demonstrated end-to-end driving using only multi-camera input, but real-world results remain limited to simple tasks such as lane following or urban driving in light traffic. Some studies have also proposed applying IL to a bird's-eye view of the scene and using synthetic perturbations to mitigate the covariate-shift problem, but this has not yet been tested in real-world urban environments.
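As a concrete, simplified illustration of behavioral cloning with synthetic perturbations, here is a PyTorch sketch. The network, feature dimensions, perturbation scale, and "expert" data are all invented for illustration, not a production planner.

```python
# Toy behavioural cloning for planning: predict the next pose from the current state.
import torch
import torch.nn as nn

STATE_DIM, POSE_DIM = 16, 3          # e.g. ego/scene features -> (x, y, heading)

policy = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, POSE_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()               # imitation loss between predicted and expert pose

def perturb(states, sigma=0.1):
    # Synthetic perturbation of the observed state to widen the training distribution
    # (a simplified stand-in for the trajectory perturbations described above).
    return states + sigma * torch.randn_like(states)

# Fake "expert" dataset: random states and poses standing in for logged human driving.
expert_states = torch.randn(1024, STATE_DIM)
expert_poses = torch.randn(1024, POSE_DIM)

for epoch in range(10):
    pred = policy(perturb(expert_states))
    loss = loss_fn(pred, expert_poses)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```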
2. Reinforcement learning (RL)
Reinforcement learning (RL) is well suited to the sequential decision-making of autonomous driving because it models the interaction between an agent and its environment, and it can combine learned and rule-based components to reproduce real driver behavior effectively. Several academic groups have proposed methods that apply RL to autonomous driving. Inverse reinforcement learning (IRL), meanwhile, is another popular machine learning paradigm for autonomous driving; it infers an underlying reward function from expert demonstrations and an environment model. For front-end autonomous driving research, these lines of thinking, inspired by ChatGPT, are effective ways of bringing autonomous driving into the real world.
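For readers less familiar with RL, the following deliberately tiny tabular Q-learning example illustrates the sequential decision-making framing on an invented car-following toy problem; it is far simpler than any of the methods referenced above, and the states, actions, and rewards are assumptions made purely for illustration.

```python
# Toy Q-learning for a car-following decision: keep a safe gap to the lead vehicle.
import random

GAPS = range(10)                     # discretised gap to the lead vehicle; 0 = collision risk
ACTIONS = [-1, 0, +1]                # brake, keep speed, accelerate (positive action closes the gap)
Q = {(g, a): 0.0 for g in GAPS for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(gap, action):
    # Invented dynamics: the gap shrinks when accelerating, grows when braking, plus noise.
    new_gap = min(max(gap - action + random.choice([-1, 0, 1]), 0), 9)
    reward = -10.0 if new_gap == 0 else -abs(new_gap - 5)   # prefer a mid-range gap
    return new_gap, reward

gap = 5
for _ in range(20000):
    a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[(gap, x)])
    new_gap, r = step(gap, a)
    best_next = max(Q[(new_gap, x)] for x in ACTIONS)
    Q[(gap, a)] += alpha * (r + gamma * best_next - Q[(gap, a)])   # Q-learning update
    gap = new_gap
```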
Of course, for the machine learning methods above to work well in the real world and be deployed at scale, some kind of safety net is needed. The machine-learned planning methods introduced above are promising, but they provide no safety guarantees, which hinders their large-scale deployment in the real world. We are inspired by this paradigm, but aim to mitigate this limitation through the SafetyNet concept proposed in this article.
3. Hybrid methods
Combinations of machine learning and traditional motion planning mainly fall into two categories. The first is machine-learning-based heuristics used to improve traditional planning algorithms, for example to speed them up. The second is modular approaches that use an expert planner to generate candidate trajectories; the trajectories are then evaluated against a learned cost volume, and driving safety is enforced by assigning a very high cost to trajectories that would lead to potential collisions. However, these safety guarantees have not been validated in the real world.
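A minimal sketch of the second, modular flavour is shown below, assuming NumPy, an invented learned cost function, and synthetic candidate trajectories: the candidates are scored by a learned cost, with a prohibitively high penalty added for predicted collisions.

```python
# Hybrid planning sketch: expert-planner candidates + learned cost + hard collision penalty.
import numpy as np

def learned_cost(trajectory):
    # Stand-in for an ML cost model (e.g. distance from a reference behaviour).
    return float(np.mean(np.abs(trajectory)))

def collides(trajectory, obstacles, radius=1.0):
    # Simple point-wise distance check against predicted obstacle positions.
    return any(np.min(np.linalg.norm(trajectory - obs, axis=1)) < radius for obs in obstacles)

def select_trajectory(candidates, obstacles, collision_penalty=1e6):
    costs = [learned_cost(t) + (collision_penalty if collides(t, obstacles) else 0.0)
             for t in candidates]
    return candidates[int(np.argmin(costs))]

# Example: three straight-line candidates (T x 2 arrays of x, y) and one static obstacle.
candidates = [np.stack([np.linspace(0, 10, 20), np.full(20, y)], axis=1) for y in (-2.0, 0.0, 2.0)]
obstacles = [np.array([5.0, 0.0])]
best = select_trajectory(candidates, obstacles)   # the centre candidate is rejected as a collision
```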
Another line of reinforcement learning research builds a safety framework for one specific domain. Our goal here is not to propose a comprehensive safety framework, but rather a simple and effective method that allows a powerful neural-network planner to be deployed, one that can keep learning and improving from data while satisfying certain safety and legality constraints.
Planning and control
In the autonomous driving stack, aside from basic perception, the planning module bears the greatest responsibility for the remaining bottlenecks: it decides what the intelligent vehicle should do in any given situation. Traditional rule-based planning selects the trajectory that minimizes a hand-designed loss function. To improve performance, engineers must add new terms to the loss function or re-tune its weights for every driving scenario. This process is expensive and adapts poorly to new regions. Unlike perception, traditional planning and control has benefited little from modern machine learning, which exploits large amounts of data precisely to avoid hand-designed rules.
Trajectory prediction is a key technology for autonomous driving, and new prediction ideas and algorithms keep appearing, especially for object trajectory prediction in complex traffic scenes, which divides into two categories: vehicle trajectory prediction and pedestrian trajectory prediction. ChatGPT's idea of reinforcement learning from human feedback (RLHF) can be tried here by introducing real human driving and takeover data: a machine can learn to classify good and bad outputs, so with RLHF-style training a model can be trained to verify and score the driving model's output, letting it improve continuously until it reaches the level of a human driver.
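As an illustration of how such a preference or reward model could be fitted from human driving feedback, here is a small PyTorch sketch using a pairwise (Bradley-Terry style) ranking loss. The trajectory features and preference pairs are synthetic placeholders, not real takeover data.

```python
# Toy reward model for driving trajectories, trained on pairwise human preferences
# (e.g. "the trajectory the human driver kept" vs "the one they took over from").
import torch
import torch.nn as nn

FEATURE_DIM = 8                               # hand-picked trajectory features (toy)
reward_model = nn.Sequential(nn.Linear(FEATURE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Synthetic preference pairs: (preferred trajectory features, rejected trajectory features).
preferred = torch.randn(256, FEATURE_DIM)
rejected = torch.randn(256, FEATURE_DIM)

for epoch in range(50):
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    # Ranking loss: push preferred rewards above rejected ones.
    loss = -torch.log(torch.sigmoid(r_pref - r_rej)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```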
On this basis, this article classifies and summarizes the mainstream prediction algorithms of recent years for the different prediction targets.
Following ChatGPT's new way of thinking, the machine learning strategy for autonomous driving can learn directly from drivers' actual maneuvers. The automation it inspires (autonomous learning feedback, automatic annotation, and so on) scales far better than hand-engineering, and its ultimate aim is to build a safety network, SafetyNet, into the self-learning process. SafetyNet uses the strengths of an expert system to guarantee determinism, legality, and safety rules in specific scenarios, while relying on the machine-learned motion planner to generate the nominal trajectory.
The figure below shows a typical reinforcement-learning planning scheme. Its input passes through a map (or agent engine) into the safety control network. Inside this network, a machine learning predictor first makes reasonable trajectory predictions based on the previous control state; the corresponding execution trajectory is then fed back through the feedback control layer and into the output layer to form a safe control plan.
The neural network architecture of the machine learning planning model proposed here is inspired by VectorNet. The vectorized information of each agent and map element is encoded by a PointNet point-cloud network. As in the ChatGPT analogy, the local encodings are combined into a global embedding with a Transformer, and the embeddings are then converted into concrete actions by a kinematic decoder.
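To make the described architecture concrete, here is a minimal PyTorch sketch with a PointNet-style per-element encoder, a Transformer over the element embeddings, and a decoder that maps the ego embedding to control actions. All dimensions, layer sizes, and the simplified "kinematic decoder" are illustrative assumptions, not the actual model.

```python
# Sketch of the described planner: per-element PointNet-style encoding, a Transformer
# over element embeddings, and a small decoder producing control actions.
import torch
import torch.nn as nn

class ToyVectorPlanner(nn.Module):
    def __init__(self, point_dim=4, embed_dim=64, n_actions=2):   # actions: (acceleration, curvature)
        super().__init__()
        # "PointNet-style" local encoder: shared MLP over points, then max-pool per element.
        self.point_mlp = nn.Sequential(nn.Linear(point_dim, embed_dim), nn.ReLU(),
                                       nn.Linear(embed_dim, embed_dim))
        # Global interaction module over agent/map element embeddings.
        self.global_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
            num_layers=2)
        # Simplified "kinematic decoder": an MLP mapping the ego embedding to control actions.
        self.decoder = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                     nn.Linear(embed_dim, n_actions))

    def forward(self, elements):
        # elements: (batch, n_elements, n_points, point_dim) vectorised agents/map polylines.
        local = self.point_mlp(elements).max(dim=2).values        # (batch, n_elements, embed_dim)
        global_emb = self.global_encoder(local)                   # element-to-element attention
        return self.decoder(global_emb[:, 0])                     # decode from the ego element

planner = ToyVectorPlanner()
actions = planner(torch.randn(8, 10, 20, 4))                      # -> (8, 2) control commands
```

In practice the decoder would roll the predicted actions through a kinematic model to produce a trajectory; the single MLP above only stands in for that step.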
First, combining the strengths of a machine learning planner with a rule-based system yields an autonomous driving architecture whose safety is explainable, which provides the necessary conditions for deploying such systems safely in production.
Second, the machine learning component is a high-capacity planning policy trained from expert demonstrations, whose performance improves with the amount of training data. To improve system safety, the machine learning planner's decisions are passed through a lightweight fallback layer: a simple rule-based system that runs a small set of checks on each decision and, where necessary, minimally modifies it to improve safety. This lets SafetyNet transparently enforce safety and legality constraints such as collision avoidance and road-rule compliance while maximizing comfort metrics.
Based on the above analysis, we can summarize the following rules that reinforcement learning of planned trajectories should follow:
1. Dynamic feasibility:
Feasibility is evaluated by checking whether the input trajectory stays within the dynamic limits of the intelligently driven vehicle. Specifically, each trajectory state is checked against parameter limits, including longitudinal jerk, longitudinal acceleration, curvature, curvature rate, lateral acceleration, and steering jitter (curvature × speed). The bounds for these parameters are obtained from real-world vehicle testing; in practice, more conservative limits are usually used for jerk, longitudinal acceleration, and lateral acceleration to keep the ride comfortable.
2. Driving legality:
The planned trajectory must also be evaluated for traffic-rule violations. If any violation is found, the trajectory is marked as unavailable.
3. Collision possibility:
Each state in the given trajectory is checked against the predicted poses of the other agents produced by the internal prediction module. Collision detection is performed by rasterizing the predicted future agent positions and checking for overlap with the planned poses. In addition, corrections are made by examining the longitudinal distance along the trajectory, the time to collision, and the time-headway deviation. If any collision check fails, the trajectory is marked as infeasible.
4. Fallback trajectory generation:
If the trajectory generated by the machine learning planner is marked as feasible, the intelligent driving system executes it directly. If it is marked as infeasible, the system instead selects a feasible fallback trajectory that stays as close as possible to the machine learning trajectory.
To this end, a lane-aligned trajectory generator can produce many candidate trajectories τi, covering maneuvers such as speed keeping, distance keeping, and emergency stops, and it can easily be adapted to specific scenarios of interest.
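As a concrete illustration of rules 1 to 4, here is a minimal Python sketch of such a fallback layer, assuming NumPy, invented parameter limits, and a toy state/trajectory convention; it is not the production system described above.

```python
# Minimal fallback-layer sketch: check the ML trajectory against dynamic, legality and
# collision constraints, and fall back to the closest feasible candidate if any check fails.
import numpy as np

LIMITS = {"jerk": 4.0, "accel": 3.0, "lat_accel": 3.0, "curvature": 0.2}  # assumed bounds

def dynamically_feasible(states):
    # states columns (toy convention): [longitudinal accel, jerk, curvature, lateral accel].
    accel, jerk, curvature, lat_accel = states.T
    return (np.all(np.abs(accel) <= LIMITS["accel"])
            and np.all(np.abs(jerk) <= LIMITS["jerk"])
            and np.all(np.abs(curvature) <= LIMITS["curvature"])
            and np.all(np.abs(lat_accel) <= LIMITS["lat_accel"]))

def legal(speeds, speed_limit=15.0):
    # Placeholder legality check: only a speed-limit test here.
    return np.all(speeds <= speed_limit)

def collision_free(traj_xy, predicted_agents_xy, radius=1.5):
    # Compare planned positions with each agent's predicted positions at matching timesteps.
    return all(np.min(np.linalg.norm(traj_xy - agent_xy, axis=1)) >= radius
               for agent_xy in predicted_agents_xy)

def fallback_select(ml_traj_xy, candidate_xys, feasible):
    # Execute the ML trajectory if it passes every check; otherwise pick the feasible
    # fallback candidate closest to it (L2 distance over the whole trajectory).
    if feasible(ml_traj_xy):
        return ml_traj_xy
    ok = [c for c in candidate_xys if feasible(c)]
    return min(ok, key=lambda c: np.linalg.norm(c - ml_traj_xy))
```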
Each generated trajectory is checked for feasibility, and the candidate trajectory most similar to the ML trajectory is selected for execution.
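The selection rule itself is not spelled out here; a plausible form, assuming the feasible fallback candidates τi are compared with the ML trajectory using an L2 trajectory distance, would be:

```latex
\tau^{*} = \arg\min_{\tau_i \in \mathcal{T}_{\mathrm{feasible}}} \lVert \tau_i - \tau_{\mathrm{ML}} \rVert_2
```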
Summary
This article has introduced ChatGPT's reinforcement learning approach of iteratively updating a model until it produces the desired output for the current scenario. An intelligent driving system is likewise adaptive: through continuous self-learning of safe interaction capabilities, it acquires a driving model learned through a reward-based reinforcement mechanism.
In fact, in the continuous learning process of an autonomous driving system, there are methods (such as game-theoretic approaches) that can fully describe this improvement process. If ChatGPT's R&D thinking is applied to autonomous driving and reinforcement learning is used to improve development performance, it can provide an explainable, explicit way to model the dynamic interaction of a human driver controlling the car. At present, most companies implement decision-making for assisted and autonomous driving functions with rule-based approaches, which go a long way toward guaranteeing decision safety across different scenarios.