1. What Is Deterministic Execution
Deterministic execution means that, given the same input data, a computation produces consistent output within a bounded time. In other words, its behavior is reproducible.
It has two main attributes:
Predictability: the time window in which a function executes must be known exactly.
Reproducibility: the same input must produce the same output.
There are three kinds of determinism:
1. Time determinism;
2. Data determinism;
3. Full determinism.
Time determinism means that the output of a computation is always produced before a given point in time. The program cannot run indefinitely; it must have a time limit.
Data determinism means that, given the same input and internal state, the computation always produces the same output. The output depends only on the input and that state, not on how the execution happens to proceed.
Full determinism is the combination of time determinism and data determinism.
Let's look at time determinism first. It requires that every time a computation starts, its result is available within the specified time. Two conditions must be met to guarantee this:
1. Sufficient computing resources must be guaranteed, such as processor time, memory, and service response time.
2. A violation of the deterministic assumption must be treated as an error, and a recovery operation must be initiated. For example, when the configured time limit is exceeded, a recovery operation must be started and the violation reported to Platform Health Management (PHM) as an error.
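The second condition can be illustrated with a minimal Python sketch. The helper `run_with_budget`, the exception `BudgetExceeded`, and the `report_error` callback (standing in for a report to PHM) are hypothetical names chosen for illustration, not AUTOSAR APIs. The clock is injectable so that the check itself can be exercised deterministically.

```python
import time
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a computation overruns its time budget (hypothetical type)."""


def run_with_budget(task: Callable[[], object],
                    budget_s: float,
                    report_error: Callable[[str], None],
                    clock: Callable[[], float] = time.monotonic) -> object:
    """Run `task` and treat any overrun of `budget_s` seconds as an error.

    The overrun is only detected after the task returns; a real platform
    would also need a watchdog for tasks that never return.
    """
    start = clock()
    result = task()
    elapsed = clock() - start
    if elapsed > budget_s:
        # Deterministic assumption violated: report it, then signal recovery.
        report_error(f"deadline missed: {elapsed:.3f}s > {budget_s:.3f}s")
        raise BudgetExceeded()
    return result
```

A caller would catch `BudgetExceeded` and start whatever recovery action the system defines.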
Data determinism means ensuring that the output depends only on the input data.
It can be achieved in the following ways:
1. Hardware lock-step: two separate execution paths run simultaneously, and their results are compared to verify consistency.
2. Software lock-step: the software is executed multiple times, in parallel or sequentially, and the results are compared.
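As a rough illustration of the sequential variant of software lock-step, the following Python sketch runs the same function several times on the same input and compares the results. `run_in_lockstep` and `LockstepMismatch` are hypothetical names, and a real framework would run the replicas in separate processes rather than in one interpreter.

```python
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class LockstepMismatch(Exception):
    """Raised when redundant executions disagree (hypothetical type)."""


def run_in_lockstep(func: Callable[[T], R], data: T, replicas: int = 2) -> R:
    """Execute `func` sequentially `replicas` times on the same input and compare."""
    results = [func(data) for _ in range(replicas)]
    first = results[0]
    if any(r != first for r in results[1:]):
        # A difference means the output did not depend on the input alone.
        raise LockstepMismatch(f"results differ: {results!r}")
    return first
```

A function with hidden state, such as one that counts its own calls, fails this check, which is exactly the kind of behavior data determinism rules out.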
Full determinism means that time determinism and data determinism must both be guaranteed. Note that AP AUTOSAR currently specifies full determinism only within a single Machine. Determinism across multiple Machines has not yet been specified, so the use of full determinism is currently limited to one Machine.
2. Why Deterministic Execution
Why use deterministic execution? Mainly because of functional safety requirements.
For example, highly automated driving systems need to meet ASIL-D.
To meet ASIL-D, such systems need to take specific measures, in particular software lock-step, because the hardware currently available is rated no higher than ASIL-B.
In addition, the transient hardware error rate of high-performance computing (HPC) platforms is very high, so the system must provide predictability and reliability. Deterministic execution ensures that software behavior is predictable and reproducible.
So when a system has functional safety requirements and runs on this kind of HPC platform, deterministic execution should be considered.
Before CP AUTOSAR 4.3 and AP AUTOSAR 1810, the AUTOSAR standard contained nothing on determinism in its concepts or development process. To support safety requirements, content on determinism was added from CP 4.3 and AP 1810 onward.
Next, we take a closer look at determinism in AP AUTOSAR.
3. AP & Deterministic Execution
In AP, the only module involved in deterministic execution is Execution Management (EM). The following figure gives an overview of how deterministic execution interacts with Execution Management:
In the figure above, a user process in the application layer calls the deterministic client API to achieve deterministic execution.
The deterministic client is a C++ class provided by Execution Management. EM also provides classes for the execution client. This article focuses on the deterministic client and does not discuss the execution client in depth.
Deterministic execution in AP is implemented mainly through the deterministic client class, which provides the following:
1. Control of the process-internal cycle
2. A deterministic worker pool
3. Activation timestamps and random numbers
The deterministic client typically works together with software lock-step to ensure that redundant processes behave identically.
That is, for each user process there is a second, redundant user process, and both must execute the same content. A software lock-step framework is needed to keep execution consistent between the user process and the redundant process. The framework is shown in the figure below:
In the figure above, the user process is at the top and the redundant user process is at the bottom. The black lines are the data flow. Data exchange between processes is based on ara::com, and ara::com interacts with the software lock-step framework to synchronize the input data of the user process and the redundant process.
The user process and the redundant process also interact with the software lock-step framework to ensure that they produce the same output.
Finally, the outputs of the two processes are compared. Both the comparison and the synchronization are carried out inside the software lock-step framework.
Cyclic Deterministic Execution: Control
Next, let's look at cyclic deterministic execution, which is one form of deterministic execution. We will open up the redundant execution process to see how it runs internally.
As mentioned earlier, deterministic execution covers three main areas: control, the Worker Pool, and random numbers.
The control part is an API that governs triggering and repetition, allowing the main-thread code of a process to execute cyclically.
How does the control API work? It controls the execution of the process through blocking wait points, as shown in the figure below:
The figure above is an example with redundant execution enabled.
As the figure shows, the process is started first, and then the API for reporting the execution state is called. This API belongs to the execution client class.
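Note that several concepts are easily confused here: the state of a process, the execution state of a process, and the state of a function group. All of them are related to deterministic execution. What is reported at startup is the execution state of the process: the user process tells Execution Management "I am about to run". In the current AP release, the execution state is an enumeration with a single value: kRunning.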
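After reporting, the user process calls WaitForActivation, an API provided by the deterministic client. This is a wait-point API, shown as the green dots in the figure above, and it describes a state attribute.
The API returns a value, and the main thread executes a different cycle depending on that value.
When the cycle has finished, calling WaitForActivation again means waiting for the next activation.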
We keep referring to WaitForActivation, yet the figure above shows only WaitForNextActivation. Why? Because AP release 2011 deprecated WaitForNextActivation and replaced it with WaitForActivation. This article follows the newer release.
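Let's return to WaitForActivation. As noted, the main thread uses the return value of WaitForActivation to decide how to execute the cycle. The return value in effect controls the execution mode of the process. The modes are:
No dedicated budget required
Budget must be allocated:
- kInit: the process initializes its internal data structures
- kRun: the process performs one cycle of its normal cyclic execution
- kTerminate: the process prepares to terminate
As shown in the red-boxed part of the figure below: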
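kRun denotes cyclic execution. There are two kinds of kRun activation behavior:
1. Periodic activation: WaitForActivation returns periodically, according to the defined cycle.
2. Event-triggered activation: WaitForActivation returns when triggered by an external CommunicationEvent, for example an event raised by arriving data or a timer event.
In software lock-step, event-triggered activation is used to initialize and trigger the redundant execution processes.
Note that for a redundant execution process with event-triggered activation, the CycleTimeValue of its cycle is 0.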
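The wait-point pattern described above can be sketched as a simple main loop. This is a Python simulation, not the C++ API of the platform: `FakeDeterministicClient` is a hypothetical stand-in that returns a scripted sequence instead of blocking, and only the three budgeted modes named in the text are modeled.

```python
from enum import Enum, auto
from typing import List


class ActivationReturnType(Enum):
    # Python names chosen for illustration; only three modes are modeled.
    K_INIT = auto()
    K_RUN = auto()
    K_TERMINATE = auto()


class FakeDeterministicClient:
    """Stand-in for the deterministic client; returns a scripted sequence."""

    def __init__(self, run_cycles: int) -> None:
        self._script = ([ActivationReturnType.K_INIT]
                        + [ActivationReturnType.K_RUN] * run_cycles
                        + [ActivationReturnType.K_TERMINATE])
        self._index = 0

    def wait_for_activation(self) -> ActivationReturnType:
        # A real wait point would block here until the next activation.
        mode = self._script[self._index]
        self._index += 1
        return mode


def main_loop(client: FakeDeterministicClient) -> List[str]:
    """Main-thread loop: wait at the wait point, then act on the returned mode."""
    log: List[str] = []
    while True:
        mode = client.wait_for_activation()
        if mode is ActivationReturnType.K_INIT:
            log.append("init")       # initialize internal data structures
        elif mode is ActivationReturnType.K_RUN:
            log.append("run")        # one cycle of normal cyclic execution
        elif mode is ActivationReturnType.K_TERMINATE:
            log.append("terminate")  # prepare for termination
            return log
```

Calling `main_loop(FakeDeterministicClient(run_cycles=2))` walks through one init pass, two run cycles, and the terminate pass.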
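The Worker Pool is a blocking API used within the execution cycle of a process. It speeds up execution by distributing work across different threads.
The Worker Pool is triggered by an API call from the main thread of the process.
Once invoked, it takes the form of a pool containing multiple Workers. There is no parallelism between the pool and the main thread.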
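When the Worker Pool runs, the following are registered:
1. Worker
2. Worker Runnable Object
3. Parameter Object
There are multiple Workers, and together they form the Worker Pool. Beneath the Worker Pool sits the Worker Runnable Object, of which there is only one. Note that Workers are not allowed to exchange data with one another.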
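The structure described above (one runnable, several workers, a container of parameter objects, no data exchange between workers) can be approximated in Python with the standard `concurrent.futures.ThreadPoolExecutor`. The function name `run_worker_pool` is a hypothetical choice for this sketch and does not reproduce the platform's C++ signature.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def run_worker_pool(runnable: Callable[[P], R],
                    container: Sequence[P],
                    num_workers: int = 4) -> List[R]:
    """Apply one runnable to every parameter object and block until all finish.

    Each call receives only its own element, so workers share no data.
    `Executor.map` yields results in container order regardless of which
    worker finishes first, so the output does not depend on thread scheduling.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(runnable, container))
```

Because the call only returns after every element has been processed, the main thread does not run in parallel with the workers, matching the blocking behavior described in the text.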
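Next, deterministic random numbers. They are generated by a deterministic algorithm, so they are pseudo-random numbers.
A function for getting a random number is provided for users to call.
Random numbers serve two main purposes here. First, some algorithms need them, for example particle filters.
Second, for deterministic execution, the random seed must be synchronized between the redundant process and the user process.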
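The GetRandom API is provided in two places: the deterministic client class and the Worker thread class.
When the user process calls GetRandom, it uses the API provided by the deterministic client.
A Worker calls the GetRandom provided by the Worker class. After the call, the Container iterator is used to assign the random number to a specific Container element. The Container is the parameter Object, and its elements are what is iterated over.
Binding each random number to one specific element is what keeps redundant execution deterministic.
The same applies to the non-redundant user process: when one of its Workers calls GetRandom, the number obtained is likewise assigned to a specific Container element before the parameters are used.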
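One way to picture why binding a random number to a container element matters is the following Python sketch. It assumes a scheme in which each number is derived from a shared seed, the cycle number, and the element index; that derivation and the names `element_random` and `randoms_for_cycle` are illustrative assumptions, not the algorithm used by any AUTOSAR implementation.

```python
import random
from typing import List


def element_random(seed: int, cycle: int, element_index: int) -> float:
    """Return the pseudo-random number bound to one container element.

    The value depends only on (seed, cycle, element_index), never on which
    worker thread happens to process the element.
    """
    rng = random.Random(f"{seed}:{cycle}:{element_index}")
    return rng.random()


def randoms_for_cycle(seed: int, cycle: int, container_size: int) -> List[float]:
    """Numbers for all elements of one cycle, in container order."""
    return [element_random(seed, cycle, i) for i in range(container_size)]
```

With the same seed, the user process and the redundant process obtain identical numbers for every element, whatever order their workers run in.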
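Next, timestamps. A timestamp is the point in time at which the current cycle was activated, or the point in time at which the next cycle will be activated. If a next-cycle activation time is configured, that time is returned as well.
If it is not configured, only the activation time of the current cycle is returned.
Timestamps are mainly used in two situations:
Cyclic data processing: timing information may be needed.
Deterministic redundant execution: timestamps must be synchronized between the redundant process and the user process.
The timestamp is obtained by calling the deterministic client's "get activation timestamp" API.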
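As mentioned earlier, the return value of WaitForActivation leads into the next cycle. The timestamp is the point in time at which WaitForActivation returned kRun and triggered the activation, and it is made available to the process when kRun is returned.
There is also an API for getting the next activation timestamp, which gives the time of the next kRun cycle and is returned if one exists.
Note that the timestamp provided to the redundant execution process should be identical to the one provided to the primary process.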
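Release 2011 changed the parameter and return types of the get-activation-timestamp API. In release 1911 the timestamp was passed as a parameter; release 2011 removed that parameter and added a return value of a different type. Keep this in mind when working with tools that target different releases.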
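The relationship between the current and the next activation timestamp can be sketched as follows. `ActivationClock` and its method names are hypothetical Python stand-ins, and times are plain integers in nanoseconds to avoid floating-point differences between the primary and redundant process.

```python
from typing import Optional


class ActivationClock:
    """Hypothetical holder for the timestamps tied to the current kRun cycle."""

    def __init__(self, first_activation_ns: int, cycle_time_ns: Optional[int]) -> None:
        self._current_ns = first_activation_ns
        self._cycle_time_ns = cycle_time_ns  # None: no next activation configured

    def get_activation_time(self) -> int:
        """Time at which the current cycle was activated."""
        return self._current_ns

    def get_next_activation_time(self) -> Optional[int]:
        """Time of the next activation, or None when none is configured."""
        if self._cycle_time_ns is None:
            return None
        return self._current_ns + self._cycle_time_ns

    def advance(self) -> None:
        """Move to the next cycle (when the wait point returns kRun again)."""
        if self._cycle_time_ns is not None:
            self._current_ns += self._cycle_time_ns
```

Two instances constructed from the same values, one per process, report identical timestamps in every cycle, which is the property required for redundant execution.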
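Deterministic Sync Master: Synchronization Control Point
Next, the Deterministic Sync Master. It is a synchronization control point that provides synchronized behavior for both periodic and event-triggered activation in WaitForActivation. An example is shown in the figure below: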
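The figure above shows two user processes, APP1 and APP2. Each calls its own deterministic client provided by Execution Management.
To synchronize with each other, they need the Deterministic Sync Master. The Deterministic Sync Master first waits for a request. When APP1 wants to synchronize with APP2, APP1 calls WaitForNextActivation and then uses ara::com to send a synchronization request to the Deterministic Sync Master.
APP2 does the same and also sends a synchronization request via ara::com. Once the requests have arrived, the Deterministic Sync Master calculates the next cycle for both sides.
For example, if APP1 and APP2 have each already triggered an execution on their own, they can only be synchronized in their next cycle, so the Master has to calculate their next common activation time.
The Master then sends a synchronization response to both. In the next cycle they are activated at the time given in the response, so they are triggered together, which completes the synchronized triggering of APP1 and APP2.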
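A few points to note.
The following settings need to be configured:
1. The number of DeterministicClients connected to the DeterministicSyncMaster
2. The minimum number of synchronization requests required from the connected DeterministicClients
3. The total number of kRun cycles of the DeterministicClients
Synchronization types:
1. For single-domain synchronization, the DeterministicClients and the DeterministicSyncMaster use a local time source, for example the std::chrono API
2. For multi-domain synchronization, the DeterministicClients and the DeterministicSyncMaster use a global time source, for example GPS time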
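The request-collect-respond flow can be sketched in Python as below. The rule used to pick the common time (latest previous activation plus one cycle) is an assumption made for illustration, since the text does not give the calculation; `SyncMasterSketch` and its parameters are hypothetical names.

```python
from typing import Dict, Optional


class SyncMasterSketch:
    """Collects sync requests and computes one common next activation time."""

    def __init__(self, min_requests: int, cycle_time_ns: int) -> None:
        self._min_requests = min_requests
        self._cycle_time_ns = cycle_time_ns
        self._pending: Dict[str, int] = {}  # client id -> previous activation time

    def request(self, client_id: str, previous_activation_ns: int) -> Optional[int]:
        """Record a request; return the shared next activation once enough arrived.

        The returned value is what the master would send to every pending
        client in its synchronization response.
        """
        self._pending[client_id] = previous_activation_ns
        if len(self._pending) < self._min_requests:
            return None  # keep waiting for more clients
        # Clients started at different times, so they can only be aligned on a
        # later cycle: take the latest previous activation and add one cycle.
        next_ns = max(self._pending.values()) + self._cycle_time_ns
        self._pending.clear()
        return next_ns
```

With a minimum of two requests, the first request from APP1 yields no response yet, and the request from APP2 produces the activation time shared by both.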
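Deterministic Sync Master: Deployment
How is the Deterministic Sync Master deployed? There are several options:
1. It can be deployed directly in a standalone process
2. It can be deployed in a software lock-step process
Deployment in a standalone process is shown in the figure below:
When deployed in a software lock-step process, there are two modes:
1. Process mode (left in the figure below)
2. Library mode (right in the figure below)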
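Deterministic Sync Master: Synchronization Control Messages
Synchronization control messages are the request and response messages used for synchronization.
A synchronization request must contain the following data:
1. Service ID: the Service ID of the sender of the synchronization request
2. Instance ID: the Instance ID of the process sending the synchronization request
3. Activation timestamp of the previous cycle: used to calculate the activation time of the next cycle
4. Code of the current cycle: kServiceDiscovery; kInit; kRun
5. Current cycle count: used to determine when to return kTerminate
The corresponding synchronization response must contain the following data:
1. Service ID: the Service ID of the sender of the synchronization response
2. Instance ID: the Instance ID of the process sending the synchronization response
3. Activation timestamp of the next cycle
4. Code of the next cycle: kRun; kServiceDiscovery; kTerminate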
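The two message layouts listed above map naturally onto plain data classes. The following Python sketch uses assumed field types (integers for IDs and nanosecond timestamps) and a hypothetical `build_response` helper that only distinguishes kRun from kTerminate; it is not the wire format of any AUTOSAR implementation.

```python
from dataclasses import dataclass
from enum import Enum


class CycleCode(Enum):
    K_SERVICE_DISCOVERY = "kServiceDiscovery"
    K_INIT = "kInit"
    K_RUN = "kRun"
    K_TERMINATE = "kTerminate"


@dataclass(frozen=True)
class SyncRequest:
    service_id: int
    instance_id: int
    previous_activation_ns: int  # used to compute the next activation time
    current_code: CycleCode
    cycle_count: int             # used to decide when to return kTerminate


@dataclass(frozen=True)
class SyncResponse:
    service_id: int
    instance_id: int
    next_activation_ns: int
    next_code: CycleCode


def build_response(req: SyncRequest, master_service_id: int, master_instance_id: int,
                   cycle_time_ns: int, total_run_cycles: int) -> SyncResponse:
    """Answer a request; the cycle count decides when kTerminate is returned."""
    done = req.cycle_count >= total_run_cycles
    return SyncResponse(
        service_id=master_service_id,
        instance_id=master_instance_id,
        next_activation_ns=req.previous_activation_ns + cycle_time_ns,
        next_code=CycleCode.K_TERMINATE if done else CycleCode.K_RUN,
    )
```

The `total_run_cycles` argument corresponds to the configured total number of kRun cycles mentioned in the settings.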
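Next, deterministic execution in CP AUTOSAR. Here it mainly involves the Timing Extensions and logical execution time.
One purpose of the Timing Extensions is to provide the timing-related requirements that guide the construction of a system. The other is to provide enough timing information to analyze and verify the behavior of the whole system.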
The central concept in the Timing Extensions is the timing view. The VFB timing view contains the timing information, timing descriptions, and timing constraints related to the VFB view.
Likewise, SWC timing contains the timing information, timing descriptions, and timing constraints related to software components (SWCs), and system timing contains those related to the system view.
Beyond these, there are also BSW Module Timing, BSW Component Timing, and ECU Timing. ECU Timing covers the timing information, timing descriptions, and timing constraints related to the ECU view.
Logical execution time (LET) fixes the time that elapses between reading a program's input and writing its output, as shown in the figure:
The red mark on the left is the point at which the program input is read, and the red mark on the right is the point at which the program output is written.
LET is independent of the actual execution time of the process: only the input and output points are fixed. It is a formal description of a function's operation and synchronization, and it is independent of the target hardware.
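The idea that only the read and write points are fixed can be made concrete with a small Python simulation. `simulate_let` is a hypothetical helper, and time is modeled as abstract integer ticks.

```python
from typing import Callable, Dict, List, Tuple


def simulate_let(task: Callable[[int], int],
                 inputs_at: Dict[int, int],
                 period: int,
                 let: int,
                 releases: int) -> List[Tuple[int, int]]:
    """Return (publish_time, value) pairs for a task scheduled under LET.

    Input is sampled exactly at each release time and the output becomes
    visible exactly at release + let. The actual execution time of `task`
    inside that window does not appear anywhere in the result.
    """
    published = []
    for k in range(releases):
        release = k * period
        sampled = inputs_at[release]              # read point (left mark)
        value = task(sampled)                     # may finish anywhere in the window
        published.append((release + let, value))  # write point (right mark)
    return published
```

Whether the task body takes one tick or nearly the whole window on a given target, the published timeline is the same, which is why LET descriptions are portable across hardware.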
Integration of LET and CP development processes
The figure below shows how LET (logical execution time) is used in CP development:
As the figure shows, the initial architecture design is the same as in the traditional CP process. After the architecture design, the LET model is applied: the main task is to set up and simulate an LET configuration.
The next step is to use tools to generate the scheduling results, including the real-time kernel configuration, and then to verify those results.