Automotive SoC Functional Safety Best Practices and Challenges

Last updated: 2022-08-07


The main content of this article is divided into 3 parts (about 10,000 words, a 50-minute read).


Table of contents

01

Functional safety of automotive SoCs

02

Current and emerging challenges

03

Research summary


1

Functional safety of automotive SoCs

In this section, we detail the functional safety development of hardware and software in safety-related automotive SoCs. There are two paradigms for developing safety-related automotive SoCs. On the one hand, an SoC can be custom developed to meet the specific requirements of a Tier 1 supplier for a specific system. In this case, SoC development takes the system-level technical safety concept as input to derive the SoC-level technical safety concept. On the other hand, an SoC can be developed as a Safety Element out of Context (SEooC), which means that it is not tied to a specific system but can be used in similar applications across several systems. In this case, SoC development starts from assumptions about the system-level technical safety concept and the system design. The SEooC development model has grown in popularity recently as semiconductor companies strive to drive value creation by providing the chips they design as design references for Tier 1 suppliers. SEooC development requires more effort because it tends to make more conservative assumptions about the system.


Please note that since an automotive SoC implements only part of an item's functionality, many of the main steps in the concept phase of the item (item definition, HARA, and the functional safety concept) are not present in automotive SoC development. The other steps in the development process of a safety-related SoC are similar to those described in the tailored "ISO 26262: a risk-based development approach" section, except that safety validation is performed only at the vehicle level. Therefore, in this section, instead of repeating the explanation of the process, we delve into some of the technical activities involved. Since the concept of faults is crucial to understanding the safety architecture and implementation, we first review faults from a functional safety perspective. We then discuss the safety analyses used to drive the development and verification of safety architectures. We also detail the safety mechanisms commonly used in automotive chips and then discuss the verification of safety-related SoCs.



Faults in functional safety


Risk is reduced by detecting, controlling, or mitigating failures that could lead to violations of safety goals. ISO 26262 distinguishes between faults, errors, and failures. A fault is an abnormal condition that can cause an element or an item to fail. An error is the discrepancy between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition. A failure is the termination of the ability of an element or an item to perform a function as required. If a fault is not controlled or mitigated in time, it may develop into an error and ultimately a failure. Note that a failure at the component level may become a fault at the system level.


ISO 26262 covers the following two types of failures:

Systematic failures are caused by specification or design issues and manifest themselves in a deterministic way. Systematic failures can occur in both software and hardware, and can be avoided only through improved development processes, including safety analysis and verification. The most common systematic failures are defects introduced during development.

Random hardware failures occur unpredictably over the life of a hardware component and are due to physical processes such as wear, physical degradation, or environmental stress. Random hardware failures can be reduced through reliability engineering, but cannot be completely eliminated.


Random hardware failures can be further divided into the following two categories:

A permanent fault occurs and stays until it is removed or repaired. Examples include stuck-at faults and bridging faults.

Transient faults occur once and then disappear, for example due to electromagnetic interference or alpha particles. As technology nodes shrink, memory elements such as flip-flops and memory arrays become increasingly prone to transient faults. Examples include single-event upsets (SEU) and single-event transients (SET).


Safety mechanisms are measures and technologies built into products to detect, control, and mitigate random hardware failures. The overall effectiveness of the safety mechanisms in an SoC safety architecture can be quantified through hardware architectural metrics. Before defining these metrics, we first introduce the fault classification of ISO 26262, shown in the flowchart in Figure 1. The classification assumes a given safety goal and relies heavily on judgment of a fault's impact on that safety goal.

Figure 1. Fault classification defined by ISO 26262

(Perceived multi-point faults are not considered here, as they are less relevant at the SoC level.)


To classify a fault, we first determine whether the fault is within a safety-related element. If not, the fault is irrelevant; it can be classified as a safe fault and excluded from the safety analysis. If the element is safety-related, the next question is whether, in the absence of safety mechanisms, the fault by itself has the potential to directly violate a safety goal. If not, the fault can be set aside for now and considered again later. If the fault can violate a safety goal, the next question is whether a safety mechanism exists for it. If not, the fault is a single-point fault (SPF), which is generally undesirable in a safety architecture. Where safety mechanisms are present, not all faults can be covered. A fault that is not covered by the safety mechanism is classified as a residual fault.


Faults that are covered by safety mechanisms cannot violate the safety goals by themselves. Together with the faults that have no potential to directly violate a safety goal (which we set aside earlier), they are evaluated for their likelihood of violating a safety goal in combination with another independent fault (a second-order effect). If there is no such possibility, the fault is a safe fault. If such a dual-point fault can violate a safety goal, we evaluate whether a safety mechanism is in place to prevent this. If there is no safety mechanism for it, or the safety mechanism cannot cover it, the fault is classified as a latent multi-point fault (MPF, latent), or simply a latent fault. Otherwise, it is a detected multi-point fault (MPF, detected). Rather than considering a fault in combination with two or more other independent faults, ISO 26262 treats multi-point faults of order greater than 2 as safe faults (their probability of occurrence in the automotive domain is extremely low).
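
To make the flowchart concrete, the following minimal Python sketch walks a fault record through the classification questions above. The record fields and the simplified flow are illustrative assumptions, not part of ISO 26262 itself.

```python
from dataclasses import dataclass

@dataclass
class Fault:
    in_safety_related_element: bool  # inside a safety-related element?
    violates_goal_directly: bool     # could it violate a safety goal by itself?
    has_primary_sm: bool             # is a safety mechanism in place for it?
    covered_by_primary_sm: bool      # does that mechanism actually cover it?
    dual_point_violation: bool       # can it violate a goal with one more independent fault?
    covered_by_latent_sm: bool       # covered by a latent-fault safety mechanism?

def classify(f: Fault) -> str:
    if not f.in_safety_related_element:
        return "safe fault"                      # excluded from the safety analysis
    if f.violates_goal_directly:
        if not f.has_primary_sm:
            return "single-point fault"          # undesirable in a safety architecture
        if not f.covered_by_primary_sm:
            return "residual fault"              # the uncovered part of the fault
    # Faults covered above, and faults with no direct violation potential,
    # are evaluated for second-order (dual-point) effects:
    if not f.dual_point_violation:
        return "safe fault"
    return ("detected multi-point fault" if f.covered_by_latent_sm
            else "latent multi-point fault")

print(classify(Fault(True, True, False, False, False, False)))  # single-point fault
```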


Also, note the difference between latent faults and latent defects. A latent defect is a reliability issue: a defect that escapes production testing and manifests itself during the life of the device. A latent fault is a safety issue: the fault does not cause harm by itself but may cause harm in combination with another independent fault. From a functional safety perspective, a latent defect can fall into any of the fault classes.



Safety analysis


Safety analysis methods can be used to construct the SoC safety architecture, identify its weaknesses, and allocate safety mechanisms. Safety analysis can initially be performed during the architecture design phase, and later as a means to verify the robustness of the safety architecture implementation. We review several common safety analysis methods used in automotive SoC development.


Fault tree analysis

FTA is a top-down (deductive) analysis method that starts from a failure effect and analyzes all possible causes of the failure. It uses a fault tree as a graphical representation of the logical combinations of faults. Figure 2 shows an example fault tree for analyzing failures of a microcontroller unit (MCU). A fault tree starts from a top event, which is decomposed into basic events through logic gates such as AND and OR. For each safety goal, a fault tree can be drawn whose top event is typically the hazard that violates the safety goal. In this example, the top event is the MCU performing an erroneous computation without indication. The MCU top event can be attributed to a logical combination of failures of the MCU's subcomponents. A subcomponent failure, for example of CORE0, can be analyzed further to trace the causes down into its own subcomponents. Basic events are events that cannot be analyzed further; they terminate the fault tree analysis at the leaf nodes. By tracing failure causes down to basic events, safety architects can identify the places in the architecture where safety mechanisms should be allocated to control the basic events. FTA can therefore serve as an effective tool to drive the development of the safety architecture.


After the fault tree is constructed, cut set analysis can be performed to determine whether single-point faults exist in the safety architecture. A cut set is a combination of basic events that can cause the top event. Specifically, a cut set is called a minimal cut set if, when any basic event is removed from the set, the remaining events no longer form a cut set. The order of a cut set is the number of basic events in it. For example, in Figure 2, EV1 and EV2 form a minimal cut set of order 2. A minimal cut set of order 1 indicates a single-point fault in the safety architecture, which suggests that a safety mechanism should be added to control it. While FTA is commonly used as a qualitative analysis method, it can also be used quantitatively to compute the probability of the top event.

Figure 2. An example fault tree for analyzing MCU failures
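
To illustrate cut set analysis, here is a small Python sketch that expands a fault tree into its cut sets and filters out the minimal ones. The tree at the bottom is a hypothetical structure loosely inspired by Figure 2 (the actual tree in the figure is not reproduced here); the gate encoding and event names are assumptions for illustration.

```python
from itertools import product

def cut_sets(node):
    """Return the cut sets (frozensets of basic events) that cause this event."""
    if isinstance(node, str):                    # basic event (leaf node)
        return [frozenset([node])]
    gate, *children = node
    child_sets = [cut_sets(c) for c in children]
    if gate == "OR":                             # any child alone causes the event
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":                            # one cut set from every child
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate {gate}")

def minimal_cut_sets(node):
    sets = cut_sets(node)
    return [cs for cs in sets
            if not any(other < cs for other in sets)]   # drop strict supersets

# Hypothetical tree: the MCU miscomputes if CORE0 and its checker both fail,
# or if an unprotected clock unit fails on its own.
tree = ("OR",
        ("AND", "EV1_core0_fault", "EV2_checker_fault"),
        "EV3_clock_fault")

print(minimal_cut_sets(tree))
# EV3 alone is an order-1 minimal cut set -> a single-point fault to be covered
```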

Failure modes and effects analysis

FMEA is a bottom-up (inductive) analysis method that focuses on how the individual parts of a system can fail (the failure modes) and on the effects of these failures on the system. FMEA can complement FTA and can be used for cross-checking.


Failure modes, effects, and diagnostic analysis

FMEDA is a systematic way to identify and evaluate failure modes, effects, and diagnostic techniques, and to document the system. For a hardware element in an FMEDA, the raw failure rate, the failure modes, and the failure mode distribution are identified for each component of the element. Then the failure effect is evaluated: whether the failure mode has the potential to violate the safety goal. In addition, the safety mechanisms relevant to each failure mode and their diagnostic coverage are identified. Based on these data, FMEDA quantifies the robustness of the safety architecture through the hardware architectural metrics: the single-point fault metric (SPFM) and the latent fault metric (LFM).


Assume the raw failure rate of a safety-related hardware element is λ. From the fault classification, we have

λ = λ_SPF + λ_RF + λ_MPF,D + λ_MPF,L + λ_S    (1)

where λ_SPF is the failure rate associated with single-point faults, λ_RF the failure rate associated with residual faults, λ_MPF,D the failure rate associated with detected multi-point faults, λ_MPF,L the failure rate associated with latent multi-point faults, and λ_S the failure rate associated with safe faults.


Diagnostic coverage of the safety mechanisms can be claimed against residual faults and against latent faults. The SPFM is defined as

SPFM = 1 − Σ(λ_SPF + λ_RF) / Σλ    (2)

where each sum is taken over the λ of the safety-related hardware elements under consideration. The SPFM quantifies the robustness of the hardware elements against single-point and residual faults, achieved through the coverage of safety mechanisms or by design (primarily safe faults). A high SPFM means a low proportion of single-point and residual faults in the hardware elements. The SPFM targets from ASIL B to ASIL D are: ASIL B ≥ 90%, ASIL C ≥ 97%, ASIL D ≥ 99%.


The LFM is defined as

LFM = 1 − Σλ_MPF,L / Σ(λ − λ_SPF − λ_RF)    (3)

where each sum is taken over the λ of the safety-related hardware elements under consideration. The LFM quantifies the robustness of the hardware elements against latent faults, achieved through the coverage of safety mechanisms or by design. A high LFM means a low proportion of latent faults in the hardware elements. The LFM targets from ASIL B to ASIL D are: ASIL B ≥ 60%, ASIL C ≥ 80%, ASIL D ≥ 90%.


In addition to the SPFM and LFM, the probabilistic metric for random hardware failures (PMHF) can be computed to evaluate the overall probability of violating a safety goal of the system-level safety architecture. Its purpose is to demonstrate that the residual risk of violating a safety goal due to random hardware failures is sufficiently low. Although the PMHF is usually computed at the system level, a budget is sometimes assumed for the SoC, i.e., the SoC-level PMHF should not exceed the quantitative target of the applicable safety level, and this budget is provided to system designers for reference. The PMHF can be computed from the failure rates obtained through FMEDA or quantitative FTA.
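
As a worked illustration of Eqs. (2) and (3), the sketch below computes the SPFM and LFM from a toy FMEDA-style table. The element rows and FIT numbers are invented for illustration; a real FMEDA tracks failure modes, effects, and diagnostic coverage per component.

```python
# Failure rates in FIT (failures per 1e9 device hours); numbers are made up.
# Each row splits the element's total rate per Eq. (1):
# total = spf + rf + mpf_d + mpf_l + safe
elements = [
    {"total": 100.0, "spf": 0.0, "rf": 1.0, "mpf_d": 9.0, "mpf_l": 2.0, "safe": 88.0},
    {"total":  50.0, "spf": 0.0, "rf": 0.4, "mpf_d": 4.0, "mpf_l": 0.6, "safe": 45.0},
]

def spfm(rows):
    total = sum(r["total"] for r in rows)
    spf_rf = sum(r["spf"] + r["rf"] for r in rows)
    return 1.0 - spf_rf / total                           # Eq. (2)

def lfm(rows):
    denom = sum(r["total"] - r["spf"] - r["rf"] for r in rows)
    mpf_latent = sum(r["mpf_l"] for r in rows)
    return 1.0 - mpf_latent / denom                       # Eq. (3)

print(f"SPFM = {spfm(elements):.2%}")  # 99.07% -> meets the ASIL D target of >= 99%
print(f"LFM  = {lfm(elements):.2%}")   # 98.25% -> meets the ASIL D target of >= 90%
```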

Dependent failure analysis

Dependent failure analysis (DFA) consists of identifying and analyzing possible common-cause failures and cascading failures between given elements, assessing the risk of violating a safety goal (or safety requirement), and defining the safety measures necessary to mitigate that risk. Its purpose is to evaluate potential weaknesses in the safety concept and thereby provide evidence that the requirements for independence and freedom from interference are met.


Dependent failure initiators (DFIs) represent the root causes of dependent failures within the scope of the analysis. DFIs cannot be analyzed by the standard safety analyses, but they can be handled qualitatively by DFA. DFIs include the following:

➡ Failures of shared resources

➡ Single physical root causes

➡ Faults due to environmental conditions

➡ Development faults

➡ Production faults

➡ Installation faults

➡ Maintenance faults

Different measures need to be developed to address the different types of DFIs.



Safety design and implementation


Automotive SoCs are used in a wide variety of products, including MCUs and microprocessor units (MPUs), radar front-end monolithic microwave integrated circuits (MMICs), power management integrated circuits (PMICs), system basis chips (SBCs), and various sensors. Depending on the product architecture, a wide range of safety mechanisms is used in automotive SoCs. In this section, we classify safety mechanisms according to how they work and what they are used for.


Safety mechanisms classified by error detection method

Many safety mechanisms rely on the ability to detect faults and errors. There are roughly three detection methods: redundancy, monitoring, and testing.


Error detection through redundancy uses redundant computation or storage to detect whether an error is present in the target function. There are three different types of redundancy.

➡ Hardware redundancy typically performs the same computation on different hardware modules, so that if a fault in the functional module causes an error, it can be detected by comparing the results of the redundant modules. Dual module redundancy (DMR) is commonly used in automotive SoCs: two processing units execute the same computational tasks in the same way (lockstep mode), and their results are compared by a checker module. DMR allows error detection but cannot correct errors by itself; error handling is usually done by other parts of the system. Triple module redundancy (TMR) allows both error detection and error correction, at additional cost. In SoCs, TMR is mainly used for critical registers, for example registers storing trim values, and is implemented with triple voting flip-flops (TVFs).

➡ Information redundancy uses redundant encoding of information to detect and correct errors, for example error correction codes (ECC) and parity. Such safety mechanisms are typically used to protect data communication channels and memories (a parity sketch follows this list).

➡ Time redundancy performs the same computation repeatedly, possibly on the same hardware but with different algorithms. By repeating the computation, soft errors affecting the computation are likely to be detected. When the different algorithms execute on different parts of the same hardware element, even permanent hardware faults can be detected.
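
As a toy illustration of information redundancy from the list above: a single even-parity bit detects any single-bit error in a stored word, while a full ECC scheme such as SECDED could also correct it. This is a sketch only, not a production ECC implementation.

```python
def parity(word: int) -> int:
    return bin(word).count("1") % 2          # even parity bit over the data bits

def protect(word: int) -> tuple[int, int]:
    return word, parity(word)                # store data together with its check bit

def check(word: int, p: int) -> bool:
    return parity(word) == p                 # False -> a single-bit error detected

data, p = protect(0b1011_0010)
corrupted = data ^ (1 << 3)                  # a transient fault flips bit 3
assert check(data, p) and not check(corrupted, p)
```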


Note that redundancy is usually combined with diversity to avoid common-cause failures. Diversity can be in time, in algorithm, or in physical implementation. Repeating the same computational task with different algorithms is an example of algorithmic diversity. In a DMR lockstep configuration, when the processing unit is duplicated, the physical layout of the redundant module is often rotated to achieve physical diversity. Furthermore, in DMR, a delayed lockstep configuration is usually adopted to achieve temporal diversity. Figure 3 shows a simplified block diagram of the delayed lockstep DMR configuration.

Figure 3. Simplified diagram of the delayed lockstep configuration

The input data is fed directly into the main processing unit and, delayed by one or two clock cycles, into the redundant processing unit. The output of the main processing unit goes directly to the system. The main output is also branched and, after a time delay equal to the input delay of the redundant unit, compared with the output of the redundant processing unit. This configuration protects against common-cause failures, such as those caused by clock faults.
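
The following cycle-level Python toy model mimics the delayed lockstep of Figure 3 under an assumed two-cycle delay; the replicated computation and the injected fault are invented for illustration.

```python
from collections import deque

DELAY = 2  # assumed input/output delay in clock cycles

def run(inputs, main_unit, redundant_unit):
    in_delay = deque([0] * DELAY)      # delays the input to the redundant unit
    out_delay = deque([0] * DELAY)     # delays the main output for comparison
    errors = []
    for t, x in enumerate(inputs):
        main_out = main_unit(x)        # main result goes straight to the system
        out_delay.append(main_out)
        in_delay.append(x)
        red_out = redundant_unit(in_delay.popleft())
        # checker compares the delayed main output with the redundant output;
        # skip the first DELAY cycles while the pipelines fill up
        if out_delay.popleft() != red_out and t >= DELAY:
            errors.append(t)
    return errors

compute = lambda x: 3 * x + 1                           # the replicated computation
faulty = lambda x: compute(x) ^ (4 if x == 5 else 0)    # bit flip on one input

print(run(range(10), faulty, compute))
# -> [7]: the fault injected at t=5 is flagged two cycles later by the checker
```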


Another error detection method is to continuously or periodically monitor critical components or parameters for anomalies. Monitoring usually assumes that the behavior of the component or parameter should stay within a presumed normal range, and it flags behavior outside that range. Examples include the monitoring of supply voltages, currents, clocks, or bus protocol interfaces. Another example is a software watchdog, which monitors whether a processing unit has hung.
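
A minimal sketch of the watchdog idea follows: the monitored task must "pet" the watchdog within a timeout, otherwise an anomaly is flagged. A real implementation is usually a hardware or OS-level watchdog timer that resets or interrupts the system; the thread-based version here is only illustrative.

```python
import threading, time

class Watchdog:
    def __init__(self, timeout_s: float, on_expire):
        self._timeout = timeout_s
        self._on_expire = on_expire
        self._last_pet = time.monotonic()
        threading.Thread(target=self._watch, daemon=True).start()

    def pet(self):
        self._last_pet = time.monotonic()    # called periodically by the healthy task

    def _watch(self):
        while True:
            time.sleep(self._timeout / 4)
            if time.monotonic() - self._last_pet > self._timeout:
                self._on_expire()            # task hung: trigger error handling
                return

wd = Watchdog(0.2, lambda: print("watchdog expired: processing unit hung"))
for _ in range(3):
    time.sleep(0.05)
    wd.pet()                                 # healthy phase: watchdog stays quiet
time.sleep(0.5)                              # simulated hang -> watchdog fires
```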


Testing detects faults by comparing the computed results of running test patterns or test programs against pre-computed results. The main difference between testing and monitoring is that monitoring usually runs in parallel with the operation of the element, whereas testing usually requires the element to leave normal operation and enter a test mode. Testing is therefore intrusive to normal operation, in contrast to monitoring. Examples of testing include logic built-in self-test (LBIST) and memory built-in self-test (MBIST), which typically run when the SoC starts up or shuts down. Another example is the loopback testing of communication data path connectivity in radar MMICs.

Safety mechanisms classified by targeted faults

Safety mechanisms can also be classified by the types of faults they target. For example, they can be divided by purpose into safety mechanisms against single-point faults and safety mechanisms against latent faults. Safety mechanisms against single-point faults are the primary safety mechanisms, because single-point faults can directly violate safety goals. Fault detection through redundancy and continuous monitoring typically fall into this category. Latent faults are usually covered by test-based safety mechanisms such as LBIST and MBIST.


Test-based safety mechanisms are not always limited to detecting latent faults. BIST mechanisms usually cannot be used against single-point faults because the BIST duration is longer than the fault tolerant time interval (FTTI) and BIST is inherently intrusive to normal operation, so it typically cannot run while the chip is operating. This holds for MCUs, but it is not always true for radar front-end MMICs, owing to the characteristics of the radar duty cycle. In some radar applications, each radar operating cycle is divided into an active period and a quiet period. The MMIC transmits and receives data during the active period and is idle during the quiet period, so tests are usually run in the quiet period. In some radar applications the FTTI is considered to be one to two radar operating cycles, which means the test duration fits within the FTTI. Furthermore, MMICs act as the radar transmitter and receiver and send the acquired data to an MCU for processing; the bookkeeping of radar data is usually done by the MCU. MMICs can therefore often be considered stateless, and the tests can be considered non-interfering with the radar function.


Another classification can be based on whether a safety mechanism targets permanent faults or transient faults, although some cover both. For example, ECC can detect errors caused by both permanent and transient faults. Test mechanisms can usually only detect permanent faults, as the effects of a transient fault have likely disappeared by the time the test runs.



Safety verification


Safety verification ensures that the safety requirements are satisfied by the implementation of the safety architecture. Note that the terms verification and validation as defined in ISO 26262 differ from their definitions in the semiconductor domain. In ISO 26262, verification methods include reviews, walkthroughs, inspections, simulation, formal verification, engineering analysis, and so on. Safety validation refers specifically to validation at the vehicle level. In this article, we focus on the pre-silicon verification of automotive SoC safety (through simulation and formal methods).


One effective means of functional safety verification is fault injection testing (commonly called fault injection). Faults are injected into a design model, and the fault detection and fault response of the safety mechanisms are observed. Fault injection results can serve as strong evidence for the diagnostic coverage claimed in the FMEDA. The goals of fault injection include:

➡ Determining the diagnostic coverage of the safety mechanisms

➡ Determining the diagnostic time and the fault handling time

➡ Determining fault effects


The key elements of a fault injection campaign include:

➡ Design model: it can be at register-transfer level or gate level, or even at higher levels (system level)

➡ Fault locations and fault types: the fault list can be randomly sampled or derived from identified critical failure modes

➡ Functional stimulus: it should be representative of the workload or use case

➡ Observation points: the points at which fault effects and fault diagnosis should be observed


Based on the fault injection test results, combined with expert judgment, faults can be classified according to the criteria in the "Faults in functional safety" section. Based on the fault classification and diagnostic coverage, the FMEDA report is then updated with the more accurate data obtained from the detailed design implementation. We discuss the technical challenges of fault injection in the next section.



2

Current and emerging challenges

In this section, we identify several challenges in automotive SoC design and verification for achieving functional safety in current and next-generation products.



Trade-offs between safety and PPA


Achieving functional safety comes at a cost. Implementing safety mechanisms inevitably incurs overhead in performance, power, and area (PPA). For example, DMR more than doubles the area. Running LBIST across the entire chip causes large power consumption. Existing safety analyses focus only on faults and diagnostic coverage, and do not consider the design cost quantitatively. Safety analysis and PPA-driven design are often carried out separately. Although SoC architects and safety architects are aware of the trade-off between functional safety and PPA, systematically analyzing the trade-off to arrive at an optimal architecture remains a challenge. Such an analysis could start early in the architecture definition and iterate through the design stages. The parameter space to be considered in a traditional design process is already enormous; adding the functional safety dimension makes the problem even more challenging.



Challenges in fault injection campaigns


Fault injection campaigns are an effective tool for verifying the effectiveness of safety mechanisms and confirming the diagnostic coverage claimed in the safety analyses. Today, fault injection campaigns face several major challenges:


For contemporary automotive SoCs, the fault space is enormous. With low-level fault models, given the size of today's automotive SoCs, there can be millions of faults in an IP block of reasonable size. If transient faults are also considered, the additional time dimension makes the fault space even more intractable. Sometimes dozens or hundreds of tests are simulated to ensure that the functional stimulus of the fault injection is representative of the actual workloads of the SoC, which makes the problem computationally even harder.


In practice, manual selection and statistical sampling are used to reduce the fault space. The limitation of manual selection is that it typically requires expert judgment and deep knowledge of the specific design. It is also difficult to derive a probability distribution for the manually selected faults in order to compute diagnostic coverage. A commonly used statistical sampling method is sampling based on a confidence level and a confidence interval. Its limitation is that with a high confidence level and a narrow confidence interval it usually gives a very conservative bound, so the sample size may still be large.
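
For reference, the usual normal-approximation bound behind confidence-based sampling can be computed as below; the worst-case proportion p = 0.5 and the numbers are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size(confidence: float = 0.99, half_interval: float = 0.01,
                p: float = 0.5) -> int:
    """Faults to sample so the estimated coverage is within +/- half_interval
    of the true value at the given confidence (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided z-score
    return ceil(z**2 * p * (1 - p) / half_interval**2)

# A high confidence level and a narrow interval keep the sample large,
# even though it no longer grows with the size of the fault space:
print(sample_size(0.99, 0.01))   # about 16588 faults to simulate
```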


For digital circuits, fault simulation techniques have been researched for at least three decades, and advanced algorithms are commercially available that accelerate simulation by simulating thousands of faults concurrently. For analog circuits, however, fault simulation is very challenging even when faults are simulated concurrently. In some cases, sensitivity analysis has been used to speed up the simulation. Nevertheless, the general application of concurrent fault simulation in the analog domain remains an open problem. For analog and mixed-signal circuits, current simulation techniques focus on low-level fault models, which limits their use at larger scales. Fault modeling at the right level of abstraction would greatly help the fault simulation of analog and mixed-signal SoCs.


Formal verification methods have also been introduced into fault injection to analyze fault propagation and detection. There are several limitations. The scalability of formal methods inherently hampers their application to large designs. Moreover, if the environmental constraints of the design are not properly formulated, formal methods tend to find unrealistic scenarios, which can become a black hole for engineering debug time.



Diagnostic coverage of test-pattern-based safety mechanisms


Today, test-pattern-based safety mechanisms are receiving more and more attention because they require fewer hardware resources and are more flexible. However, the development effort and complexity of such test patterns can be challenging.


LBIST is a traditionally common test-pattern-based safety mechanism, yet to date it still has limitations with respect to functional safety. LBIST aims to protect against latent faults. However, LBIST patterns are constructed to cover structural faults, without considering the fault classification in the functional safety context. The PPA overhead of LBIST is large. The use of scan chains makes the test time long, so meeting customer requirements is often difficult. Techniques for reducing the test time usually cause excessive power consumption, because they tend to increase the simultaneous activity on the chip. The industry has therefore been moving toward self-test mechanisms based on functional test patterns, which are more flexible, more lightweight, and can be developed to target the faults of interest. With the effort invested in developing software test libraries, users can run tests not only at reset but also when the application is idle.


Despite the obvious benefits, developing functional test patterns is technically challenging. Functional test patterns cannot take advantage of design-for-testability (DFT) features such as scan chains, so they can be limited in fault controllability and observability. Due to the lack of functional test generation tools, substantial engineering effort may be needed to draft tests manually in order to achieve the required diagnostic coverage. To address these challenges, research on DFT techniques that aid functional test generation is urgently needed.



Safety mechanisms for emerging accelerators


With the popularity of domain-specific computing, accelerators occupy an ever larger share of the chip industry. In the automotive domain, new accelerators are being designed for vision processing, radar and lidar data processing, and deep neural network (DNN) inference. Unlike the general-purpose components of an SoC (such as CPUs, fabrics, and memories), accelerators are designed for domain-specific computing tasks. Simply applying traditional safety mechanisms such as DMR to accelerators may be neither effective nor economical.


To design effective safety mechanisms for emerging accelerators, it is beneficial to exploit the domain-specific characteristics of the accelerators. This requires thinking across hardware and software, as well as a deep understanding of system-level safety mechanisms. Designing effective safety mechanisms for domain-specific accelerators is an open research area, and innovation in this area is highly sought after.



Challenges in achieving fail-operational behavior


The development trend of autonomous driving systems requires future automotive electronic/electrical systems to be able to operate despite failures. This requirement may also be passed on to automotive SoCs, meaning the SoCs can continue to operate normally or operate in a degraded mode. Intuitively, this can often be achieved with redundant computing resources. However, since the automotive market remains a cost-sensitive market, this means that such fail-operational behavior should be achieved without causing an unacceptable increase in resources.


Virtualization is a possible direction, because it can provide the high availability needed for fail-operational behavior. It requires efficient fault localization in the hardware and error handling at the software level. Although virtualization has proven successful in cloud computing, many questions remain open in applying it to automotive embedded applications.



3

Research summary

This section highlights some of the recent advances and research efforts proposed to carry the functional safety of automotive SoCs into the era of autonomous driving. This is by no means a comprehensive review; its purpose is to give readers a taste of some of the interesting work in this area.



Enabling application-specific safety mechanisms


Since violations of safety goals are closely related to the functionality of the item in question, awareness of this functionality enables safety mechanisms to detect and control failures most effectively at the system level. The challenge in automotive SoC development is that the details of the application are not visible, especially for an MCU developed as an SEooC. Therefore, it would be ideal to have configurable and extensible mechanisms in the SoC that can be provided to Tier 1 suppliers to enable application-specific protection.


A notable recent innovation is the software safety concept, implemented through configurable safety mechanisms such as the Time Monitoring Comparator (TMC) and the Timed Multi-Watchdog Processor (TMWDP). TMC works in a software lockstep environment, where the same computational task is performed by two software threads, most likely using different implementations. One software thread may be accurate and resource-hungry, while the other may use fewer resources to produce a less accurate result within a known bound. Traditionally, software lockstep requires both software threads to synchronize and periodically compare their results before continuing, which hurts performance. TMC improves performance by using a dedicated hardware monitor to compare the results generated by the two software threads at controlled time intervals. It thereby ensures both that the results of the two threads are comparable and that the progress of the two threads stays within a bounded time interval.
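
A simplified software sketch of the lockstep-with-monitor idea follows. The two diverse implementations, the tolerance, and the deadline are invented for illustration; in the actual TMC the comparison is done by a dedicated hardware monitor rather than by host code.

```python
import math, time
from concurrent.futures import ThreadPoolExecutor

def precise_sqrt(x: float) -> float:
    return math.sqrt(x)                       # accurate, resource-hungry thread

def approx_sqrt(x: float, iters: int = 10) -> float:
    g = max(x, 1.0)                           # cheaper thread: Newton iteration
    for _ in range(iters):
        g = 0.5 * (g + x / g)
    return g

def tmc_compare(x: float, tol: float = 1e-3, deadline_s: float = 0.1) -> float:
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=2) as pool:
        a = pool.submit(precise_sqrt, x)      # the two "software lockstep" threads
        b = pool.submit(approx_sqrt, x)
        ra = a.result(timeout=deadline_s)
        rb = b.result(timeout=deadline_s)
    if time.monotonic() - start > deadline_s:
        raise RuntimeError("progress check failed")   # bounded-progress violation
    if abs(ra - rb) > tol * max(abs(ra), 1.0):
        raise RuntimeError("lockstep mismatch")       # comparison failure
    return ra

print(tmc_compare(2024.0))
```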


TMWDP protects the integrity of the application software control flow. It assumes that system application developers understand the high-level control flow of the application software. The timed watchdog is a timed state machine derived from that control flow. It checks for bad state sequences and bad state sequence timing, as well as starvation (the application staying in one state for too long).
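
The sketch below shows a timed state machine in the spirit of TMWDP, checking state sequences, transition timing, and starvation against an assumed application control flow; the states and timing budgets are invented for illustration.

```python
# Allowed transitions and maximum dwell time per state (hypothetical control flow).
ALLOWED = {"INIT": {"SENSE"}, "SENSE": {"COMPUTE"},
           "COMPUTE": {"ACTUATE"}, "ACTUATE": {"SENSE"}}
MAX_DWELL_MS = {"INIT": 50, "SENSE": 10, "COMPUTE": 20, "ACTUATE": 10}

def check_trace(trace):
    """trace: list of (state, entry_time_ms) events reported by the application."""
    for (s, t), (s_next, t_next) in zip(trace, trace[1:]):
        if s_next not in ALLOWED[s]:
            return f"bad state sequence: {s} -> {s_next}"
        if t_next - t > MAX_DWELL_MS[s]:
            return f"starvation: stayed in {s} for {t_next - t} ms"
    return "ok"

print(check_trace([("INIT", 0), ("SENSE", 40), ("COMPUTE", 45),
                   ("ACTUATE", 60), ("SENSE", 65)]))   # -> ok
print(check_trace([("SENSE", 0), ("COMPUTE", 30)]))    # -> starvation in SENSE
```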



Safety of deep neural networks


With the recent breakthroughs of deep learning in computer vision, DNNs are becoming increasingly attractive on the road from ADAS to autonomous driving. Significant research and development effort has been spent on exploring and deploying DNNs for perception tasks such as pedestrian detection, vehicle tracking, road sign classification, and distance estimation. Some are even trying to use DNNs for end-to-end autonomous driving. Specialized accelerators have been developed to support the deployment of DNNs in real-time applications. As always, however, the safety of such applications and of the specialized accelerators is a critical concern.


The use of a DNN consists of two stages: training and inference. Training refers to the process of finding the optimal model that fits the training samples without losing generality. The model is essentially a set of weights associated with the neurons of a DNN architecture. The model can be stored on the chip and loaded by applications to make predictions on new data samples, which is called inference.


The safety of DNNs includes two aspects: safety of the intended functionality and functional safety. Safety of the intended functionality involves questions such as: if my DNN model classifies a stop sign as a speed limit sign, what is the safety impact, and how can I mitigate it? Functional safety involves the question: if a fault in my DNN accelerator changes its intended behavior, what are the safety implications? Faults may occur during training or inference. Since training is usually done offline (as part of development), faults in the training phase can be controlled through exhaustive verification. For the inference phase, the main focus is the failure impact and mitigation of random hardware faults.


Recent work explores the error propagation of faults in modern neural networks and proposes safety measures based on the lessons learned from experiments. These works focus on studying the impact of the inference engine architecture and design parameters on safety. Additionally, the safety impact of training hyperparameters on inference can be explored. More specifically, dropout is a recent regularization technique in DNN training used to avoid overfitting. The idea of dropout is to randomly disconnect neurons during training so that activations do not depend on a few neurons. Dropout increases the information redundancy in the model and may therefore make the inference phase more resilient to faults. It would be interesting to explore this implication in quantitative terms.
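
For concreteness, here is a tiny NumPy sketch of (inverted) dropout: each neuron's activation is zeroed with probability p during training and the survivors are rescaled, so the layer's output cannot depend on any few neurons. Whether the redundancy this induces measurably improves fault resilience at inference time is precisely the quantitative question raised above.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                     # inference: dropout is disabled
    mask = rng.random(activations.shape) >= p  # keep each unit with probability 1-p
    return activations * mask / (1.0 - p)      # rescale to preserve the expectation

h = rng.standard_normal(8)
print(dropout(h))  # roughly half the activations are zeroed, the rest are scaled
```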


We provide an overview of functional safety development for safety-related automotive SoCs. We describe the overall process of functional safety development of related items and illustrate the practices for achieving functional safety in automotive SoCs. We highlight the challenges and research efforts to achieve functional safety of automotive SoCs in ADAS and future autonomous driving applications. Although this article covers a broad range of topics related to functional safety in automotive SoCs, we believe we have only scratched the surface of challenges and research opportunities. Additionally, due to space limitations, certain topics related to ISO 26262, such as ASIL decomposition and software tool confidence levels, are not covered. Nonetheless, we hope this article provides a starting point for semiconductor professionals and researchers to understand industrial practices and functional safety challenges. There is an urgent need to advance research beyond the current scope in order to develop fail-operational systems for future autonomous driving.



Original text: Practices and Challenges for Achieving Functional Safety of Modern Automotive SoCs.


Author: Wen Chen, Jayanta Bhadra.


Translation: Xu Yiyi, Liu Zhaozhao, Guo Jin


Translation review: Ross Kang, Erik Tang


Source: Public account: SASETECH




