How to solve the SRAM scaling problem?
Source: Compiled from semiengineering by Semiconductor Industry Observation (ID: icbank). Thank you.
SRAM's inability to scale is challenging power and performance goals, forcing the design ecosystem to come up with strategies ranging from hardware innovation to rethinking design layouts. Meanwhile, SRAM has become the workhorse memory for artificial intelligence, despite the age of its original design and its current scaling limitations.
There have always been trade-offs between SRAM and DRAM. SRAM's most common configuration uses six transistors, which makes it faster to access than DRAM, but at greater area and cost per bit. DRAM, in contrast, uses a one-transistor/one-capacitor design and therefore costs less. But DRAM's performance is compromised because its capacitors leak charge and must be refreshed, and refreshed more often as the memory heats up. As a result, SRAM has been the memory of choice in applications where low latency and reliability are priorities for more than 60 years since its introduction.
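For a rough sense of the refresh burden described above, the sketch below estimates the fraction of time a DRAM device spends refreshing. The retention windows, refresh-command count, and tRFC value are illustrative assumptions chosen for the example, not figures from the article.

```python
# Rough estimate of DRAM refresh overhead (illustrative numbers, not from the article).
# A DRAM device must refresh all rows within a retention window; the window is
# typically halved at high temperature, which is the effect described above.

def refresh_overhead(refresh_cmds_per_window: int, t_rfc_ns: float, window_ms: float) -> float:
    """Fraction of device time consumed by refresh commands."""
    refresh_time_ns = refresh_cmds_per_window * t_rfc_ns   # total time spent refreshing
    window_ns = window_ms * 1e6                             # retention window in ns
    return refresh_time_ns / window_ns

# Assumed parameters: 8192 refresh commands per window, tRFC = 350 ns.
normal = refresh_overhead(8192, 350.0, 64.0)   # nominal temperature, 64 ms window
hot    = refresh_overhead(8192, 350.0, 32.0)   # elevated temperature, 32 ms window

print(f"refresh overhead (nominal): {normal:.1%}")   # ~4.5%
print(f"refresh overhead (hot):     {hot:.1%}")      # ~9.0%
```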
Figure 1: SRAM cell size is shrinking more slowly than the process, causing its area in F² to balloon. Source: Objective Analysis / Emerging Memory Technology Report
In fact, SRAM's role is outsized when it comes to AI/ML applications. "SRAM is critical for artificial intelligence, especially embedded SRAM. It is the highest-performance memory, and it can be integrated directly with high-density logic. For those reasons alone, SRAM is very important," said Tony Chan Carusone, CTO of Alphawave Semi.
Power consumption and performance challenges
However, SRAM has struggled to keep up with CMOS scaling, which hurts both power and performance. "In traditional planar device scaling, gate length and gate oxide thickness were reduced simultaneously to improve performance and control short-channel effects. Thinner oxide enables performance improvements at lower VDD levels, which benefits SRAM greatly by reducing leakage and dynamic power," said Jongsin Yun, memory technologist at Siemens EDA. "However, in recent technology-node migrations we have seen little further scaling of oxide thickness or VDD. In addition, shrinking transistor geometries have resulted in thinner metal interconnects, which increases parasitic resistance and so causes more power loss and RC delay. As AI designs demand ever more internal memory accesses, how SRAM can further extend its power and performance advantages through each node migration has become a major challenge."
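Yun's point can be summarized with the standard first-order relations for dynamic power, leakage power, and wire delay (a textbook sketch, not formulas from the article):

$$P_{\text{dyn}} = \alpha\, C\, V_{DD}^{2}\, f, \qquad P_{\text{leak}} \approx I_{\text{leak}} \cdot V_{DD}, \qquad \tau_{\text{wire}} \approx R_{\text{wire}}\, C_{\text{wire}}$$

If VDD stops scaling, the quadratic term in the dynamic power stops shrinking, and if thinner interconnect raises the wire resistance, both the RC delay and the resistive losses grow, which is exactly the squeeze Yun describes.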
These issues, coupled with the high cost of SRAM, inevitably lead to performance compromises. Beyond relying exclusively on SRAM, there's a whole hierarchy of memory/storage options, starting with off-chip DRAM, which comes in various speeds and architectural configurations.
"If you can't get enough SRAM to meet the data storage needs of the processor core, then the core will eventually have to move data from farther away," said Steve Woo, a researcher and distinguished inventor at Rambus. . "Moving data between SRAM and DRAM requires additional power, so the system consumes more power. And accessing data from DRAM takes longer, so performance degrades."
On each new node, the situation may not improve, and may even get worse.
"Looking towards nanosheets, SRAM size scaling is expected to be small," said imec DTCO program director Geert Hellings. "One could argue that replacing the fins (5nm wide) with nanosheets (~15nm wide) would add 40nm to the SRAM bit-cell height (4 fins each), if all other process/layout margins remain the same. That is obviously not a great value proposition, so flanking process/layout margin improvements are expected to offset this. Still, scaling SRAM from finFETs to nanosheets is an uphill battle."
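The 40nm figure follows directly from the widths Hellings quotes, assuming each of the four fins is replaced one-for-one by a nanosheet (a back-of-the-envelope reading, not imec's detailed layout analysis):

$$\Delta H_{\text{bitcell}} \approx 4 \times (15\,\text{nm} - 5\,\text{nm}) = 40\,\text{nm}$$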
Flex Logix, which has worked on several of the most advanced nodes, including TSMC's N7 and N5, and recently received a PDK for Intel's 18A node, is very familiar with these challenges. "Our customers working on advanced nodes are complaining that the logic scales better and faster than the SRAM," said Geoffrey Tate, CEO of Flex Logix. "This is a problem for the processor, because the cache memory is now bigger than the processor itself. But if you don't put it on the chip, your performance drops like a stone."
TSMC is hiring more memory designers to increase SRAM density, but whether they can get more out of SRAM remains to be seen. "Sometimes you can make things better by hiring more people, but only up to a point," Tate said. "Over time, customers will need to consider not using SRAM as heavily as they do now."
The fact that SRAM stopped scaling with logic as early as 20nm foreshadows the power and performance challenges that arise when on-chip memory threatens to become larger than the logic it serves. To deal with these problems, system designers and hardware developers are applying new solutions and developing new technologies.
Along these lines, AMD has taken a different approach. "They have introduced a technology called 3D V-Cache that allows additional SRAM cache to be stacked on a separate die, thereby increasing the amount of cache available to the processor cores," said Rambus' Woo. "The extra chips add cost, but they allow access to additional SRAM. Another strategy is to use multi-level caches. Processor cores can have private (non-shared) level 1 and level 2 caches that only they can access, along with a larger last-level cache (LLC) shared between the processor cores. Because the processor has so many cores, the shared LLC lets some cores use more capacity at times while others use less, so the total capacity is used more efficiently."
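To see why the cache hierarchy Woo describes pays off, the sketch below computes average memory access time (AMAT) through private L1/L2 caches and a shared LLC. All hit rates and latencies are illustrative assumptions, not AMD or Rambus figures.

```python
# Average memory access time through a private L1/L2 plus a shared LLC.
# All hit rates and latencies below are illustrative assumptions, not vendor data.

def amat(levels, dram_latency_ns):
    """levels: list of (hit_rate, hit_latency_ns) from L1 outward."""
    miss_prob = 1.0
    total = 0.0
    for hit_rate, latency in levels:
        total += miss_prob * latency      # accesses that reach this level pay its latency
        miss_prob *= (1.0 - hit_rate)     # fraction that continues to the next level
    return total + miss_prob * dram_latency_ns

small_llc = amat([(0.95, 1.0), (0.80, 4.0), (0.50, 15.0)], dram_latency_ns=100.0)
big_llc   = amat([(0.95, 1.0), (0.80, 4.0), (0.80, 18.0)], dram_latency_ns=100.0)  # larger (e.g., stacked) LLC: higher hit rate, slightly slower

print(f"AMAT with small shared LLC:  {small_llc:.2f} ns")
print(f"AMAT with larger shared LLC: {big_llc:.2f} ns")
```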
Error correction
Scaling also raises reliability issues. "SRAM traditionally has been built at more aggressive, smaller dimensions than logic cells," said Cheng Wang, CTO of Flex Logix. "But unlike a traditional logic gate, where there is never contention and you are always writing a new value, in SRAM you have to overcome the current value when you write, yet you want the cell to hold its value when you are not writing. So you have a dilemma: in normal operation it cannot be so weak that it loses its value, but when you write it you want it to be weaker. Because SRAM has only six transistors, you can't add a lot of gates to make it weaker during writes and stronger when you're not writing. And you can't make the SRAM cell too small, because that can lead to single-event upsets (SEUs) from problems such as alpha particles, where the energy of the ion overwhelms the energy stored in the SRAM cell. That will happen more as SRAM shrinks."
Therefore, error correction may become a common requirement, especially for automotive devices, Wang said.
Tate said SEUs have become an issue at advanced nodes, so radiation-hardening techniques previously used only in mil/aero applications are being applied to SRAMs at N5 and below. However, since rad-hardening can add 25% to 50% to the cost, it only makes sense for devices such as pacemakers, where no one can afford to wait for a reboot.
"Probably in 10 years, everything will have to be carefully engineered. You can't keep making storage elements smaller and smaller." "We're not getting rid of alpha particles."
Basic approach: trade-offs
This is driving a number of changes on the design side. "Everybody is trying to use less RAM on-chip, because it isn't getting any smaller," Wang said. "But you still use RAM for its bandwidth, so that bandwidth has to be there. As chips get bigger, high-capacity memory is being pushed off-chip to DDR, but you will still keep large numbers of smaller, high-bandwidth memories, namely SRAM, on-chip."
Another approach designers are taking is to use single-port memories wherever possible. "At older process nodes, when we built register files, we were more likely to use dual-port memories," he said. "But all of that adds area, too. So at the newer nodes, designers try to make everything go through a single port on the memory, because those are the smallest, densest options available. They aren't necessarily moving away from SRAM, but they try to use single-port memories wherever they can. They try to use smaller memories, and they choose SRAM for its available bandwidth rather than for really large storage. Large storage either moves to DRAM, if you can afford the latency, or to HBM, if you can afford the cost."
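One way to read the single-port preference: instead of paying for a larger dual-port macro, designers split the storage across several single-port banks and accept occasional bank conflicts. The toy model below (the bank counts and the random access pattern are assumptions for illustration) shows how conflicts fade as the bank count grows.

```python
import random

# Toy model: two requesters sharing banked single-port SRAM.
# Each cycle, each requester targets a random bank; a bank can serve only one
# request per cycle, so the loser stalls. More banks -> fewer conflicts, so
# banked single-port memory approaches dual-port bandwidth at lower area cost.

def conflict_rate(num_banks: int, cycles: int = 100_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    conflicts = 0
    for _ in range(cycles):
        a = rng.randrange(num_banks)
        b = rng.randrange(num_banks)
        if a == b:
            conflicts += 1
    return conflicts / cycles

for banks in (2, 4, 8, 16):
    print(f"{banks:2d} banks: ~{conflict_rate(banks):.1%} of cycles lose bandwidth to a conflict")
```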
Alternative approach: new architectures
To keep improving SRAM's power and performance, many updates beyond the bit-cell design itself have been evaluated and applied, including additional assist circuitry in the SRAM periphery, Yun said.
"The SRAM array and its periphery no longer share their power supply. Instead, dual power rails are adopted so that each can use its most efficient voltage level separately," said Siemens' Yun. "In some designs, the SRAM can enter a sleep mode in which only a minimal voltage is applied to retain the data until the next access from the CPU. That brings a significant power benefit, because leakage current is exponentially related to VDD. Some SRAM designs include extra circuitry to address operating weaknesses, aiming to improve the minimum operating voltage."
For example, high-density (HD) SRAM cells achieve the smallest geometry by using single-fin transistors for all six transistors. However, HD cells struggle with low-voltage operation because of contention between the identically sized pull-up (PU) and pass-gate (PG) transistors during write operations.
"SRAM assist circuits, such as negative bit-line and transient voltage-collapse techniques, are widely adopted to mitigate these issues and enhance low-voltage operation," Yun said. "To mitigate parasitic resistance effects, the latest bit-cell designs use double or triple metal rails merged as bit lines (BL) or word lines (WL). A flying-BL approach selectively connects the metal rails depending on the operation, lowering the effective resistance and equalizing the discharge rate across the top and bottom of the array. In ongoing development, buried power rails are being explored to further reduce routing resistance. This involves placing all power rails beneath the transistors, which also relieves signal-routing congestion above them."
Other memories, other structures
New embedded memory types regularly appear as potential SRAM replacements, but each has its own problems. "The leading contenders, MRAM and ReRAM, occupy only a single transistor's footprint," Yun said. "Although that transistor is larger than the ones in SRAM, their overall cell size is still about one-third of SRAM's, and the finished macro, including peripheral circuits, targets roughly half the size of an SRAM macro. There is a clear size advantage, but write speed is still much slower than SRAM. There have been several promising write-speed and endurance achievements in the lab, but development plans for high-speed MRAM have been pushed back behind production of eFlash-replacement MRAM for automotive. The size advantage for L3-cache replacement is certainly worth considering, but before that can happen, production of eFlash-type MRAM has to ramp up."
If physics won't allow smaller SRAM, the alternative is to rethink the architecture and adopt chiplets. "If SRAM does not scale in N3 or N2, one could combine a more advanced logic chip with an SRAM chip made in an older technology," said imec's Hellings. "This approach would benefit from the improved PPA of the logic while using a cost-effective (older, probably higher-yielding, and cheaper) technology node for the SRAM. In principle, AMD's V-Cache-based systems could see an extension in which only the logic die moves to the next node. The two dies then need to be combined using 3D integration or a chiplet (2.5D) approach."
Ambiq CTO Scott Hanson noted that chiplet solutions fit squarely into the integration revolution already under way. "Analog circuits stopped scaling long ago and, with a few exceptions, don't benefit much from scaling. Memories of all types, from DRAM to SRAM to NVM, tend to be manufactured at different nodes for power, performance, and cost reasons. Logic is more likely to be manufactured at the smallest node that still meets cost and leakage requirements. With multi-chip integration, we build each circuit at its 'ideal' technology node and then combine the chips into a single package. Many people have heard about this in mobile and data-center applications, but it is also happening rapidly in endpoint AI and IoT."
In limited cases, system-technology co-optimization (STCO) can also help. "For some applications, an on-chip cache is in principle not needed," Hellings said. "For example, in AI training the training data is used only once, while the model parameters should be easily accessible on-chip. Software and chip-architecture hooks that facilitate this one-time data movement, bypassing the cache hierarchy, have a lot of potential."
All of this is spurring interest in new layouts and interconnect protocols such as UCIe and CXL. "When you have larger AI workloads, memory scales along with compute, but if one of those components scales a little faster than the other, you hit different bottlenecks depending on how the system is designed," said Ron Lowman, strategic marketing manager at Synopsys. "AI workloads have dramatically increased the number of processor arrays required. They are even pushing past the limits of the die, so now you need high-speed interconnects such as UCIe for die-to-die connections, which means multi-die systems inevitably have to handle AI workloads."
A new stack to address the problem
Winbond has rethought the memory architecture with its CUBE (Customized Ultra-Bandwidth Elements) stack. "We are using DRAM as the storage cell, but we are also doing 3D stacking using through-silicon vias," explained Omar Ma, DRAM marketing manager at Winbond. "Basically, you can provide connections from the bottom substrate all the way up to the SoC die. And because DRAM doesn't use SRAM's six transistors, it is more cost-effective."
CUBE can deliver high enough density to replace SRAM as a level-3 cache. "To reach a given bandwidth requirement, there are only two options: raise the clock speed or increase the number of I/Os," Ma explained. "With CUBE, you can increase the I/O count as much as you want while reducing the clock. That brings many benefits at the system level, including lower power demand." CUBE is currently at the prototype stage, with production expected in the fourth quarter of 2024 or early 2025.
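Ma's two knobs can be written as one product: peak bandwidth is roughly the I/O count times the per-pin data rate. The sketch below shows how a wide, slow interface can match a narrow, fast one; the widths and data rates are illustrative assumptions, not Winbond's CUBE specifications.

```python
# Peak bandwidth ≈ I/O count × data rate per pin (bits/s), divided by 8 for bytes.
# Numbers below are illustrative assumptions, not CUBE or SRAM specifications.

def peak_bandwidth_gbytes(io_count: int, data_rate_mbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a parallel interface."""
    return io_count * data_rate_mbps_per_pin * 1e6 / 8 / 1e9

narrow_fast = peak_bandwidth_gbytes(io_count=64,   data_rate_mbps_per_pin=8000)  # few pins, high clock
wide_slow   = peak_bandwidth_gbytes(io_count=1024, data_rate_mbps_per_pin=500)   # many pins, low clock

print(f"narrow/fast interface: {narrow_fast:.0f} GB/s")   # 64 GB/s
print(f"wide/slow interface:   {wide_slow:.0f} GB/s")     # 64 GB/s, at a much lower clock
```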
Conclusion
In the short term, pragmatism will likely win out over drastic design changes. "It's going to be incremental," said Flex Logix's Tate. "It's not going to be dramatic. When designers debate how big a cache they should have, it's going to be a balance between performance and price. If SRAM gets more expensive, they'll decide they can live with less than before, and they'll give up some performance somewhere else. Maybe they'll make up for it by having more DRAM bandwidth. For now, it's going to be an incremental trade-off. You won't see completely different architectures any time soon. But if the trend continues, it will lead people to consider completely different approaches."
As for SRAM being completely replaced, that seems unlikely, at least in the short term. "A few years ago, Intel demonstrated the use of ferroelectric memory for cache," said Jim Handy, general manager of Objective Analysis. "They call it DRAM, but let's be honest, it's FRAM. They say the advantage is that they're able to use 3D NAND technology to make it very compact. In other words, they're showing a very small space with a lot of memory in it. It's very possible that one of these efforts, either something like what Intel demonstrated or another approach like MRAM, will eventually take SRAM's place, but that probably won't happen anytime soon."
When it does happen, Handy expects it will bring changes to both the architecture and the operating-system software. "You're unlikely to see the same processor having both an SRAM cache and a ferroelectric cache, because the software would have to go through some changes to take advantage of that," he said. "Also, the cache will be structured differently. The main cache may shrink a bit, and the side cache will get really big. At some point the last processor with an SRAM cache will come out, and the next processor will have a ferroelectric or MRAM cache, or something like that, with major modifications to the software to make that configuration work better."
END