Broadcom launches major product to lead the PCI-e switch and retimer market

Publisher: 心怀梦想 | Latest update: 2024-03-07 | Source: nextplatform | Keywords: Broadcom

Server design would be much better off if we had a two-year or even four-year moratorium on adding faster compute engines to machines, giving the memory and I/O subsystems time to catch up and take better advantage of those compute engines with fewer memory banks and I/O cards.


In fact, the move from 8 Gb/sec PCI-Express 3.0 (whose specification was released in 2010) to 16 Gb/sec PCI-Express 4.0 interconnect (which was supposed to be released in 2013) was delayed by four years, entering the field only in 2017. The main reason was the impedance mismatch between the I/O bandwidth that the compute engines really needed and the bandwidth that the PCI-Express interconnect could provide.


This mismatch has persisted, leaving PCI-Express perpetually behind. That, in turn, has forced companies to build proprietary interconnects for their accelerators instead of relying on general-purpose PCI-Express, which would have opened up server designs and leveled the I/O playing field. For example, Nvidia created the NVLink port, then the NVSwitch switch, and then the NVLink Switch fabric to stitch together memory across GPU clusters and, ultimately, to connect GPUs to its "Grace" Arm server CPU. AMD likewise created the Infinity Fabric interconnect to connect CPUs together and then CPUs to GPUs; that same interconnect is also used inside its CPUs to link chiplets.


We've heard that Intel was hesitant on PCI-Express 4.0 after having issues with the integrated PCI-Express 3.0 controllers on some of its Xeon processors more than a decade ago. But it's fair to acknowledge that the transition to PCI-Express 4.0 had other technical issues to address, just as the Ethernet roadmap did above 10 Gb/s: Ethernet couldn't jump directly to 100 Gb/s, taking a step-by-step approach through 40 Gb/s before hyperscalers and cloud builders (and chip vendors including Broadcom and Mellanox) convinced the IEEE to adopt cheaper 25 Gb/s channel signaling.

[Image: PCI-Express generational roadmap]

Things happen, and the PCI-Express roadmap is one of them. As you saw from our coverage of the start of work on the PCI-Express 7.0 specification last year:


We believe that the cadence of the PCI-Express roadmap for peripheral cards, retimers, and switches needs to match the cadence of compute engine releases, and judging by the spec, we do need PCI-Express 7.0, which won't even be ratified until next year. But given that PCI-Express 6.0 is the first generation to use PAM-4 signaling and low-latency FLIT encoding, it would be possible to jump directly from PCI-Express 5.0, with its well-established NRZ signaling, to the faster PAM-4/FLIT combination the roadmap is heading toward.


No one knows the bus better than Broadcom, which, thanks to Avago's acquisition of PLX Technology in June 2014 and of Broadcom in May 2015, makes PCI-Express switches as well as the retimers that extend the reach of the copper wires plugged into them. The company is preparing its "Atlas 3" generation of PCI-Express switches and retimers, which are based on the "Talon 5" family of SerDes that implement PAM-4 signaling. The Talon 5 SerDes is related to, but different from, the "Peregrine" PAM-4 SerDes used in the "Tomahawk 5" and "Jericho 3-AI" families of Ethernet switch ASICs, because PCI-Express is an absolutely lossless protocol and therefore has stricter low-latency requirements.


To help server makers and peripheral manufacturers move in the same direction, Broadcom began publishing its PCI-Express switch and retimer roadmap.

[Image: Broadcom PCI-Express switch and retimer roadmap]

Interestingly, Broadcom was going to exit the retimer business but was pulled back into it by its customers and partners, and technically this story is about revealing some details about the Vantage 5 and 6 series of PCI-Express retimers.


Jas Tremblay, vice president and general manager of Broadcom’s Data Center Solutions Group, told The Next Platform: “We always expected retimers to be a companion chip to switches. We believed PCI-Express Gen 5 retimers would be a commodity product and that there would be three or four vendors that would successfully bring these products to market. So we focused all our efforts on switches and other higher complexity PCI-Express 5.0 products. But we were completely wrong. Customers are coming back to us because retimers are harder than anyone thought. Of course, we have to make sure the switches and retimers work and they are very reliable, but we actually have to make sure it’s instrumented so that we can help system providers and cloud providers pinpoint what’s happening in the equipment.”


Retimers are an increasingly important part of the PCI-Express hardware and firmware stack. First, servers are becoming more complex than they were a decade ago, when PCI-Express 3.0 ruled the world and all we needed was more lanes, not faster lanes (according to Intel). Take a look at the difference between a general-purpose server and an AI server when it comes to the PCI-Express interconnect inside a node:

[Image: PCI-Express topology of a general-purpose server versus an AI server]

We've gone from point-to-point interconnects (hanging off the PCI-Express bus) with a handful of devices (a disk controller or two, a network controller, and maybe a few other specialized peripherals) to what amounts to a PCI-Express switch fabric connecting the CPU to accelerators, network interfaces, flash memory, and, soon, CXL expansion memory. Perhaps one day we might see CPUs linked by PCI-Express links running the CXL protocol instead of a proprietary NUMA interconnect, or, more likely, a proprietary overlay running on top of PCI-Express/CXL, as it looks like AMD and Broadcom are working on for future CPUs and GPUs.


But there is another problem, and this is where the retimer comes in.


Every time the signaling rate doubles, the distance a PCI-Express signal can travel over copper wire is roughly halved. Retimers are used to extend the length of the copper wire; the longer the wire, the more retimers are needed. Because of latency concerns, and because PCI-Express acts as an extension of the CPU bus, it is important not to let the cables get too long. But if you want to extend a PCI-Express fabric across multiple racks, or even a whole row of racks, the need for retimers will only grow as PCI-Express bandwidth climbs.
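As a back-of-the-envelope sketch of that tradeoff: the 20-inch Gen 3 baseline reach and the one-retimer-per-reach-segment model below are illustrative assumptions for the example, not figures from the PCI-Express specification.

```python
import math

# Illustrative model: copper reach roughly halves each time the per-lane
# signaling rate doubles (Gen 3 = 8 GT/s, Gen 4 = 16 GT/s, and so on).
# The 20-inch Gen 3 baseline is an assumed number, not a spec value.

def unaided_reach_inches(gen: int, gen3_reach: float = 20.0) -> float:
    """Approximate copper reach for a given PCI-Express generation."""
    doublings = gen - 3
    return gen3_reach / (2 ** doublings)

def retimers_needed(trace_inches: float, gen: int) -> int:
    """Each retimer restarts the signal budget, so a trace spanning N
    reach-segments needs N - 1 retimers."""
    segments = math.ceil(trace_inches / unaided_reach_inches(gen))
    return max(0, segments - 1)

for gen in (3, 4, 5, 6):
    print(f"Gen {gen}: reach ~{unaided_reach_inches(gen)} in, "
          f"{retimers_needed(20.0, gen)} retimer(s) for a 20 in trace")
```

Under these assumptions the same 20-inch trace that needed no retimer at Gen 3 speeds needs seven at Gen 6 speeds, which is the dynamic driving retimer demand.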


Broadcom's latest Vantage 5 and Vantage 6 retimers add only 6 nanoseconds of latency while extending the reach of PCI-Express 6.0 signals, which seems a fairly low overhead considering that such a PCI-Express fabric could eliminate the need for InfiniBand or Ethernet inside the node. Broadcom, a founding member of the Ultra Ethernet Consortium, is committed to making Ethernet better than InfiniBand, and its Jericho 3-AI deep-buffer switch ASIC is the first step in that effort; the idea is that PCI-Express switching handles connectivity inside the server and across the node while Ethernet spans the rack and the row. But much depends on the scale of retimers and switches used to build the cluster, as well as the overall cost of a PCI-Express fabric compared to InfiniBand and Ethernet.


The internal structure of a modern AI server looks like this:

[Image: Block diagram of the Grand Teton AI server]

This is a block diagram of the "Grand Teton" AI server, which Meta Platforms launched as an Open Compute Project design back in October 2022.


In this Grand Teton server, a pair of PCI-Express switches connects the two CPUs to the Ethernet NICs, flash memory, and CXL memory, with retimers to extend the connections between the peripherals and the switches. Another pair of PCI-Express ASICs interconnects the eight GPUs so that they can share memory. Grand Teton is based on PCI-Express 5.0 silicon, namely the Vantage retimers and the Atlas 2 switch ASICs. Each of those switch ASICs has 144 PCI-Express 5.0 lanes, good for an aggregate bandwidth of roughly 576 GB/sec across all of those devices.
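The lane arithmetic behind that aggregate figure can be sketched as follows. The 32 GT/s rate and 128b/130b line code are the standard PCI-Express 5.0 parameters; the helper names are our own, and the lane count is the one quoted for the switch ASIC above.

```python
# Rough per-lane and aggregate bandwidth arithmetic for a PCIe Gen 5 switch.
# A Gen 5 lane signals at 32 GT/s with 128b/130b encoding, so it carries
# just under 4 GB/s of payload in each direction.

GT_PER_SEC = 32          # PCIe 5.0 per-lane signaling rate
ENCODING = 128 / 130     # 128b/130b line-code efficiency

def lane_gb_per_sec() -> float:
    """Usable unidirectional bandwidth of one Gen 5 lane, in GB/s."""
    return GT_PER_SEC * ENCODING / 8   # bits -> bytes

def switch_aggregate_gb_per_sec(lanes: int) -> float:
    """Aggregate unidirectional bandwidth across all lanes of the switch."""
    return lanes * lane_gb_per_sec()

print(round(lane_gb_per_sec(), 2))              # per lane, GB/s
print(round(switch_aggregate_gb_per_sec(144)))  # across 144 lanes, GB/s
```

At about 3.94 GB/sec per lane, 144 lanes work out to roughly 567 GB/sec per direction, in line with the aggregate figure above.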


Here is a different (and perhaps more accurate) block diagram showing how switches and retimers are used in the Grand Teton system, taken from its OCP specification:

[Image: Grand Teton switch and retimer topology, from the OCP specification]

These specifications were not yet available when the system launched in October 2022.


As you can see, the retimers are actually used to extend the link between the PCI-Express switch and the GPU, while the other peripherals are linked directly to the PCI-Express switch. This is not exactly what the Broadcom diagram implies. The number of switches and retimers is the same, but the topology is different. Also, the NVSwitch interconnect is still used to link the GPUs to each other, although there is a secondary PCI-Express 4.0 switch that connects the GPU to one of the PCI-Express 5.0 switches, perhaps as a management interconnect or as a means to send data back to the CPU without going back through the retimer. It's an interesting diagram.


Back to the retimers. Here are the salient features of the Vantage 5 and Vantage 6 retimers used with the PCI-Express 5.0 ("Atlas 2") and PCI-Express 6.0 ("Atlas 3") switch ASICs:

[Image: Feature comparison of the Vantage 5 and Vantage 6 retimers]

Because they drive 64 Gb/sec PAM-4 signaling, the Vantage 6 retimers run slightly hotter than the Vantage 5 retimers, which only perform 32 Gb/sec NRZ signaling. Both retimers use the same Talon 5 SerDes, which supports either signaling method, and both Vantage chips are built on TSMC's 5nm process.
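The reason one SerDes can serve both chips falls out of the modulation arithmetic: PAM-4 packs two bits into each symbol, so the symbol rate is unchanged. A minimal sketch (function names are our own):

```python
import math

# NRZ distinguishes 2 voltage levels (1 bit/symbol); PAM-4 distinguishes 4
# (2 bits/symbol). So a 64 Gb/s PAM-4 lane runs at the same 32 Gbaud symbol
# rate as a 32 Gb/s NRZ lane -- the data-rate jump comes from denser symbols,
# not a faster clock.

def bits_per_symbol(levels: int) -> int:
    """log2 of the number of voltage levels the modulation distinguishes."""
    return int(math.log2(levels))

def symbol_rate_gbaud(data_rate_gbps: float, levels: int) -> float:
    """Symbol (baud) rate needed to carry a given data rate."""
    return data_rate_gbps / bits_per_symbol(levels)

print(symbol_rate_gbaud(32, 2))   # NRZ,   PCIe 5.0: 32.0 Gbaud
print(symbol_rate_gbaud(64, 4))   # PAM-4, PCIe 6.0: 32.0 Gbaud
```

The higher heat of the Vantage 6 comes from the analog front end having to resolve four closely spaced levels rather than two, not from running the SerDes clock any faster.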


It’s unclear why Vantage 6 doesn’t have channel performance specs when connected to Broadcom SerDes using 64 Gb/sec PAM-4 signaling. Perhaps Broadcom is holding back on that information for now. Clearly Broadcom wants to provide customers with end-to-end connectivity and even wants to try to replace NVSwitch in some designs, as hyperscalers, cloud builders, and HPC centers around the world want to do once PCI-Express can do the job.


Tremblay said that when the Talon 5 SerDes is used on both the retimers and the switches, the combination of its chips can extend reach by 40%, delivering a signal that is 12 dB better than what the PCI-Express 5.0 specification requires. The Talon 5 SerDes architecture, combined with a 5-nanometer process (compared to the 7-nanometer process used by competing PCI-Express switches and retimers), also cuts power consumption by 50%.
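Since that claim is stated in decibels, it is worth remembering that dB is a logarithmic power ratio, so "12 dB better" compounds multiplicatively. A quick sketch of the standard conversion (the function name is our own):

```python
# dB is a log-scale power ratio: ratio = 10^(dB / 10).
# Every 3 dB roughly doubles the power margin, so 12 dB of extra
# channel budget is nearly a 16x power ratio, not a 12% improvement.

def db_to_power_ratio(db: float) -> float:
    """Convert a decibel figure to a linear power ratio."""
    return 10 ** (db / 10)

print(round(db_to_power_ratio(3), 1))   # ~2.0x
print(round(db_to_power_ratio(12), 1))  # ~15.8x
```

That is why a 12 dB margin over the spec requirement translates into meaningfully longer traces and cables rather than a marginal gain.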
