Server design would be much better off if we had a two-year or even four-year moratorium on faster compute engines, so that the memory and I/O subsystems could catch up, take better advantage of those compute engines, and meet their needs with fewer memory banks and I/O cards.
In fact, the move from the 8 GT/sec PCI-Express 3.0 interconnect (whose specification was released in 2010) to the 16 GT/sec PCI-Express 4.0 interconnect (which was supposed to arrive in 2013) slipped by four years, only entering the field in 2017. The main reason was the impedance mismatch between the I/O bandwidth that compute engines really needed and the bandwidth that the PCI-Express interconnect could provide.
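To put numbers on that bandwidth gap, here is a small sketch of the per-lane transfer rates from the published PCI-Express specifications and the usable per-direction bandwidth of a x16 slot they imply (the encoding-efficiency figures are from the specs, not from this article; PCI-Express 6.0 FLIT overhead is ignored for simplicity):

```python
# Per-lane raw rate in GT/s and the line encoding efficiency for each
# PCI-Express generation (Gen 3 through 5 use 128b/130b encoding; Gen 6
# delivers an effective 64 Gb/s per lane via PAM-4, FLIT overhead ignored).
GENERATIONS = {
    3: (8.0,  128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
    6: (64.0, 1.0),
}

def x16_bandwidth_gbytes(gen: int) -> float:
    """Approximate per-direction payload bandwidth of a x16 link, in GB/s."""
    rate_gt, efficiency = GENERATIONS[gen]
    return rate_gt * efficiency * 16 / 8  # 16 lanes, 8 bits per byte

for gen in GENERATIONS:
    print(f"PCIe {gen}.0 x16: ~{x16_bandwidth_gbytes(gen):.0f} GB/s per direction")
```

Each generation roughly doubles the previous one, which is exactly why a four-year slip between generations left the compute engines starved.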
This mismatch has persisted, leaving PCI-Express perpetually behind. That in turn forced companies to build their own interconnects for their accelerators instead of relying on the general-purpose PCI-Express interconnect, which would have opened up server designs and leveled the I/O playing field. For example, Nvidia created the NVLink port, then the NVSwitch chip, and then the NVLink Switch fabric to lash the memory of GPU clusters together, and ultimately to connect the GPU to its "Grace" Arm server CPU. AMD likewise created the Infinity Fabric interconnect to link CPUs to each other and then CPUs to GPUs; the same interconnect standard is also used inside its CPUs to connect chiplets.
We've heard that Intel was hesitant about PCI-Express 4.0 after issues with the integrated PCI-Express 3.0 controllers on some of its Xeon processors more than a decade ago, but it's fair to acknowledge that the transition to PCI-Express 4.0 had other technical hurdles to clear as well. The Ethernet roadmap similarly stalled above 10 Gb/sec and could not jump directly to 100 Gb/sec, taking a step-by-step approach through 40 Gb/sec before hyperscalers and cloud builders (and chip vendors including Broadcom and Mellanox) convinced the IEEE to adopt cheaper 25 Gb/sec channel signaling.
Stuff happens, and the PCI-Express roadmap is one of those things. As you saw in our coverage of the start of work on the PCI-Express 7.0 specification last year:
We believe that the cadence of the PCI-Express roadmap for peripheral cards, retimers, and switches needs to match the cadence of compute engine releases, and on paper we do need PCI-Express 7.0, which won't even be ratified until next year. But given that PCI-Express 6.0 is the first generation to use PAM-4 signaling and low-latency FLIT encoding, it would have been possible to jump directly from PCI-Express 5.0, with its well-established NRZ signaling, to that faster PAM-4/FLIT combination.
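The jump from NRZ to PAM-4 is worth a quick worked example. NRZ carries one bit per symbol (two voltage levels), while PAM-4 carries two bits per symbol (four levels), so the bit rate doubles without doubling the channel's symbol rate. A minimal sketch of that relationship (the 32 GBaud figure matches the PCI-Express 5.0/6.0 signaling rates; the function is my illustration, not anything from the spec):

```python
import math

def bit_rate_gbps(symbol_rate_gbaud: float, levels: int) -> float:
    """Bit rate = symbol rate * bits per symbol, where bits per
    symbol is log2 of the number of amplitude levels."""
    return symbol_rate_gbaud * math.log2(levels)

# PCIe 5.0: 32 GBaud with NRZ (2 levels)  -> 32 Gb/s per lane.
# PCIe 6.0: 32 GBaud with PAM-4 (4 levels) -> 64 Gb/s per lane.
print(bit_rate_gbps(32, 2))
print(bit_rate_gbps(32, 4))
```

The catch is that packing four levels into the same voltage swing shrinks the eye height, which is why PAM-4 needs forward error correction and, in turn, the FLIT encoding that keeps that correction's latency low.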
No one knows the bus better than Broadcom, which, thanks to Avago's acquisition of PLX Technology in June 2014 and of Broadcom in May 2015, makes PCI-Express switches as well as the retimers that extend the reach of the copper wires plugged into them. The company is preparing its "Atlas 3" generation of PCI-Express switches and retimers, which are based on the "Talon 5" family of SerDes that implement PAM-4 signaling. The Talon 5 SerDes is related to, but distinct from, the "Peregrine" PAM-4 SerDes used in the "Tomahawk 5" and "Jericho 3-AI" families of Ethernet switch ASICs, because PCI-Express is an absolutely lossless protocol and therefore has stricter low-latency requirements.
To help server makers and peripheral manufacturers move in the same direction, Broadcom began publishing its PCI-Express switch and retimer roadmap.
Interestingly, Broadcom had intended to exit the retimer business but was pulled back into it by its customers and partners, and this story is, technically, about the details Broadcom is now revealing of its Vantage 5 and Vantage 6 series of PCI-Express retimers.
Jas Tremblay, vice president and general manager of Broadcom’s Data Center Solutions Group, told The Next Platform: “We always expected retimers to be a companion chip to switches. We believed PCI-Express Gen 5 retimers would be a commodity product and that there would be three or four vendors that would successfully bring these products to market. So we focused all our efforts on switches and other higher complexity PCI-Express 5.0 products. But we were completely wrong. Customers are coming back to us because retimers are harder than anyone thought. Of course, we have to make sure the switches and retimers work and they are very reliable, but we actually have to make sure it’s instrumented so that we can help system providers and cloud providers pinpoint what’s happening in the equipment.”
Retimers are an increasingly important part of the PCI-Express hardware and firmware stack. First, servers are far more complex than they were a decade ago, when PCI-Express 3.0 ruled the world and (according to Intel) all we needed was more lanes, not faster lanes. Take a look at the difference between a general-purpose server and an AI server when it comes to the PCI-Express interconnect inside a node:
We've gone from point-to-point interconnects (hanging off the PCI-Express bus) with a handful of devices (a disk controller or two, a network controller, and maybe a few other specialized peripherals) to what amounts to a PCI-Express switch fabric connecting the CPU to accelerators, network interfaces, flash memory, and, soon, CXL expansion memory. Perhaps one day we might see CPUs linked by PCI-Express links running the CXL protocol instead of a proprietary NUMA interconnect, or, more likely, a proprietary overlay running on top of PCI-Express/CXL, as it looks like AMD and Broadcom are working on for future CPUs and GPUs.
But there is another problem, and this is where the retimer comes in.
Every time the bandwidth doubles, the distance a PCI-Express signal can travel over a copper wire is roughly halved. Retimers are used to extend the reach of that copper: the longer the run, the more retimers are needed. Because of latency concerns, and because PCI-Express acts as an extension of the CPU bus, it is important not to let the cables get too long. But if you want to extend a PCI-Express fabric across multiple racks, or even a whole row of racks, the need for retimers will only grow as PCI-Express bandwidths get higher.
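The reach-versus-retimer tradeoff above can be sketched with some back-of-the-envelope arithmetic. The reach figures below are placeholders for illustration, not Broadcom or PCI-SIG specs; the point is only that halving the unaided reach per generation multiplies the retimer count for a fixed physical distance:

```python
import math

def retimers_needed(distance_cm: float, reach_per_segment_cm: float) -> int:
    """Number of retimers for a copper run of the given length, assuming
    one retimer between each pair of consecutive maximum-reach segments."""
    segments = math.ceil(distance_cm / reach_per_segment_cm)
    return max(segments - 1, 0)

# Hypothetical numbers: a 100 cm run with ~50 cm of unaided reach needs
# one retimer; if the next generation halves the reach to ~25 cm, the
# same physical run needs three.
print(retimers_needed(100, 50))  # 1
print(retimers_needed(100, 25))  # 3
```

This is why retimer volume scales faster than switch volume as bandwidth climbs: the switches stay put, but every long link sprouts more hops.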
Broadcom's latest Vantage 5 and Vantage 6 retimers add only 6 nanoseconds of latency while extending the reach of PCI-Express 6.0 signals, which seems a fairly low overhead if such a PCI-Express fabric is to eliminate the need for an InfiniBand or Ethernet tier. Broadcom, as a founding member of the Ultra Ethernet Consortium, is committed to making Ethernet better than InfiniBand, and its Jericho 3-AI deep-buffer switch ASIC is the first step in that effort; the idea is that PCI-Express switching connects devices within the node while Ethernet spans the rack and the row. But much depends on the tiers of retimers and switches used to build the cluster, as well as the overall cost of a PCI-Express fabric compared to InfiniBand and Ethernet.
The internal structure of a modern AI server looks like this:
This is a block diagram of the "Grand Teton" AI server, which Meta Platforms launched as an Open Compute Project design back in October 2022.
In this Grand Teton server, there is a pair of PCI-Express switches that connect the two CPUs to the Ethernet NICs, flash memory, and CXL memory, with retimers to increase the connection length between peripherals and switches. There is another pair of PCI-Express switch ASICs that interconnect the eight GPUs so that they can share memory. The Grand Teton design is based on PCI-Express 5.0 parts, namely the Vantage 5 retimer and Atlas 2 switch ASICs. Each of these switch ASICs has 144 PCI-Express 5.0 lanes, good for an aggregate bandwidth of 576 GB/sec across all devices.
Here is a different (and perhaps more accurate) block diagram showing how switches and retimers are used in the Grand Teton system, taken from its OCP specification:
This specification was not yet available back in October 2022, when the machine was launched.
As you can see, the retimers are actually used to extend the link between the PCI-Express switch and the GPU, while the other peripherals are linked directly to the PCI-Express switch. This is not exactly what the Broadcom diagram implies. The number of switches and retimers is the same, but the topology is different. Also, the NVSwitch interconnect is still used to link the GPUs to each other, although there is a secondary PCI-Express 4.0 switch that connects the GPU to one of the PCI-Express 5.0 switches, perhaps as a management interconnect or as a means to send data back to the CPU without going back through the retimer. It's an interesting diagram.
Back to the retimers. Here are the salient features of the Vantage 5 and Vantage 6 retimers used with the PCI-Express 5.0 ("Atlas 2") and PCI-Express 6.0 ("Atlas 3") switch ASICs:
Because they drive 64 Gb/sec PAM-4 signaling, the Vantage 6 retimers run slightly hotter than the Vantage 5 retimers, which only perform 32 Gb/sec NRZ signaling. Both retimers use the same Talon 5 SerDes, which supports either signaling method, and both Vantage chips are built on TSMC's 5nm process.
It’s unclear why Vantage 6 doesn’t have channel performance specs when connected to Broadcom SerDes using 64 Gb/sec PAM-4 signaling. Perhaps Broadcom is holding back on that information for now. Clearly Broadcom wants to provide customers with end-to-end connectivity and even wants to try to replace NVSwitch in some designs, as hyperscalers, cloud builders, and HPC centers around the world want to do once PCI-Express can do the job.
Tremblay said that when the Talon 5 SerDes is used on both retimers and switches, the combination of its chips can extend reach by 40 percent, providing a signal that is 12 dB better than what the PCI-Express 5.0 specification requires. The Talon 5 SerDes architecture, combined with a 5-nanometer process (compared to the 7-nanometer process of competing PCI-Express switches and retimers), also cuts power consumption by 50 percent.