To unleash the potential of large AI models, hardware computing power urgently needs to break through the interconnection bottleneck.
With the emergence of ChatGPT, the potential of pre-trained large models to transform thousands of industries has been fully demonstrated. Some industry leaders have even hailed it as the "iPhone moment" of artificial intelligence and predicted that it is "just the start of something greater."
Why is ChatGPT so “different”?
Borrowing a summary attributed to Harvard professor Venky Narayanamurti and the framing of the Technology Acceptance Model (TAM): usefulness and ease of use are the two basic prerequisites for the diffusion of an emerging technology. By this standard, the 2016 five-game Go match between AlphaGo and Lee Sedol completed the public's education in the "usefulness" of artificial intelligence, while ChatGPT marks the moment when the other necessary condition for the spread of AI technology, ease of use, quietly took hold in the public mind.
On this basis, there is reason for optimism that the artificial intelligence industry now stands at a new starting point in the grand blueprint envisioned by the industry's giants.
Is computing power trapped in the interconnection?
In many "hindsight" interpretations, the GPT family is often associated with the Transformer model launched by Google in 2017.
The Transformer, which is based on the self-attention mechanism, and Google's later BERT, with its best-in-class performance and remarkable generalization across text tasks, can indeed be regarded as the foundation for GPT's pre-training technology and engineering methods. Standing on the shoulders of giants, OpenAI, the developer of GPT, made the final leap with greater agility and stronger execution.
Looking further, the Transformer uses hardware computing power more efficiently than earlier deep learning models such as MLP and LSTM.
On this point, Richard Sutton, a pioneer of reinforcement learning who was then at DeepMind, wrote in his 2019 essay "The Bitter Lesson" that the biggest lesson from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason is Moore's Law, or rather its generalization: the continuing exponential decline in the cost per unit of computation. Most AI research is conducted as if the available computing power were constant (in which case leveraging human knowledge would be the only way to improve performance). Over a slightly longer period than a typical research project, however, far more computing power becomes available, and in the long run the only thing that matters is leveraging computation.
As Sutton predicted, AI hardware computing power has made rapid progress in recent years. In addition to the well-known GPU, AI accelerators such as Google TPU (TPUv2 was reportedly used to train the GPT series in its early days) and Microsoft Catapult often dispense with the advanced control mechanisms of traditional general-purpose CPU microarchitectures, such as out-of-order execution and prefetching. Instead, they optimize the design of the multiply-accumulate (MAC) units at the core of convolutional neural networks to fully exploit the parallel computing capabilities of SIMD architectures.
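To make the role of the multiply-accumulate unit concrete, the following is a minimal Python sketch (not taken from any accelerator's documentation) that writes a 1D convolution as nothing but repeated MAC operations. The function name `conv1d_mac` and the sample values are hypothetical, chosen only for illustration.

```python
def conv1d_mac(signal, kernel):
    """Valid-mode 1D convolution in the sliding dot-product form used by CNNs.

    Every output element is built from len(kernel) multiply-accumulate (MAC)
    steps, which is the operation AI accelerators replicate in hardware.
    """
    out_len = len(signal) - len(kernel) + 1
    outputs = []
    mac_count = 0
    for i in range(out_len):
        acc = 0.0
        for j, weight in enumerate(kernel):
            acc += signal[i + j] * weight  # one multiply-accumulate
            mac_count += 1
        outputs.append(acc)
    return outputs, mac_count


# Illustrative values only (assumed, not from the article).
outputs, macs = conv1d_mac([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0])
print(outputs, macs)
```

The inner loop carries no data-dependent branching, and each output element is computed independently of the others, so the work maps naturally onto wide SIMD lanes or arrays of MAC units.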
Advances in algorithms, computing power, and their combined engineering methods ultimately laid the foundation for OpenAI to “make miracles happen with great force.”
It is widely expected that, under the star effect of OpenAI, global technology giants will launch a series of GPT-like pre-trained large models in the next year or two, which should further accelerate investment in data center AI computing clusters.
However, it is worth noting that although major chip manufacturers are rushing to launch AI acceleration chips and computing power figures keep setting new records, pre-trained large models often have tens of billions, hundreds of billions, or even trillions of parameters. Training them is still far beyond what one or two GPUs or other AI accelerator cards can handle. It often requires interconnecting multiple processors over a network, and even forming an HPC computing cluster with pooled scheduling of computing resources, in order to support distributed, parallelized training of large AI models. When evaluating training efficiency, the total training time for a single batch of data is often significantly affected by communication time.
For this reason, as large AI models open up new room for imagination, computing cluster infrastructure will also see an investment boom. Among the engineering challenges it faces, such as power distribution, heat dissipation, and communications, data transmission between compute nodes is a particularly critical "bottleneck" that restricts the full release of hardware computing power.
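How strongly communication can weigh on a training step can be estimated with a back-of-the-envelope model. The sketch below uses the standard bandwidth term of a ring all-reduce, in which each worker sends and receives roughly 2(N-1)/N times the gradient payload. It ignores per-message latency and any overlap of computation with communication, and every number in the example (worker count, model size, link speed, compute time) is an assumption chosen for illustration, not a figure from this article.

```python
def ring_allreduce_seconds(num_workers, payload_bytes, link_bytes_per_s):
    """Bandwidth-only estimate of ring all-reduce time.

    Each worker transfers about 2 * (N - 1) / N times the payload;
    latency terms are deliberately ignored in this simplified model.
    """
    if num_workers < 2:
        return 0.0
    volume = 2 * (num_workers - 1) / num_workers * payload_bytes
    return volume / link_bytes_per_s


def comm_fraction(compute_s, comm_s):
    """Share of a step spent communicating, assuming no compute/comm overlap."""
    return comm_s / (compute_s + comm_s)


# Assumed scenario: 8 workers, 1e9 parameters in 2-byte precision,
# 100 Gb/s links (12.5e9 bytes/s), 0.5 s of compute per step.
grad_bytes = 1_000_000_000 * 2
comm_s = ring_allreduce_seconds(8, grad_bytes, 12.5e9)
print(f"all-reduce: {comm_s:.3f} s, "
      f"communication share: {comm_fraction(0.5, comm_s):.1%}")
```

Doubling the link bandwidth in this model halves the communication term while leaving the compute term untouched, which is why interconnect speed rather than raw accelerator throughput can become the limiting factor.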
The "key pivot" to break interconnection bottlenecks
The amount of data handled by AI training and inference has grown exponentially. Whether the link is between multiple GPUs in a single server, chip-to-chip (C2C) between CPUs, or a network between multiple servers, data transmission generally demands high bandwidth and low latency.
As the communication network topologies of computing clusters converge, switch interfaces have increasingly become the place to attack these "bottlenecks," giving rise to engineering approaches such as raising network card speeds, increasing the number of network cards, and even applying RDMA for direct network connection.
At the level of underlying interface technology, SerDes serial interfaces have become mainstream over traditional parallel interfaces thanks to their significant cost advantages. New standards such as PCIe 6.0 further introduce PAM4 (four-level pulse amplitude modulation) encoding in the physical layer to raise SerDes data transfer rates again.
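The reason PAM4 raises throughput is that each symbol carries two bits instead of one. The sketch below illustrates this with a Gray-coded bit-to-level mapping, a common convention in which adjacent levels differ by a single bit. The normalized levels (-3, -1, +1, +3) and the function names are assumptions for illustration; the sketch does not reproduce the precoding or framing of PCIe 6.0 or any other standard.

```python
# Gray-coded mapping: neighbouring levels differ in exactly one bit, so a
# one-level decision error at the receiver corrupts only a single bit.
GRAY_PAM4 = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}


def pam4_encode(bits):
    """Map a bit sequence to PAM4 symbols, two bits per symbol."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def nrz_encode(bits):
    """Map a bit sequence to two-level NRZ symbols, one bit per symbol."""
    return [1 if b else -1 for b in bits]


bits = [0, 0, 0, 1, 1, 1, 1, 0]
print(len(nrz_encode(bits)), "NRZ symbols vs", len(pam4_encode(bits)), "PAM4 symbols")
```

For the same symbol rate on the wire, the PAM4 stream therefore moves twice as many bits as NRZ.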
However, there are naturally many technical challenges in the application of SerDes, the most serious of which is undoubtedly the signal integrity (SI) issue.
For example, in medium-reach and long-reach interconnects across backplanes, connectors, and PCBs, the TX and RX ends of a high-speed SerDes link are often separated by pins, PCB vias, signal traces, and even connectors and cables. Materials, manufacturing processes, layout, and other factors introduce noise, crosstalk, and signal attenuation, so the electrical signal that finally reaches the receiver may be severely distorted. This makes it difficult to recover the clock and data bits of the transmitted information, and it limits the achievable speed and distance in the design space.
The new generation of 56G and 112G SerDes uses PAM4 encoding, which provides greater network throughput but also introduces more signal levels. The resulting signal-to-noise ratio loss, bit error rate (BER) deterioration, and added forward error correction (FEC) latency all require careful trade-offs.
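The signal-to-noise cost of adding levels can be approximated with a simple idealized calculation. If an L-level signal uses the same peak-to-peak swing as a two-level signal, each eye opening shrinks to 1/(L-1) of the original, which corresponds to a penalty of 20*log10(L-1) dB. The sketch below assumes identical swing and noise in both cases and ignores equalization, jitter, and nonlinearity; the function name is hypothetical.

```python
import math


def pam_snr_penalty_db(levels):
    """Idealized eye-height penalty of L-level PAM relative to 2-level NRZ.

    Assumes the same peak-to-peak swing and the same noise in both cases.
    """
    if levels < 2:
        raise ValueError("need at least two signal levels")
    return 20.0 * math.log10(levels - 1)


print(f"PAM4 penalty vs NRZ: {pam_snr_penalty_db(4):.2f} dB")
```

Under these assumptions PAM4 gives up roughly 9.5 dB compared with NRZ, which is one reason links at these rates lean on FEC and accept the latency it adds.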
It is not difficult to see from the above analysis that in order to fully utilize the computing power of AI hardware, interface technology is a key fulcrum to break the interconnection bottleneck and has a huge leverage effect, while its application must solve many challenges surrounding signal integrity.
At present, although many hardware manufacturers employ full-time SI engineers for debugging, the results depend on individual "craftsmanship," which varies widely. Because signal integrity must be ensured at every level of chip and system design, only a highly skilled design team with deep experience in analog electronics can attempt this type of design, and the test and verification cycle is long.
For this reason, licensing high-speed interface IP from outside suppliers has almost become a "must-have option" in chip design, which has also made interface IP the fastest-growing segment of the IP market in recent years. According to forecasts from industry analysts, interface IP is even expected to surpass CPU IP by 2025 and become the largest semiconductor IP category.
Give a man a fish, and teach him how to fish
The market opportunity in interface IP has also made it a focus of competition among major IP vendors. Cadence, for its part, has launched a 112G SerDes IP design aimed at the SoC requirements of high-performance computing (HPC) data centers. It is suitable for long-reach and medium-reach transmission, has been silicon-proven on a 7nm process, and delivers excellent PPA (power, performance, area) while handling insertion loss of more than 35 dB.
It is worth mentioning that licensed interface IP is only the starting point on the path from chip to system development and signal integrity testing. Whether the tools supporting the workflow are complete and accessible is also an important factor in the development cycle. In other words, an interface IP supplier must not only give customers the fish, but also teach them how to fish.
As a giant in the EDA and IP field, Cadence is particularly representative in this respect. In addition to mature interface IP such as SerDes, the company provides comprehensive, tightly integrated design tools and technologies to help chip and system designers cope with signal integrity challenges at every level.
For example, modeling is essential to the design and simulation of interconnects between chips. IBIS and AMI are currently the preferred ways to model SerDes channels, and the emergence of IBIS-AMI makes it possible to simulate a large number of bits quickly and accurately using simulation models. Cadence's Sigrity Advanced IBIS modeling tool allows users to create models automatically and to use wizards to generate practical algorithmic models.
In medium-reach and long-reach interconnects based on PCBs, backplanes, and connectors, developers of high-speed SerDes interfaces also need to run signal integrity (SI), power integrity (PI), and electromagnetic compatibility (EMC) co-simulation on the overall design. To analyze signals accurately and reliably, they often need a thorough understanding of data acquisition and analysis theory and must build accurate characteristic models of the devices being simulated.
To address this pain point, Cadence's Clarity 3D Solver offers a better tool choice for critical interconnect design in PCBs, IC packages, and SoICs. The high-precision S-parameter models it creates make it possible to achieve simulation results that match laboratory measurements even at 112G data rates. Its finite element method (FEM) process is highly parallelized, which can greatly shorten solution time and supports near-linear scaling with hardware computing power.
Additionally, link signal integrity analysis often carries an implicit assumption that the circuit board and the connector can be analyzed separately and then "connected" together. At very high frequencies this assumption no longer holds: there are too many interactions between the board and the connector, so comprehensive 3D analysis tools such as the Clarity 3D Solver are needed to achieve high-quality designs while accurately predicting finished product performance.
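As a small illustration of how an S-parameter model relates to the insertion loss figures quoted for SerDes channels, the sketch below converts a complex transmission coefficient S21 at a single frequency into insertion loss in dB using IL = -20*log10(|S21|), and converts a loss in dB back into a surviving amplitude ratio. The function names and sample values are hypothetical; this is generic arithmetic, not the API of the Clarity 3D Solver or any other tool.

```python
import math


def insertion_loss_db(s21):
    """Insertion loss in dB from a (possibly complex) S21 at one frequency."""
    magnitude = abs(s21)
    if magnitude <= 0:
        raise ValueError("|S21| must be positive")
    return -20.0 * math.log10(magnitude)


def amplitude_ratio(loss_db):
    """Fraction of the transmitted amplitude that survives a given loss."""
    return 10.0 ** (-loss_db / 20.0)


# Assumed sample point: S21 with magnitude 0.1 corresponds to 20 dB of loss.
print(f"{insertion_loss_db(complex(0.06, -0.08)):.1f} dB")
print(f"35 dB leaves {amplitude_ratio(35.0):.2%} of the amplitude")
```

A channel with 35 dB of insertion loss passes less than 2% of the transmitted amplitude at that frequency, which shows how small modeling errors in S21 can translate into large errors in the predicted eye opening.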
Conclusion
In the few dozen days since Jensen Huang proclaimed the "iPhone moment" of artificial intelligence, pre-trained large models and their downstream applications have multiplied at a dizzying pace. It is conceivable that in the large-model "arms race" among the major cloud computing giants, computing clusters will see a new round of investment, and communication network and interface technologies will enter a period of accelerated development. In addition, vendors taking the "super single chip" route, such as Tesla Dojo and Cerebras WSE-2, may also open a new path for large model training.
However, whichever path is taken, the essential need for interface IP is clearly visible. In this hot and difficult area, Cadence will bring more complete solutions to help ease the interconnection bottleneck and effectively unleash the unlimited possibilities of pre-trained large models for the benefit of thousands of industries.
About Cadence
Cadence is a pivotal leader in electronic systems design, building on more than 30 years of computational software expertise. Based on its Intelligent System Design strategy, Cadence delivers software, hardware, and IP that turn electronic design concepts into reality. Cadence's customers are the most innovative companies around the world, delivering extraordinary electronic products, from chips to circuit boards to complete systems, for the most dynamic application markets, including hyperscale computing, 5G communications, automotive, mobile, aerospace, consumer electronics, industrial, and medical. Cadence has been ranked among Fortune magazine's 100 Best Companies to Work For for nine consecutive years. For more information, please visit the company's website at cadence.com.
© 2023 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo and other Cadence marks listed at www.cadence.com/go/trademarks are trademarks or registered trademarks of Cadence Design Systems, Inc. All other marks are the property of their respective owners.