Exclusive interview with Biren Technology executives: Deconstructing the company’s first 7nm GPU
BiRen Technology, a domestic high-end GPU chip company, recently unveiled its first general-purpose GPU chip, the BR100, at a product launch conference.
According to Zhang Wen, founder, chairman and CEO of BiRen Technology, this chip, built on a 7nm process, has set a global computing power record, with peak computing power more than three times that of the flagship products on sale from international manufacturers, and has also set a domestic interconnect bandwidth record. It is also the first general-purpose GPU chip in China to adopt Chiplet technology, the first to adopt the new-generation host interface PCIe 5.0, and the first to support the CXL interconnect protocol.
"The official release of BR100 marks the first time that a global general-purpose GPU computing power record has been set by a Chinese company. China's general-purpose GPU chips have officially entered a new era of 'one quadrillion calculations per second,'" Zhang Wen emphasized.
Thanks to its strong technical team and its appeal to investors, BiRen Technology has drawn great attention from the industry since its establishment three years ago. Now, after more than a thousand days and nights, it has finally delivered a highly competitive product. As mentioned above, the chip also incorporates many innovative designs.
To help readers understand Biren Technology's chip and the logic behind the team's product design, Semiconductor Industry Observer reporters recently interviewed several Biren Technology executives. Together they deconstructed the company's first 7nm GPU and tried to reveal this GPU upstart's thinking on future product planning.
The hardware confidence to challenge 4nm with 7nm
According to BiRen Technology, the BR100, manufactured on TSMC's 7nm process, has a chip area of 1074 mm² and integrates 77 billion transistors. These figures are made possible by combining a powerful original architecture with Chiplet and 2.5D CoWoS advanced packaging technology, which lets the design balance high yield against high performance. Thanks to this design, the performance of BR100 is comparable to that of Nvidia's 4nm H100, released in 2022. Compared with Nvidia's 7nm A100, released in 2020, BR100 can achieve a three-fold performance improvement.
To give a more intuitive sense of what the BR100 design achieves, here are some reference figures. Nvidia's 7nm A100 has a chip area of 828 mm² and 53.2 billion transistors; its 4nm H100 has a chip area of 814 mm² and 80 billion transistors.
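A quick way to read these figures is to convert them into transistor density. The short Python sketch below uses only the numbers quoted in this article (vendor datasheets may differ slightly); the function and variable names are illustrative.

```python
# Figures as quoted in the article: (chip area in mm^2, transistors in billions).
chips = {
    "BR100 (7nm)": (1074, 77.0),
    "A100 (7nm)": (828, 53.2),
    "H100 (4nm)": (814, 80.0),
}

def density_mtr_per_mm2(area_mm2, transistors_billion):
    """Return transistor density in millions of transistors per mm^2."""
    return transistors_billion * 1000 / area_mm2

densities = {name: density_mtr_per_mm2(a, t) for name, (a, t) in chips.items()}

for name, d in densities.items():
    print(f"{name}: {d:.1f} MTr/mm^2")

# How much more silicon BR100 uses than H100 to reach a similar transistor count.
area_ratio = chips["BR100 (7nm)"][0] / chips["H100 (4nm)"][0]
print(f"BR100 area / H100 area: {area_ratio:.2f}x")
```

On these numbers, BR100's density sits between the two Nvidia chips, so it reaches a transistor count close to that of H100 mainly by using a larger total silicon area rather than a denser process.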
When asked how the company achieves performance comparable to Nvidia's advanced-process chips on a relatively older process, Hong Zhou, co-founder and CTO of BiRen Technology, told reporters: "Our completely independent original architecture, advanced packaging technology, ultra-large chip scale and abundant on-chip cache give us the confidence."
"In terms of micro-architecture, we focus on the design of general-purpose computing cores, use a powerful tensor computing engine to accelerate calculations, and use a self-developed instruction set to implement various functions more efficiently. The self-developed GPGPU architecture and instruction set are matched with a multi-level storage architecture that enables data reuse in large-model training, and the NoC-based communication architecture supports data multicast, which greatly reduces the demand for off-chip bandwidth and significantly reduces power consumption."
It is reported that Biren Technology's architecture, named "Biren", is centered on data flow and deeply optimized for it. Through six major technical features, it largely resolves the bottleneck of data movement and the problem of insufficient parallelism, enabling the BR100 chip to achieve a leap in performance and energy efficiency on a given process.
On the choice of advanced packaging, Hong Zhou said that, as far as the chip itself is concerned, the main reason for adopting new technologies such as Chiplet and 2.5D CoWoS advanced packaging is to balance performance against cost at the level of the whole system-level chip and to preserve the "economic benefits" of Moore's Law. On the one hand, Chiplet improves the yield of large chips: by "splicing" two smaller dies into one large chip, it reduces the cost of yield loss. On the other hand, CoWoS increases interconnect density and maximizes the interconnect speeds between the SoC and HBM and between dies (Die-to-Die).
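The yield argument can be illustrated with the textbook Poisson defect model, in which the share of defect-free dies falls exponentially with die area. The sketch below is not Biren Technology's or TSMC's data: the defect density is a hypothetical value, and the 1074 mm² total is assumed to be split evenly across two dies.

```python
import math

def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of defect-free dies under a simple Poisson defect model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)

TOTAL_AREA_MM2 = 1074   # BR100 total area quoted in the article
D0 = 0.1                # hypothetical defect density (defects per cm^2)

# One monolithic die: any defect scraps the whole 1074 mm^2.
monolithic = poisson_yield(TOTAL_AREA_MM2, D0)

# Two equal chiplets (assumption): a defect scraps only one half-size die,
# and known-good dies are paired after testing.
chiplet = poisson_yield(TOTAL_AREA_MM2 / 2, D0)

# Silicon consumed per good product, relative to the product's own area.
silicon_per_good_monolithic = 1 / monolithic
silicon_per_good_chiplet = 1 / chiplet

print(f"monolithic yield: {monolithic:.1%}")
print(f"per-chiplet yield: {chiplet:.1%}")
print(f"silicon per good product: {silicon_per_good_monolithic:.2f}x vs "
      f"{silicon_per_good_chiplet:.2f}x")
```

Real yield models also account for defect clustering, partial-good salvage and packaging losses, so the absolute values here are only indicative; the point is the direction of the effect.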
"The product form can also be enriched by using chiplets. For example, the BR104 released by Biren Technology this time is a single-die product, while the BR100 is a dual-die product using chiplet technology. Two products are formed in one tapeout, each with its own characteristics. advantages and focus, covering a wider range of application markets,” Hongzhou added.
In addition to innovations in chip architecture and design, BiRen Technology also provides rich support for BR100 interfaces, such as support for PCIe 5.0 interface technology and CXL communication protocol, with a bidirectional bandwidth of 128 GB/s; original BLINK high-speed GPU interconnection technology, with a single-card interconnection bandwidth of 512 GB/s, and support for full interconnection of 8 cards in a single node.
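The 128 GB/s figure matches the raw bandwidth of a 16-lane PCIe 5.0 link. The sketch below assumes an x16 link (the lane count is not stated in the article) and, for BLINK, assumes the 512 GB/s per card is shared evenly among the other seven cards in a fully interconnected 8-card node; both are illustrative assumptions.

```python
def pcie_bandwidth_gb_s(gt_per_s, lanes, encoding=128 / 130):
    """Per-direction PCIe bandwidth in GB/s: (raw, after line-encoding overhead)."""
    raw = gt_per_s * lanes / 8   # each transfer carries 1 bit per lane
    return raw, raw * encoding

# PCIe 5.0 runs at 32 GT/s per lane with 128b/130b encoding; x16 is assumed.
raw_one_way, effective_one_way = pcie_bandwidth_gb_s(32, 16)
raw_bidirectional = 2 * raw_one_way

BLINK_PER_CARD_GB_S = 512    # from the article
CARDS_PER_NODE = 8           # from the article
peers = CARDS_PER_NODE - 1
per_peer_gb_s = BLINK_PER_CARD_GB_S / peers   # assumption: even split across peers

print(f"PCIe 5.0 x16: {raw_bidirectional:.0f} GB/s bidirectional (raw), "
      f"{effective_one_way:.1f} GB/s per direction after encoding")
print(f"BLINK: about {per_peer_gb_s:.0f} GB/s per peer across {peers} peers")
```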
According to Hong Zhou, CXL addresses how memory resources are used between CPU and GPU and between GPU and GPU. In computing servers in particular, there is a large gap between the limited memory available and the large number of computing devices that need it, and CXL is currently the most effective protocol for closing it. As for PCIe 5.0, Intel, AMD and others will launch server CPUs based on PCIe 5.0, and Intel's next-generation Sapphire Rapids platform will support both PCIe 5.0 and CXL, arriving at about the same time as Biren Technology's products.
"In terms of multi-card interconnection, by using a high-speed SerDes technology called BLINK™, BR100 can support multiple ports and can achieve point-to-point interconnection of 8 cards to meet the needs of data exchange between multiple cards for large-scale AI training," Hong Zhou told reporters. It is worth mentioning that its speed of 512 GB/s has set a record for domestic interconnect bandwidth.
BiRen Technology has integrated 32 SPCs on BR100, each of which has 16 EUs (Execution Units); every 4 EUs can be configured as 1 CU (Compute Unit), and each SPC supports a total of 4096 threads. Each EU also has 16 general-purpose stream processors and a dedicated tensor engine, T-Core, which uses a systolic 3D GEMM architecture.
In general, BR100 has a huge scale of parallel computing resources, including 8192 general-purpose stream processors, 512 groups of dedicated tensor acceleration engines, 128K threads and 256MB distributed shared L2 cache, etc. It is worth mentioning that L2 can also support data sharing between multiple SPCs (8MB/SPC) and near memory computing at different levels, and can be configured as a large-capacity scratchpad.
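The chip-level totals follow directly from the per-SPC and per-EU figures given above. The sketch below simply multiplies them out as a consistency check; the constant names are illustrative.

```python
# Per-unit figures quoted in the article.
SPCS = 32
EUS_PER_SPC = 16
EUS_PER_CU = 4
STREAM_PROCESSORS_PER_EU = 16
TCORES_PER_EU = 1
THREADS_PER_SPC = 4096
L2_MB_PER_SPC = 8

totals = {
    "stream_processors": SPCS * EUS_PER_SPC * STREAM_PROCESSORS_PER_EU,
    "tensor_engines": SPCS * EUS_PER_SPC * TCORES_PER_EU,
    "compute_units": SPCS * EUS_PER_SPC // EUS_PER_CU,
    "threads": SPCS * THREADS_PER_SPC,
    "l2_cache_mb": SPCS * L2_MB_PER_SPC,
}

for name, value in totals.items():
    print(f"{name}: {value}")
```

The products reproduce the headline numbers: 8192 stream processors, 512 tensor engines, 131072 (128K) threads and 256MB of L2 cache.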
The Biren Technology team also anticipated customers' video processing needs, in particular the number of video channels a single chip can handle. They therefore built a rich codec configuration into BR100 from the start of the design. The single-die version supports 32 channels of encoding and 256 channels of decoding, and the dual-die version supports 64 channels of encoding and 512 channels of decoding (each channel at FHD @ 30 fps), which can greatly reduce the per-channel cost of video processing.
In addition to the above features, BR100 has several other highlights that cannot be ignored. In terms of the memory system, it is equipped with 64GB of HBM2E off-chip memory with a total bandwidth of 1.64TB/s, and with more than 300MB of on-chip high-speed SRAM. In terms of multi-precision support, in addition to native support for mainstream data precisions such as FP32, BF16, FP16 and INT8, it defines an original TF32+ data precision, which offers higher precision and throughput than TF32. In terms of secure virtual instances (SVI), it supports up to 8 independent instances, each physically isolated, equipped with its own hardware resources, and able to run independently. In terms of national cryptography security specifications, a dedicated hardware encryption and decryption IP supports commonly used encryption algorithms such as AES and complies with the level 1 national cryptography security specification. In terms of OCP-specification hardware, the OAM module complies with the OAM 1.1 specification, supports air cooling at up to 550W TDP, and realizes full interconnection of 8 cards on a UBB motherboard in accordance with the OCP specification.
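The 1.64TB/s memory bandwidth is consistent with a common HBM2E configuration. The sketch below assumes four stacks, each with a 1024-bit interface, running at 3.2 Gbit/s per pin; the stack count and pin rate are assumptions made for illustration and are not figures given in this article.

```python
STACKS = 4                # assumption: not stated in the article
BUS_WIDTH_BITS = 1024     # interface width of one HBM stack
PIN_RATE_GBIT_S = 3.2     # assumption: a common HBM2E speed grade
TOTAL_CAPACITY_GB = 64    # from the article

# Aggregate bandwidth: stacks x pins x bits per second per pin, converted to bytes.
bandwidth_gb_s = STACKS * BUS_WIDTH_BITS * PIN_RATE_GBIT_S / 8
capacity_per_stack_gb = TOTAL_CAPACITY_GB / STACKS

print(f"{bandwidth_gb_s / 1000:.2f} TB/s total, "
      f"{capacity_per_stack_gb:.0f} GB per stack")
```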
It is precisely because of this hardware configuration that the BR100 of BiRen Technology has achieved the superior performance shown in the figure above, and the BR104 mentioned above is not inferior either. Based on these two chips, the company launched the OAM module BiRen™100 and the PCIe board BiRen™104 to better support the computing market.
In order to better serve customers, Biren Technology has also invested heavily in software and its ecosystem.
Software and ecosystem: neither can be neglected
"We hope to build Biren Technology into the leading GPU manufacturer in China, develop distinctive and competitive software and hardware, earn customers' recognition, keep strengthening our capacity for self-sustained development, and become a successful chip design company," Xu Lingjie, co-founder and president of Biren Technology, emphasized in response to the author's question.
He also pointed out: "From a customer perspective, large computing power and versatility are always at the core of accelerated computing scenarios in data centers. It is necessary to continuously optimize every detail of software and hardware to obtain the ultimate commercial advantage. This is a long-term and very important job, of great significance and value, and it is also the only way for chip companies to succeed."
From this answer we can see the importance of software in BiRen Technology's future planning.
In fact, this is not a problem faced by Biren Technology alone. It can be said with certainty that any GPU or AI chip company, domestic or foreign, that wants to break into the AI cloud market must cross a threshold set by industry leader NVIDIA more than ten years ago: CUDA.
According to NVIDIA, CUDA is a parallel computing platform and programming model invented by NVIDIA. It greatly improves computing performance by leveraging the processing power of graphics processing units (GPUs). So far, millions of CUDA-based GPUs have been sold, and software developers, scientists, and researchers are using CUDA in various fields, including image and video processing, computational biology and chemistry, fluid dynamics simulation, CT image reconstruction, seismic analysis, ray tracing and more.
This threshold is particularly high in the artificial intelligence market that GPGPU and AI chip makers want to enter. Because Nvidia moved first, many manufacturers now do much of their development and deployment on CUDA. To spare developers the trouble of porting and redevelopment, "CUDA compatibility" has become an important factor that every new GPU and AI chip needs to consider. BiRen Technology is no exception.
According to Hong Zhou, the main reason for this situation is that NVIDIA's CUDA ecosystem runs very deep, and almost all existing developers are CUDA users. "CUDA compatibility" has therefore become a way to reuse the existing ecosystem in the short term.
"However, because CUDA is closed-source and updated rapidly, it is difficult for latecomers to achieve perfect compatibility through instruction translation and similar methods. Even partial compatibility causes a large performance loss, leaving them persistently behind NVIDIA in cost performance. Moreover, CUDA is after all NVIDIA's proprietary software stack and contains many features specific to NVIDIA GPU hardware that are not reflected in other manufacturers' chips," Hong Zhou said. For this reason, Biren Technology believes that, in order to fully master the core software technology and fit its own hardware perfectly, the company needs to design and develop its own independent and original programming model and software stack.
This is why they created the BIRENSUPA (BIREN Scalable Unified Parallel Architecture) platform. According to reports, the platform is a heterogeneous computing platform that supports the development of deep learning and general computing applications on Biren Technology’s hardware devices. At its core is the SUPA programming model and tool chain.
Relevant information shows that the complete software stack of the BIRENSUPA platform includes firmware, drivers, compilers, tools, programming models, libraries, machine learning frameworks, and end-to-end application SDKs. In terms of deep learning frameworks, it is compatible with mainstream frameworks such as TensorFlow, PyTorch, and PaddlePaddle. The platform also supports BiRen Technology's self-developed high-performance inference engine and adapts to third-party inference engines, allowing customers to smoothly migrate existing GPU code.
Hong Zhou pointed out that BIRENSUPA's programming paradigm and language style are very similar to CUDA's, while it also supports the unique features of BiRen Technology's hardware; with thorough documentation, the cost of migration for users is very low. Looking to the future, BiRen Technology will continue to build on BIRENSUPA's core layer to support more end-to-end business scenarios.
While building its software development platform, BiRen Technology also supports multiple precisions and a growing range of models to solve customer problems, making important investments in product software and ecosystem construction. The self-defined TF32+ data precision mentioned above is one example.
According to Hong Zhou, TF32+ is a data precision format defined by Biren Technology. Like TF32, which was first introduced by NVIDIA, it was created to meet the precision requirements of AI workloads. Compared with traditional FP32, both TF32 and TF32+ significantly improve computing throughput. Compared with TF32, TF32+ adds 5 more mantissa bits while keeping the same representable range, thereby achieving higher precision and performance than TF32. It is well suited to workloads dominated by multiply-add operations and is an excellent alternative to single-precision matrix calculations.
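The effect of the extra mantissa bits can be sketched by rounding an FP32 value to fewer mantissa bits while leaving its 8-bit exponent untouched. NVIDIA's TF32 keeps 10 mantissa bits; the 15-bit layout used below for TF32+ is inferred from the "5 more mantissa bits" description and is an assumption, not a published specification. The helper is illustrative and handles only ordinary finite values.

```python
import struct

def round_to_mantissa_bits(x, mantissa_bits):
    """Convert x to FP32, then keep `mantissa_bits` explicit mantissa bits
    using round-to-nearest-even. The sign and 8-bit exponent are unchanged
    unless rounding carries into the exponent."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - mantissa_bits              # FP32 stores 23 mantissa bits
    if drop > 0:
        half = 1 << (drop - 1)
        lsb = (bits >> drop) & 1           # lowest bit that will be kept
        bits = (bits + half - 1 + lsb) & 0xFFFFFFFF
        bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

TF32_MANTISSA = 10                          # 1 sign + 8 exponent + 10 mantissa
TF32_PLUS_MANTISSA = TF32_MANTISSA + 5      # assumption inferred from the article

x = 1 / 3
err_tf32 = abs(round_to_mantissa_bits(x, TF32_MANTISSA) - x)
err_tf32_plus = abs(round_to_mantissa_bits(x, TF32_PLUS_MANTISSA) - x)

print(f"TF32  rounding error for 1/3: {err_tf32:.3e}")
print(f"TF32+ rounding error for 1/3: {err_tf32_plus:.3e}")
```

Because the exponent field is the same in both cases, the representable range is unchanged; only the spacing between neighboring values shrinks, by a factor of 32 with five extra bits.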
"Transformer occupies an indispensable position in today's deep learning field. It is widely used in natural language processing, image processing and other fields and has a great influence. BiRen Technology also attaches great importance to the Transformer model and lists it as one of the first models supported." said Hong Zhou.
In order to promote the company's GPU products, BiRen Technology is also working with partners to promote the adaptation of the company's products and servers. Xu Lingjie also pointed out at the product launch conference that the company has signed strategic cooperation agreements with leading customers in the Internet, cloud computing, finance, communications, data centers and other industries. For some key customers, the company has already started product adaptation and introduction testing.
At the same time, Biren Technology joined hands with Inspur to release the "Haixuan" OAM server and cluster solution for cloud training in data centers. According to reports, this OAM server has set a global computing power record: it delivers 8 PFLOPS of floating-point computing power, supports a PCIe 5.0 host interface and the CXL interconnect protocol, and offers 1.8TB/s of bisection interconnect bandwidth, 512 GB of HBM2E memory and a maximum power consumption of 7 kW.
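These server-level figures line up with the per-module numbers quoted earlier if the server carries eight OAM modules. The module count is an assumption here, inferred from the 8-card full interconnection described above; the sketch below is a simple consistency check.

```python
MODULES = 8                  # assumption: one fully interconnected 8-card node
SERVER_PFLOPS = 8            # from the article
SERVER_HBM_GB = 512          # from the article
SERVER_MAX_POWER_W = 7000    # from the article (7 kW)
MODULE_TDP_W = 550           # OAM module TDP quoted earlier in the article

per_module_pflops = SERVER_PFLOPS / MODULES
per_module_hbm_gb = SERVER_HBM_GB / MODULES
gpu_power_w = MODULES * MODULE_TDP_W
other_power_w = SERVER_MAX_POWER_W - gpu_power_w

print(f"per module: {per_module_pflops:.0f} PFLOPS, {per_module_hbm_gb:.0f} GB HBM2E")
print(f"GPU modules: {gpu_power_w} W, rest of the server: {other_power_w} W")
```

Under that assumption, each module contributes 1 PFLOPS and 64 GB of HBM2E, matching the single-chip figures, and the GPU modules account for 4.4 kW of the 7 kW power budget.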
Based on such powerful performance, BiRen Technology has created a complete set of high-performance, cost-effective cluster computing infrastructure solutions for the market. Compared with the data center solutions of international manufacturers, this data center cluster solution uses only 1/3 of the number of servers to achieve higher floating point computing power, lower peak energy consumption and floor space, and can reduce the required standard coal power generation by 64%, making it have the characteristics of high energy efficiency, practicality, economy and environmental coordination.
At the same time, Biren Technology also announced the establishment of cooperation with Ping An Technology, China Mobile and GDS on the computing power provided by Biren Technology chips.
In addition to cooperation with industry, integration with academic research is another major lever for BiRen Technology in promoting the development of its products. BiRen Technology also emphasized that the company has always adhered to the concept of rejuvenating the country through science and education, and believes that when industry, universities and research institutions cooperate and play to their respective strengths, they will inevitably form a powerful integrated system of research, development and production in which 1+1>2. By establishing joint laboratories, special research programs and joint innovation centers, and by developing science and technology courses with many top domestic universities such as Tsinghua University, Fudan University, Shanghai Jiaotong University and Zhejiang University, BiRen Technology hopes to form close industry-university-research cooperation in chip architecture innovation, technology exploration and application cooperation.
"A century-old foundation that lasts forever"
At the press conference, Zhang Wen revealed the company's development goal: to build a long-lasting business. From the relevant disclosures, we can also see that BiRen Technology's product reach has extended from GPGPU to graphics GPUs.
It is true that under the influence of the current global geopolitical situation and the current domestic chip situation, it is necessary and inevitable to build a "century-old" local GPU company. But as many analysts have said, this is not an easy task for a chip startup. The above investment and actions of BiRen Technology are the strongest driving force for the company to move towards this goal.
Li Xinrong, co-CEO of Biren Technology, further pointed out: "Before Biren Technology, many AI chip and GPU chip companies were constantly following in the footsteps of NVIDIA's products, and the products they benchmarked against were always one step behind. Starting with Biren Technology, the products we design and develop will benchmark against NVIDIA's contemporary and even next-generation products, truly competing head-on with mainstream international products."
In an interview, Li Xinrong also told the author that, in addition to continuing to focus on GPUs, the company has seen "3U integration" (CPU, GPU and DPU) become a development trend in the data center and cloud computing industry. BiRen Technology has therefore begun building up technology in this field and expanding its data center technology ecosystem. The aim is to benchmark comprehensively against international manufacturers across all data center scenarios, while establishing technical barriers in key technologies and accumulating experience in building new, high-efficiency data centers. Work on CPUs and DPUs gives the company core capabilities on the server host side and the network side, and strengthens the end-to-end technical capabilities of BiRen Technology's GPUs in the data center.
*Disclaimer: This article is original by the author. The content of the article is the personal opinion of the author. The reprinting by Semiconductor Industry Watch is only to convey a different point of view. It does not mean that Semiconductor Industry Watch agrees or supports the view. If you have any objections, please contact Semiconductor Industry Watch.