How did the Arm architecture become the cornerstone of global computing, step by step? - EEWORLD

In the past, the name Arm mostly brought mobile phones and embedded systems to mind. In 2018, however, Arm launched Neoverse and entered the high-performance computing market. Four years on, Arm-based infrastructure has become a clear trend. As Arm CEO Rene Haas said: "All major public cloud service providers in the world are now using Arm architecture."

Arm Neoverse milestones in 2022

Chris Bergey, senior vice president and general manager of Arm's Infrastructure Business Unit, reviewed the key milestones for Arm Neoverse in 2022, including:

Arm is now used in major public clouds around the world, including those of AWS, Microsoft, Google, Alibaba, Oracle and other technology giants. AWS deserves special mention. A month ago, Amazon Vice President James Hamilton described how the company began its journey to custom chips. In 2013, Hamilton made a two-point argument to Jeff Bezos. First, given the number of chips shipping with the Arm architecture, he was sure that Arm would eventually design an excellent server CPU. Second, he had noticed that more and more functions were migrating from the motherboard into the SoC over time. The mobile phone market had already shown signs of this, and he believed servers would naturally follow suit. AWS had been building custom servers for many years and had created more value for customers through customization. But if all server innovation moved into the chip and AWS did not build chips, its ability to innovate would be limited. The conclusion Hamilton drew was that AWS needed to start building CPUs. This also prompted the acquisition of Annapurna Labs, which created the AWS Graviton series of CPUs based on Arm Neoverse.

In the field of 5G RAN, Neoverse is everywhere. At Mobile World Congress (MWC), Dell announced that it would use Marvell's OCTEON Fusion platform to develop O-RAN accelerator cards. Qualcomm has also partnered with Rakuten and HPE on solutions that are likewise based on the Arm Neoverse platform.

In the HPC market, NVIDIA released the Grace superchip for AI and high-performance computing (HPC), which is based on the latest Armv9 architecture. It packs 144 CPU cores into a single socket, is positioned as having the highest single-thread core performance, and supports Arm's new generation of vector extensions. It is said to achieve twice the memory bandwidth and energy efficiency of today's leading server chips.

At the software and system level, Arm Neoverse is also gaining more and more recognition. For example, VMware is using DPUs in its Project Monterey, Red Hat's OpenShift supports the Arm architecture, SAP is migrating SAP HANA's cloud infrastructure to AWS Graviton, and HPE's ProLiant Gen11 platform is equipped with the Ampere Altra processor based on Arm Neoverse.

Neoverse-based processors have also recorded a series of firsts, including:

The first CPU to exceed 1 terabyte per second of total memory bandwidth

The first CPU with more than 100 cores on a single die, with the number of cores reaching 128

The first CPU to bring DDR5 and PCIe Gen5 to market

The first CPU to break the 500 integer score in the SPEC CPU 2017 benchmark

Arm releases latest Neoverse roadmap

"Arm architecture is the cornerstone of the future of global computing," said Bergey. "Today's infrastructure is customized, from SSDs to HDDs and from DPUs to video accelerators. The server CPU is the last standard product, and it will not continue to develop as a general-purpose product. At the same time, computing workloads are growing rapidly and becoming more complex, with ML and AI increasingly replacing traditional workloads. Another problem is power consumption. Currently, electricity accounts for 30-40% of the total cost of ownership (TCO) at large Internet companies, which is only slightly lower than at telecommunications network operators."

For this reason, Arm announced the latest Neoverse roadmap to meet infrastructure upgrade requirements.

Neoverse is divided into V, N and E series cores, each targeting a different performance profile. The V series pursues maximum performance, the N series balances performance and efficiency for scale-out workloads, and the E series focuses on power-efficient throughput.

As shown in the figure, Arm has announced a detailed roadmap and upgrade plan for each of the V, N and E series.

Dermot O'Driscoll, vice president of product solutions at Arm's Infrastructure Business Unit, said that per-chip performance and single-thread performance are two key indicators for cloud decision makers. Single-thread performance indicates whether decision makers can migrate their most demanding "scale-up", high-performance workloads to Arm. Per-chip performance is the key to maximizing return on investment across the large number of "scale-out" workloads running on the platform. "AWS Graviton3 using Arm Neoverse V1 cores can provide the highest single-thread performance, and even the upcoming competing CPUs cannot shake its leading position. We expect Graviton3 to provide excellent price-performance and performance per watt, while Ampere Altra Max and Alibaba's Yitian 710 can provide the best single-chip throughput among all CPUs."

Beyond hardware, O'Driscoll also mentioned that Arm has been working hard to implement and optimize full-stack solutions, from architecture and IP to libraries, runtime environments and compilers, in order to achieve optimal performance across the entire range of infrastructure software.

Actual test results also show that Arm has matched or even surpassed traditional architectures in infrastructure workloads. Taking the mainstream database MongoDB as an example, O'Driscoll compared AWS instances based on Graviton2 with instances based on Intel Xeon and found that MongoDB performance was 117% better than on the x86 architecture.

O'Driscoll also said that as machine learning becomes more popular, Neoverse V1 includes a set of features and supporting software work specifically designed to enhance the performance of ML applications. These include:

Adding Bfloat16 (BF16) support to the architecture

Tuning the microarchitecture of V1, N2 and subsequent designs to improve BF16 execution with BERT

Adding BF16 support to Arm Compute Library (ACL)

Integrating ACL into the oneDNN ML library

Using oneDNN with TensorFlow to run BERT

Similarly, Arm ran BERT on AWS EC2 C7g instances based on V1 cores and compared the results with C6i instances using the latest Xeon cores. The BF16-optimized stack on the Arm architecture performed 80% better than the Intel one. At the same time, the addition of BF16 and Int8 MatMul in V1 means that ML models can be stored more compactly in memory and therefore require less memory bandwidth, making Graviton3's ML performance 3 times that of Graviton2.
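To illustrate why BF16 lets a model sit more compactly in memory, here is a minimal Python sketch. It relies on the fact that a BF16 value has the same sign bit and 8 exponent bits as a float32 but only 7 mantissa bits, so it occupies 2 bytes instead of 4. The helper names and the parameter count are hypothetical, and the conversion uses plain truncation for simplicity, whereas real hardware and libraries typically round.

```python
import struct


def float32_to_bf16_bits(x: float) -> int:
    """Return a 16-bit BF16 pattern by truncating the float32 encoding of x.

    Simplification: truncation only, no rounding (assumption for illustration).
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def bf16_bits_to_float32(bits16: int) -> float:
    """Expand a BF16 bit pattern to a float by zero-filling the low 16 bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


def footprint_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory needed to store num_params weights at the given width."""
    return num_params * bytes_per_param


if __name__ == "__main__":
    for w in [1.0, -2.5, 3.14159]:
        approx = bf16_bits_to_float32(float32_to_bf16_bits(w))
        print(f"{w} -> {approx}")

    params = 100_000_000  # hypothetical model size, not a figure from the article
    print("float32 bytes:", footprint_bytes(params, 4))
    print("bf16 bytes:   ", footprint_bytes(params, 2))
```

Values that need few mantissa bits, such as 1.0 and -2.5, survive the round trip exactly, while others lose some precision. In exchange, the same weights take half the storage and half the memory traffic.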

Turning to the Neoverse V2 platform, O'Driscoll said that it can simultaneously meet three customer needs: "want to improve the performance of cloud workloads", "continue to advance single-threaded performance while balancing power consumption and area", and "ship as soon as possible to reach the market quickly."

In terms of performance, Neoverse V2 will provide market-leading integer performance. Arm currently bases its estimates on SPEC integer rate and has been running a variety of cloud infrastructure workloads on its models to tune the microarchitecture. The results across the whole set make O'Driscoll "very excited". For workloads like HPC that are rapidly migrating to the cloud, vector performance is still important. With Neoverse V2, Arm has completed the transition from SVE to SVE2, which helps serve more non-HPC, ML-type workloads while adding more cryptographic instructions. In addition, the vector engine has been reorganized into four 128-bit pipelines, and the microarchitecture has been adjusted to increase its effective throughput.
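As a rough illustration of what the 4 x 128-bit vector configuration mentioned above means for different data types, the following Python sketch computes how many elements fit across the four 128-bit units at once. It assumes, purely for illustration, that each unit can start one full-width vector operation per cycle. This is a peak figure under that assumption, not a measured result, and the function name is hypothetical.

```python
def elements_per_cycle(pipes: int, vector_bits: int, element_bits: int) -> int:
    """Peak elements processed per cycle across all vector pipes.

    Assumes every pipe issues one full-width vector operation per cycle
    (an idealized assumption, ignoring latency and issue restrictions).
    """
    if vector_bits % element_bits != 0:
        raise ValueError("element width must divide the vector width")
    return pipes * (vector_bits // element_bits)


PIPES = 4          # four vector pipes, per the article
VECTOR_BITS = 128  # each 128 bits wide, per the article

if __name__ == "__main__":
    for name, bits in [("FP64", 64), ("FP32", 32), ("BF16", 16), ("INT8", 8)]:
        print(name, elements_per_cycle(PIPES, VECTOR_BITS, bits))
```

Under this assumption, narrower types such as BF16 and INT8 benefit most, because more elements fit into each 128-bit vector.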

Neoverse V2 has also made a series of improvements at the system, I/O and security layers, as the performance of NVIDIA's Grace superchip demonstrates.

O'Driscoll did not reveal more about the progress of the N and E series, saying only that the N series product line will be updated next year. In terms of market adoption, nearly 20 customers are currently building designs based on the N2 platform.

O'Driscoll said: "The infrastructure market is being redefined, centered on Arm's high-performance, scalable and efficient computing, and enhanced by dedicated processing from our partners. Building on the principles of the Arm Neoverse platform roadmap, we will lay a new starting point for global computing infrastructure." His remarks serve as both a summary of the four years since Arm Neoverse was born and an outlook on what comes next.
