Meta recently released Llama 3, its open-source large language models with 8 billion and 70 billion parameters. The models introduce improved reasoning and additional model sizes, and use a new tokenizer that encodes language more efficiently, improving model performance.
Immediately after the release, Intel verified that Llama 3 runs on its broad AI product portfolio, including Intel® Xeon® processors, and disclosed inference performance figures for the upcoming Intel® Xeon® 6 processor (code-named Granite Rapids) on Meta Llama 3.
Intel Xeon processors can handle demanding end-to-end AI workloads. Taking the fifth-generation Xeon processor as an example, every core has a built-in Intel® AMX acceleration engine that delivers strong AI inference and training performance, and the processor has already been adopted by many mainstream cloud service providers. In addition, Xeon processors offer low latency for general-purpose computing and can handle multiple workloads simultaneously.
In fact, Intel has continuously optimized large-model inference on the Xeon platform. For example, compared with the software available at the Llama 2 launch, the latest PyTorch and Intel® Extension for PyTorch releases reduce latency by a factor of five. The optimization relies on the PagedAttention algorithm and tensor parallelism, which maximize the available compute and memory bandwidth. The figure below shows the inference performance of the 8-billion-parameter Meta Llama 3 model on an AWS m7i.metal-48xl instance, which is based on fourth-generation Intel Xeon Scalable processors.
Figure 1: Next Token Latency for Llama 3 on an AWS Instance
In addition, Intel disclosed for the first time performance results for its upcoming Intel® Xeon® 6 processor (code-named Granite Rapids) on Meta Llama 3. The results show that, compared with the fourth-generation Xeon processor, the Intel Xeon 6 processor halves the inference latency of the 8-billion-parameter Llama 3 model, and can run larger models such as the 70-billion-parameter Llama 3 on a single dual-socket server with a next-token latency under 100 milliseconds.
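The next-token latency budget above translates directly into per-stream decode throughput. A minimal sketch of that arithmetic (the 100 ms figure comes from the article; the function name is just illustrative):

```python
# Convert a next-token latency budget into single-stream decode throughput.
def tokens_per_second(next_token_latency_ms: float) -> float:
    """Tokens generated per second on one decode stream."""
    return 1000.0 / next_token_latency_ms

# At the article's 100 ms budget, one stream decodes at least 10 tokens/s.
print(tokens_per_second(100.0))  # → 10.0
```

Batching multiple streams raises aggregate throughput further, at the cost of per-stream latency.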
Figure 2: Next Token Latency for Llama 3 on the Intel® Xeon® 6 Processor (Code-named Granite Rapids)
Because Llama 3 ships with a more efficient tokenizer, the test used randomly selected prompts to make a quick comparison between Llama 3 and Llama 2. For the same prompt, Llama 3 produced 18% fewer tokens than Llama 2. As a result, even though the 8-billion-parameter Llama 3 model is larger than the 7-billion-parameter Llama 2 model, overall prompt inference latency was nearly identical when running BF16 inference on an AWS m7i.metal-48xl instance (Llama 3 was 1.04 times faster than Llama 2 in this evaluation).
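The interplay between the 18% token reduction and the larger model can be illustrated with back-of-the-envelope numbers. All token counts and per-token latencies below are hypothetical, chosen only to show how fewer tokens can offset a slightly slower per-token step:

```python
# Hypothetical token counts for one prompt, reflecting the reported
# ~18% reduction from Llama 3's tokenizer relative to Llama 2.
llama2_tokens = 100
llama3_tokens = 82  # 18% fewer tokens for the same prompt

reduction = 1 - llama3_tokens / llama2_tokens
print(f"token reduction: {reduction:.0%}")  # → token reduction: 18%

# Hypothetical per-token latencies: the 8B Llama 3 is assumed somewhat
# slower per token than the 7B Llama 2 because it is larger.
per_token_ms_l2 = 30.0
per_token_ms_l3 = 35.2

total_l2 = llama2_tokens * per_token_ms_l2  # 3000.0 ms end to end
total_l3 = llama3_tokens * per_token_ms_l3  # 2886.4 ms end to end
print(round(total_l2 / total_l3, 2))        # → 1.04
```

With these assumed numbers, the end-to-end speedup lands at roughly the 1.04x reported in the evaluation, even though each token step is slower.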
Developers can find instructions for running Llama 3 on Intel Xeon platforms here.
Product and Performance Information
Intel Xeon Processors:
Tested on Intel® Xeon® 6 processor (formerly codenamed Granite Rapids), using 2x Intel® Xeon® Platinum, 120 cores, HT on, Turbo on, NUMA 6, integrated accelerators available [used]: DLB[8], DSA[8], IAA[8], QAT[8], total memory 1536GB (24x64GB DDR5 8800 MT/s [8800 MT/s]), BIOS BHSDCRB1.IPC.0031.D44.2403292312, microcode 0x810001d0, 1x Ethernet controller I210 Gigabit network connection, 1x SSK storage 953.9G, Red Hat Enterprise Linux 9.2 (Plow), kernel 6.2.0-gnr.bkc.6.2.4.15.28.x86_64, based on testing by Intel on April 17, 2024.
Tested on 4th Generation Intel® Xeon® Scalable processors (formerly codenamed Sapphire Rapids), using AWS m7i.metal-48xl instance, 2x Intel® Xeon® Platinum 8488C, 48 cores, HT on, Turbo on, NUMA 2, integrated accelerators available [used]: DLB[8], DSA[8], IAA[8], QAT[8], total memory 768GB (16x32GB DDR5 4800 MT/s [4400 MT/s]; 16x16GB DDR5 4800 MT/s [4400 MT/s]), BIOS Amazon EC2, microcode 0x2b0000590, 1x Ethernet Controller Elastic Network Adapter (ENA), Amazon Elastic Block Store (EBS) 256G, Ubuntu 22.04.4 LTS, kernel 6.5.0-1016-aws, based on Intel testing as of April 17, 2024.