The evolution of artificial intelligence requires a highly adaptable inference platform (WP023)
Growing model sizes pose challenges to existing architectures
The demand for computing power for deep learning is growing at an alarming rate: the doubling period has shortened in recent years from once a year to once every three months. Increasing the capacity of deep neural network (DNN) models has delivered improvements in areas ranging from natural language processing to image processing, and deep neural networks are a key technology for real-time applications such as autonomous driving and robotics. For example, Facebook's research shows that accuracy increases roughly linearly with model size, and can be improved further by training on larger datasets.
At the cutting edge, model sizes are now growing far faster than Moore's Law, with trillion-parameter models under consideration for some applications. While few production systems will reach that extreme, the influence of parameter count on performance in these examples will ripple into real-world applications. This growth presents a challenge for implementers: if silicon scaling roadmaps cannot be relied upon exclusively, other solutions are needed to deliver increased model capacity at a cost commensurate with the scale of deployment, and that calls for customized architectures that extract the maximum performance from every available transistor.
Figure 1: Model size growth rate (Source: Linley Group). Y-axis: parameters (log scale); series: image-processing models and language-processing models.
Deep learning architectures are evolving as rapidly as parameter counts are growing. While deep neural networks continue to rely heavily on the traditional combination of convolutional, fully connected, and pooling layers, new structures have emerged, such as the self-attention networks used in natural language processing (NLP). These still demand high-speed matrix- and tensor-oriented arithmetic, but their different memory-access patterns can pose problems for graphics processing units (GPUs) and currently available accelerators.
These architectural changes mean that commonly used metrics such as tera-operations per second (TOPS) are becoming less relevant. Processing engines often cannot reach their peak TOPS ratings because the memory and data-transfer infrastructure cannot supply sufficient throughput without changing the way models are processed. For example, batching input samples is a common workaround because it often increases the parallelism available on many architectures; however, batching also increases response latency, which is frequently unacceptable in real-time inference applications.
Numerical flexibility is a way to achieve high throughput
One way to improve inference performance, and one that parallels the rapid evolution of architectures, is to adapt the numerical precision of computations to the needs of individual layers. Many deep learning models can tolerate significant precision loss and increased quantization error during inference, compared with the precision required for training, which is typically performed in single- or double-precision floating-point arithmetic. These formats support high-precision values over a very wide dynamic range, a property that matters in training, where the common backpropagation algorithm applies small changes to many weights on each pass to ensure convergence.
Floating-point operations have traditionally required extensive hardware support to process these high-resolution data types with low latency; the formats were originally developed for scientific applications on high-performance computers, where the overhead of supporting them fully was not a major concern.
Many inference deployments therefore convert models to fixed-point arithmetic, significantly reducing precision. In these cases, the impact on accuracy is usually minimal; indeed, some layers can be converted to extremely limited numeric ranges, where even binary or ternary values are viable options.
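To make this concrete, below is a minimal sketch of symmetric per-tensor int8 quantization in Python. The scale-selection rule and the random test tensor are illustrative assumptions; production toolchains typically calibrate scales per layer or per channel against real data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Quantize float weights to int8 with a single symmetric scale (zero-point 0)."""
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to floats to measure quantization error."""
    return q.astype(np.float32) * scale

# The quantization error stays small relative to the tensor's own range.
w = np.random.randn(1024).astype(np.float32) * 0.05
q, s = quantize_int8(w)
print("max abs error:", float(np.abs(w - dequantize(q, s)).max()))
```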
However, integer arithmetic is not always an efficient solution. Some filters and data layers demand a high dynamic range, and meeting it may require integer hardware to operate on 24-bit or 32-bit words, which consume far more resources than the 8-bit or 16-bit integer types that typical single-instruction, multiple-data (SIMD) accelerators support with ease.
One compromise is to use a narrow floating-point format, such as one that fits into a 16-bit word. This choice enables greater parallelism, but it does not overcome the performance barriers inherent in most floating-point data types. The problem is that both parts of the format must be adjusted after each calculation: because the most significant bit of the mantissa is not stored explicitly, the mantissa must be realigned through a series of logical shift operations, with the exponent updated to match, so that the implied leading "1" is always present. The benefit of this normalization is that every numerical value has exactly one representation, which is important for software compatibility in user applications; for many signal-processing and AI inference routines, however, it is unnecessary.
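As a toy illustration of that cost (not a full IEEE 754 model; the 10-bit fraction width is an arbitrary assumption here), the sketch below shows how the result of a multiplication must be shifted, with a matching exponent update, before the implied leading "1" is back in place:

```python
def normalize(mantissa: int, exponent: int, frac_bits: int = 10):
    """Shift an integer mantissa until bit `frac_bits` holds the leading 1."""
    if mantissa == 0:
        return 0, 0
    while mantissa >= (1 << (frac_bits + 1)):  # leading 1 too high: shift right
        mantissa >>= 1
        exponent += 1
    while mantissa < (1 << frac_bits):         # leading 1 too low: shift left
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent

# Multiplying two normalized values can leave the result unnormalized:
m, e = normalize(1536, 0)               # 1536/1024 = 1.5, i.e. 1.5 * 2^0
prod_m, prod_e = (m * m) >> 10, e + e   # 1.5 * 1.5 = 2.25, mantissa overflows
print(normalize(prod_m, prod_e))        # (1152, 1): 1.125 * 2^1 = 2.25
```

Every single result pays this shift-and-adjust step; block floating point amortizes it across a whole block of values.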
Much of the hardware overhead of these operations can be avoided by not normalizing the mantissa and adjusting the exponent after every calculation. This is the approach taken by block floating-point arithmetic, a data format long used in fixed-point digital signal processing (DSP) to improve the performance of audio algorithms in mobile devices, digital subscriber line (DSL) modems, and radar systems.
Figure 2: Block floating-point calculation example (labels: mantissa, block exponent).
With block floating-point arithmetic, there is no need to left-align each mantissa. Data elements used in a series of calculations share a common exponent, a change that simplifies the design of execution pipelines. Because the values in a block occupy similar dynamic ranges, the precision lost to rounding can be kept to a minimum, with the appropriate range selected for each calculation block at design time. Once a block completes, an exit function can round and normalize the values so that they can be used as regular floating-point values wherever needed.
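The sketch below illustrates the idea with a block floating-point dot product in Python. The 8-bit mantissas, the exponent-selection rule, and the 16-element block size are illustrative assumptions, not the internal widths of any particular device; the point is that all multiply-accumulate work happens in plain integer arithmetic, with the shared exponents applied only once, on exit.

```python
import numpy as np

def to_block_fp(x: np.ndarray, mant_bits: int = 8):
    """Quantize a block of floats to integer mantissas plus one shared exponent."""
    max_mag = float(np.max(np.abs(x)))
    if max_mag == 0.0:
        return np.zeros(len(x), dtype=np.int32), 0
    # Pick the block exponent so the largest element fits a signed 8-bit mantissa.
    block_exp = int(np.floor(np.log2(max_mag))) - (mant_bits - 2)
    mant = np.round(x / 2.0**block_exp).astype(np.int32)
    return mant, block_exp

def bfp_dot(a: np.ndarray, b: np.ndarray) -> float:
    ma, ea = to_block_fp(a)
    mb, eb = to_block_fp(b)
    acc = int(np.dot(ma.astype(np.int64), mb.astype(np.int64)))  # pure integer MACs
    return acc * 2.0**(ea + eb)  # exponents applied once, by the exit function

a, b = np.random.randn(16), np.random.randn(16)
print(bfp_dot(a, b), "vs exact:", float(np.dot(a, b)))
```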
Support for the block floating-point format is one of the features of the Machine Learning Processor (MLP), the highly flexible arithmetic logic unit provided in Achronix's Speedster®7t FPGA devices and Speedcore™ eFPGA architecture. The MLP is optimized for the dot products and similar matrix operations required by artificial-intelligence applications, and its block floating-point support delivers a substantial improvement over traditional floating point: 16-bit block floating-point operations achieve eight times the throughput of traditional half-precision floating-point operations, making them as fast as 8-bit integer operations, at only a 15% increase in active power consumption compared with integer-only operation.
Another data type that may prove important is the TensorFloat-32 (TF32) format, which reduces precision relative to the standard single-precision format while maintaining its high dynamic range. TF32 lacks the throughput advantage of block-exponent processing, but it is useful in applications where easy portability of models created in TensorFlow and similar environments is important. The flexibility of the MLP in the Speedster7t FPGA makes it possible to process TF32 arithmetic using its 24-bit floating-point mode. In addition, the MLP's high configurability supports a new, block floating-point version of TF32 in which four samples share the same exponent; this block floating-point TF32 delivers twice the density of traditional TF32.
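As a rough numerical illustration, the sketch below reduces FP32 values to TF32 precision by clearing the low 13 mantissa bits, since TF32 keeps FP32's 8-bit exponent and a 10-bit mantissa. Plain truncation is an assumption made for clarity; hardware typically rounds to nearest.

```python
import numpy as np

def to_tf32(x: np.ndarray) -> np.ndarray:
    """Keep sign, 8-bit exponent, and the top 10 of FP32's 23 mantissa bits."""
    bits = x.astype(np.float32).view(np.uint32)
    bits &= np.uint32(0xFFFFE000)  # clear the 13 low-order mantissa bits
    return bits.view(np.float32)

x = np.array([3.14159265, 1e-30, 1e30], dtype=np.float32)
print(to_tf32(x))  # magnitudes survive; only ~3 decimal digits of precision remain
```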
Figure 3: Structure of the Machine Learning Processor (MLP) (labels: wireless, AI/ML, input values, input layer, hidden layer 1, hidden layer 2, output layer).
Processing flexibility optimizes algorithm support
While the MLP's ability to support multiple data types is critical for inference applications, its power is fully unleashed only as part of the FPGA fabric. The ease of defining different interconnect structures sets FPGAs apart from most architectures, and the ability to define both interconnect and arithmetic logic simplifies the task of building a balanced architecture: designers can build direct support for custom data types and also define the interconnect structure best suited to moving data into and out of the processing engines. Reprogrammability further provides a way to keep pace with the rapid evolution of artificial intelligence; changes in the data flow of a custom layer can be accommodated simply by modifying the FPGA's logic.
A major advantage of FPGAs is that functions can easily be partitioned between optimized embedded compute engines and programmable logic implemented in lookup tables. Some functions map well onto embedded compute engines such as the Speedster7t MLP; higher-precision arithmetic, in particular, is best assigned to the MLPs, because the size of the functional units used to implement operations such as high-speed multiplication grows steeply with bit width (multiplier area scales roughly with the square of operand width).