The evolution of artificial intelligence requires a highly adaptable inference platform (WP023)

Publisher: EE小广播 | Last updated: 2021-07-07 | Source: EEWORLD

The growing size of models poses challenges to existing architectures


The demand for computing power for deep learning is growing at an alarming rate: in recent years the doubling period has shrunk from roughly a year to about three months. Increasing the capacity of deep neural network (DNN) models has delivered improvements in areas ranging from natural language processing to image processing, and DNNs are a key technology for real-time applications such as autonomous driving and robotics. For example, Facebook's research shows that accuracy increases roughly linearly with model size, and can be improved even further by training on larger datasets.


In many cutting-edge applications, model sizes are now growing far faster than Moore's Law, with trillion-parameter models being considered for some workloads. While few production systems will reach that extreme, the effect of parameter count on performance in these examples will ripple through real-world applications. This growth presents a challenge for implementers. If silicon scaling roadmaps cannot be relied upon exclusively, other solutions will be needed to meet the demand for increased model capacity at a cost commensurate with the scale of deployment, and that requires customized architectures that extract the maximum performance from every available transistor.


Figure 1: Model size growth rate (Source: Linley Group). The vertical axis shows parameters on a log scale; separate curves track image-processing and language-processing models.


Deep learning architectures are evolving quickly alongside the growth in parameter counts. While deep neural networks still rely heavily on combinations of traditional convolutional, fully connected, and pooling layers, other structures have emerged, such as the self-attention networks used in natural language processing (NLP). These still require high-speed matrix- and tensor-oriented arithmetic, but their different memory-access patterns can cause trouble for graphics processing units (GPUs) and the accelerators currently on the market.


These architectural changes mean that commonly used metrics such as trillions of operations per second (TOPS) are becoming less meaningful. Processing engines often cannot reach their peak TOPS ratings because the memory and data-transfer infrastructure cannot supply enough throughput without changing how the model is processed. For example, batching input samples is a common approach because it usually increases the parallelism available on many architectures; however, batching also increases response latency, which is often unacceptable in real-time inference applications.
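To make the throughput-versus-latency trade-off concrete, the toy model below uses invented timing constants (a fixed per-batch setup cost plus a per-sample compute cost); it is purely illustrative and does not describe any particular accelerator.

    # Hypothetical latency/throughput model for batched inference.
    # The timing constants are invented for illustration, not measured on any device.

    SETUP_MS = 2.0        # fixed overhead to launch one batch (weight fetch, dispatch)
    PER_SAMPLE_MS = 0.5   # marginal compute time per sample once the batch is running

    def batch_stats(batch_size: int) -> tuple[float, float]:
        """Return (throughput in samples/s, per-sample latency in ms) for one batch size."""
        batch_time_ms = SETUP_MS + PER_SAMPLE_MS * batch_size
        throughput = 1000.0 * batch_size / batch_time_ms
        # Every sample in the batch waits for the whole batch to finish
        # (queueing delay while the batch fills is ignored here).
        return throughput, batch_time_ms

    for size in (1, 8, 32, 128):
        tput, latency = batch_stats(size)
        print(f"batch={size:4d}  throughput={tput:8.1f} samples/s  latency={latency:6.1f} ms")

Larger batches amortize the fixed overhead and raise throughput, but every sample then inherits the full batch completion time as latency, which is exactly the trade-off described above.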


Numerical flexibility is a way to achieve high throughput


One way to improve inference performance, and one that parallels the rapid evolution of architectures, is to adapt the numerical precision of computations to the needs of individual layers. In general, many deep learning models can tolerate significant precision loss and increased quantization error during inference, compared with the precision required for training, which is typically performed with standard single- or double-precision floating-point arithmetic. These formats represent high-precision values over a very wide dynamic range, a property that matters in training because the common backpropagation algorithm makes small changes to many weights on each pass to ensure convergence.
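A tiny experiment, assuming NumPy is available, illustrates the point: at half precision a small weight update can be rounded away entirely, which is why training leans on wide floating-point formats.

    import numpy as np

    # Why training favors wide floating-point formats: a small backpropagation
    # update can vanish entirely when accumulated at reduced precision.
    weight = 1.0
    update = 1e-4   # a typical small gradient step

    print(np.float32(weight) + np.float32(update))   # 1.0001  (the update survives)
    print(np.float16(weight) + np.float16(update))   # 1.0     (the update is rounded away)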


Traditionally, floating-point operations have required extensive hardware support to process high-resolution data types at low latency; the formats were originally developed for scientific applications on high-performance computers, where the overhead of supporting them fully was not a major concern.


Many inference deployments convert models to fixed-point arithmetic, which significantly reduces precision; in these cases the impact on accuracy is usually minimal. In fact, some layers can be converted to extremely limited numeric ranges: even binary or ternary values are viable options.
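As a sketch of what such a conversion can look like, the snippet below applies symmetric per-tensor quantization of weights to 8-bit integers; the scheme and the helper names are generic illustrations, not the method of any particular toolchain.

    import numpy as np

    def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
        """Symmetric per-tensor quantization of float weights to int8."""
        scale = max(np.abs(weights).max(), 1e-12) / 127.0   # map the largest magnitude to +/-127
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover approximate float values to check the accuracy impact."""
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_int8(w)
    print("max quantization error:", np.abs(w - dequantize(q, scale)).max())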


However, integer arithmetic is not always an efficient solution. Some filters and data layers require high dynamic range, and to cover it integer hardware may need to process 24-bit or 32-bit words, which consume far more resources than the 8-bit or 16-bit integer types easily supported in typical single-instruction, multiple-data (SIMD) accelerators.


One compromise is to use a narrow floating-point format, such as one that fits in a 16-bit word. This choice enables greater parallelism, but it does not overcome the performance barrier inherent in most floating-point data types: both parts of the format must be adjusted after every calculation. Because the most significant bit of the mantissa is not stored explicitly, the mantissa has to be realigned with a series of logical shifts and the exponent adjusted to match, so that the implied leading 1 is always present. The benefit of this normalization is that any numerical value has exactly one representation, which is important for software compatibility in user applications; for many signal-processing and AI inference routines, however, it is unnecessary.
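To make the normalization step concrete, the sketch below pulls apart the IEEE 754 single-precision fields and shows that when the product of two significands leaves the [1, 2) range, the mantissa must be shifted and the exponent incremented; it is a generic bit-level illustration, not a model of any specific hardware pipeline.

    import struct

    def fields(x: float) -> tuple[int, int, int]:
        """Split an IEEE 754 single-precision value into sign, biased exponent, stored mantissa."""
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
        return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

    # 1.5 * 1.5 = 2.25: the product of the significands falls outside [1, 2),
    # so the hardware shifts the mantissa right by one bit and increments the exponent.
    a = b = 1.5
    print(fields(a))      # (0, 127, 0x400000): significand 1.5, unbiased exponent 0
    print(fields(a * b))  # (0, 128, 0x100000): significand 1.125, unbiased exponent 1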


Much of the hardware overhead for these operations can be avoided by not normalizing the mantissa and adjusting the exponent after each calculation. This is the approach taken by block floating-point arithmetic, a data format that has been used in standard fixed-point digital signal processing (DSP) to improve performance in audio processing algorithms for mobile devices, digital subscriber line (DSL) modems, and radar systems.

Figure 2: Block floating-point calculation example, in which a block of mantissas shares a single block exponent


With block floating-point arithmetic, there is no need to left-align each mantissa. The data elements used in a series of calculations can share a single exponent, a change that simplifies the design of the execution pipeline, and the precision lost by rounding values that occupy similar dynamic ranges is minimal. The appropriate range is selected for each block of calculations at design time. Once the block completes, an exit stage can round and normalize the results so that they can be used as regular floating-point values wherever needed.
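A minimal sketch of the idea, making no assumptions about how the MLP implements it internally: each block of values is quantized to integer mantissas that share one exponent, the dot product then runs entirely in integer arithmetic, and a single exponent fix-up at the end converts the accumulator back to an ordinary float. The mantissa width and helper names are chosen purely for illustration.

    import numpy as np

    MANTISSA_BITS = 7   # illustrative signed mantissa width

    def to_block_float(x: np.ndarray) -> tuple[np.ndarray, int]:
        """Quantize a block of floats to integer mantissas that share one block exponent."""
        # Choose the exponent so the largest magnitude in the block still fits the mantissa width.
        block_exp = int(np.ceil(np.log2(np.abs(x).max() + 1e-30))) - MANTISSA_BITS
        mantissas = np.round(x / 2.0 ** block_exp).astype(np.int32)
        return mantissas, block_exp

    def block_float_dot(a: np.ndarray, b: np.ndarray) -> float:
        """Dot product using integer multiply-accumulate and a single exit-stage rescale."""
        ma, ea = to_block_float(a)
        mb, eb = to_block_float(b)
        acc = int(np.dot(ma, mb))          # pure integer arithmetic, no per-step normalization
        return acc * 2.0 ** (ea + eb)      # one exponent adjustment converts back to a regular float

    a, b = np.random.randn(64), np.random.randn(64)
    print("exact dot product:   ", float(np.dot(a, b)))
    print("block floating point:", block_float_dot(a, b))

Because every element in the block shares the exponent, the inner loop needs no per-operation shifting, which is the hardware saving the text describes.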


Support for the block floating-point format is one of the features of the Machine Learning Processor (MLP), the highly flexible arithmetic logic unit provided in Achronix's Speedster®7t FPGA devices and Speedcore™ eFPGA architecture. The MLP is optimized for the dot products and similar matrix operations that artificial intelligence applications require, and its block floating-point support delivers a substantial improvement over traditional floating point: 16-bit block floating-point operations run at eight times the throughput of traditional half-precision floating-point operations, matching the speed of 8-bit integer operations with only a 15% increase in active power consumption compared with integer-only operation.


Another data type that may prove important is the TensorFloat-32 (TF32) format, which reduces precision relative to standard single-precision floating point while maintaining a high dynamic range. TF32 lacks the optimized throughput of block-exponent processing, but it is useful in applications where easy portability of models created in TensorFlow and similar environments matters. The high degree of flexibility provided by the MLP in the Speedster7t FPGA makes it possible to process TF32 arithmetic using its 24-bit floating-point mode. In addition, the MLP's high configurability means that a new, block floating-point version of TF32 can be supported, in which four samples share the same exponent; this block floating-point TF32 offers twice the density of traditional TF32.
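As a rough illustration of what TF32 trades away, the snippet below keeps only the 10 mantissa bits that TF32 retains from a float32 value (the 8-bit exponent is unchanged); it truncates rather than rounding to nearest and is not a model of any particular device.

    import struct

    TF32_MANTISSA_BITS = 10               # float32 stores 23 mantissa bits; TF32 keeps the top 10
    DROPPED_BITS = 23 - TF32_MANTISSA_BITS

    def truncate_to_tf32(x: float) -> float:
        """Zero the low mantissa bits of a float32 value, leaving TF32-level precision.
        (Simple truncation for illustration; real hardware typically rounds to nearest.)"""
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
        bits &= ~((1 << DROPPED_BITS) - 1)
        return struct.unpack("<f", struct.pack("<I", bits))[0]

    x = 1.2345678
    print(x, "->", truncate_to_tf32(x))   # dynamic range preserved, fine-grained precision dropped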

Figure 3: Structure of the Machine Learning Processor (MLP), shown serving wireless and AI/ML applications alongside a neural network with input values, an input layer, two hidden layers, and an output layer


Processing flexibility optimizes algorithm support


While the machine learning processors' support for multiple data types is critical for inference applications, their power is only fully realized when they are part of an FPGA fabric. The ability to define both the interconnect and the arithmetic logic is what sets FPGAs apart from most architectures and simplifies the task of building a balanced design: designers can not only build direct support for custom data types, but also define the interconnect structure best suited to moving data into and out of the processing engines. Reprogrammability further provides a way to keep pace with the rapid evolution of artificial intelligence, since changes to the data flow of a custom layer can be accommodated simply by modifying the FPGA's logic.


A major advantage of FPGAs is that functionality can be moved easily between the optimized embedded compute engines and the programmable logic built from lookup tables. Some functions map well onto embedded compute engines such as the Speedster7t MLP; higher-precision arithmetic in particular is best assigned to the MLPs, because the functional units needed for operations such as high-speed multiplication grow steeply in size as bit width increases.
