The fusion of multi-level memory and analog in-memory computing solves edge AI challenges

Publisher: EEWorld | Updated: 2020-07-13 | Source: EEWORLD

Machine learning and deep learning have become an integral part of our lives. Artificial intelligence (AI) applications using natural language processing (NLP), image classification, and object detection are deeply embedded in many of the devices we use. Most AI applications are well served by cloud engines, such as getting word predictions when replying to emails in Gmail.

 

While we enjoy the benefits of these AI applications, this approach brings challenges in privacy, power consumption, latency, and cost. These issues can be addressed by a local processing engine that performs some or all of the computation (inference) at the source of the data. Traditional digital neural networks suffer from a memory power bottleneck that makes this difficult to achieve. To address the problem, multi-level memory can be combined with analog in-memory computing so that the processing engine meets power budgets in the milliwatt (mW) to microwatt (μW) range, enabling AI inference at the edge of the network.

 

Challenges of using cloud engines to serve AI applications

 

If AI applications are served through cloud engines, users must upload data to the cloud, either actively or passively. A computing engine in the cloud processes the data, generates predictions, and sends the results back to downstream users. This process presents the following challenges:

 

Figure 1: Data transmission from edge to cloud

 

1. Privacy issues: With always-on, always-aware devices, personal data and/or confidential information are at risk of being misused during upload or during the retention period in the data center.

2. Unnecessary power consumption: If every bit of data is transmitted to the cloud, power is consumed by the hardware, radios, transmission devices, and unnecessary calculations in the cloud.

3. Latency of small-batch inference: When data originates at the edge, it can take a second or more to receive a response from the cloud. Delays above 100 milliseconds are noticeable and make for a poor user experience.

4. The data economy demands value creation: Sensors are ubiquitous and inexpensive, but they generate enormous amounts of data, and it is not cost-effective to upload every bit of it to the cloud for processing.

 

To address these challenges with a local processing engine, the neural network that will perform inference must first be trained on a dataset chosen for the target use case. Training typically requires high-performance computing (and memory) resources and floating-point arithmetic, so the training portion of a machine learning solution is still implemented on a public or private cloud (or a local GPU, CPU, or FPGA farm) and combined with the dataset to produce the best neural network model. Inference on a neural network model does not require backpropagation, so once the model is ready it can be deeply optimized for local hardware with a small computing engine. An inference engine typically requires a large number of multiply-accumulate (MAC) units, followed by an activation stage (such as a rectified linear unit (ReLU), sigmoid, or hyperbolic tangent, depending on the complexity of the neural network model) and pooling layers between layers.
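
As a rough illustration of that pipeline (a minimal sketch with invented layer sizes and random data, not the article's model), the stages an inference engine must provide chain together as MAC-heavy matrix products, an activation function, and pooling:

```python
import numpy as np

def relu(x):
    # Activation stage: rectified linear unit
    return np.maximum(x, 0.0)

def max_pool_1d(x, window=2):
    # Pooling stage: keep the largest value in each window
    return x.reshape(-1, window).max(axis=1)

# Hypothetical layer sizes; real models hold millions of weights
rng = np.random.default_rng(0)
w1 = rng.standard_normal((64, 128))   # first-layer weights
w2 = rng.standard_normal((64, 10))    # second-layer weights (pooling halves 128 to 64)
x = rng.standard_normal(64)           # input vector from a sensor

h = max_pool_1d(relu(x @ w1))         # MAC engine -> activation -> pooling
y = h @ w2                            # second layer produces the prediction scores
print(y.shape)                        # (10,)
```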

 

Most neural network models require a large number of MAC operations. For example, even the relatively small "1.0 MobileNet-224" model has 4.2 million parameters (weights) and requires up to 569 million MAC operations to perform one inference. Because MAC operations dominate such models, they are the natural place to focus when looking for opportunities to build a better machine learning solution. Figure 2 below shows a simple fully connected two-layer network. The input neurons (data) are processed by the first layer of weights; the output neurons of the first layer are processed by the second layer of weights, which produces the prediction (for example, whether the model can find a cat face in a given image). These neural network models use a "dot product" operation to calculate each neuron in each layer, as shown in the following formula:

$Y_j = \sum_i W_{ij} X_i$ (for simplicity, the bias term is omitted from the formula).

 

 

 

Figure 2: A fully connected two-layer neural network
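
To make the dot-product formula concrete, here is a minimal sketch (with made-up layer sizes, not taken from the article) of the per-neuron computation, which also shows why a dense layer's MAC count equals its weight count:

```python
import numpy as np

# Hypothetical fully connected layer: 64 input neurons, 10 output neurons.
rng = np.random.default_rng(1)
X = rng.standard_normal(64)          # input neurons X_i
W = rng.standard_normal((64, 10))    # W[i, j]: weight from input i to output neuron j

# Each output neuron is a dot product: Y_j = sum_i W[i, j] * X[i]  (bias omitted)
Y = X @ W

# One multiply-accumulate per weight, so the layer needs W.size MAC operations.
print(Y.shape, W.size)               # (10,) 640
```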

 

In digital neural networks, the weights and input data are stored in DRAM or SRAM and must be moved to a MAC engine for each inference. As the figure below shows, most of the power in this approach is consumed fetching the model parameters and input data into the ALU, where the actual MAC operation takes place. From an energy perspective, a MAC operation built from digital logic gates consumes roughly 250 femtojoules (fJ), but the energy spent moving the data exceeds the computation itself by about two orders of magnitude, landing in the range of 50 to 100 picojoules (pJ). To be fair, many design techniques exist to minimize data movement between memory and the ALU, but the digital approach as a whole is still constrained by the von Neumann architecture, which leaves plenty of room to cut wasted power. What if the energy of an entire MAC operation could be reduced from roughly 100 pJ to a fraction of a picojoule?
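
Combining the article's per-operation figures with the 569 million MACs quoted above for "1.0 MobileNet-224" gives a back-of-the-envelope sense of how strongly data movement dominates the energy of a single inference (an illustrative estimate, not a measurement):

```python
MACS_PER_INFERENCE = 569e6        # "1.0 MobileNet-224", from the article
E_COMPUTE_PER_MAC  = 250e-15      # ~250 fJ for the digital MAC itself
E_TRANSFER_PER_MAC = 100e-12      # ~100 pJ worst case for fetching weights and data

compute_energy  = MACS_PER_INFERENCE * E_COMPUTE_PER_MAC    # ~0.14 mJ per inference
transfer_energy = MACS_PER_INFERENCE * E_TRANSFER_PER_MAC   # ~57 mJ per inference

print(f"compute:  {compute_energy * 1e3:.2f} mJ")
print(f"transfer: {transfer_energy * 1e3:.2f} mJ")
print(f"ratio:    {transfer_energy / compute_energy:.0f}x")  # 400x
```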

 

Eliminating the memory bottleneck while reducing power consumption

 

Performing inference at the edge becomes feasible if the memory itself can be used to eliminate this bottleneck. An in-memory computing approach minimizes the amount of data that must be moved, which in turn eliminates the energy wasted on data transfer. Flash cells also draw very little active current during operation and consume almost no energy in standby, further reducing energy consumption.

 

 

 

Figure 3: Memory bottleneck in machine learning computations

 

Source: Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” 2016 International Symposium on Computer Architecture.

  

An example of this approach is memBrain™ technology from Silicon Storage Technology (SST), a subsidiary of Microchip. The solution is based on SST's SuperFlash® memory technology, which has become a recognized standard of multi-level memory for microcontroller and smart-card applications. The solution uses an in-memory computing architecture in which computation is performed where the weights are stored: the weights never move, and only the input data has to travel from the input sensors (such as cameras and microphones) to the memory array, eliminating the memory bottleneck in MAC computation.

 

This memory concept rests on two basic principles: (a) the analog current response of a transistor is determined by its threshold voltage (Vt) and the input data, and (b) Kirchhoff's current law, which states that the algebraic sum of the currents flowing into a node of a conductor network is zero. It also helps to understand the basic non-volatile memory (NVM) bit cell used in this multi-level memory architecture. The figure below (Figure 4) shows two ESF3 (3rd-generation embedded SuperFlash) bit cells with a shared erase gate (EG) and source line (SL). Each bit cell has five terminals: control gate (CG), word line (WL), erase gate (EG), source line (SL), and bit line (BL). The cell is erased by applying a high voltage to EG, programmed by applying high/low voltage bias signals to WL, CG, BL, and SL, and read by applying low voltage bias signals to WL, CG, BL, and SL.

 

 

Figure 4: SuperFlash ESF3 bit cells
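
A highly simplified way to see how those two principles produce a multiply-accumulate is to treat each cell on a bit line as a programmable conductance: the cell current is roughly conductance times input voltage, and Kirchhoff's current law sums all cell currents into a single bit-line current. The sketch below is an idealized model with invented numbers, not a circuit-accurate description of the ESF3 cell:

```python
import numpy as np

# Idealized crossbar column: each cell's programmed conductance encodes a weight,
# and the voltage driven onto each row encodes an input value.
conductances   = np.array([2.0e-6, 0.5e-6, 1.5e-6, 0.0])  # siemens (the "weights")
input_voltages = np.array([0.3, 0.8, 0.0, 0.5])           # volts   (the "inputs")

# Ohm's law per cell, then Kirchhoff's current law on the shared bit line:
cell_currents   = conductances * input_voltages
bitline_current = cell_currents.sum()   # the analog dot product read at the column

print(f"bit-line current = {bitline_current:.3e} A")
```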

 

This memory architecture allows individual bit cells to be programmed to different Vt levels by fine-tuning the programming operation. The memory technology uses an intelligent algorithm to adjust the floating-gate (FG) voltage of the memory cell so that it produces a specific current response for a given input voltage. Depending on the requirements of the end application, the cells can be operated in the linear region or the subthreshold region.
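
The article does not describe SST's tuning algorithm in detail, but iterative program-and-verify loops are the usual way such fine-grained Vt targets are reached. The sketch below is only a conceptual illustration with an invented simulated cell and invented step sizes, not the actual memBrain procedure:

```python
class SimulatedCell:
    """Toy stand-in for a flash cell: its read current shifts a little per pulse."""
    def __init__(self, current_ua=3.0):
        self.current_ua = current_ua
    def read_current(self):
        return self.current_ua
    def apply_program_pulse(self, direction, strength):
        self.current_ua += direction * min(strength, 0.2)   # bounded shift per pulse

def program_to_target(cell, target_ua, tolerance_ua=0.05, max_pulses=100):
    # Conceptual program-and-verify loop: pulse, re-read, repeat until within tolerance.
    for pulses in range(1, max_pulses + 1):
        error = cell.read_current() - target_ua              # verify step
        if abs(error) <= tolerance_ua:
            return pulses                                    # converged on the target level
        cell.apply_program_pulse(direction=-1 if error > 0 else +1,
                                 strength=abs(error))        # program step
    return None                                              # did not converge

print(program_to_target(SimulatedCell(), target_ua=1.2))     # number of pulses used
```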

 

Figure 5 illustrates the ability to store multiple levels in a single memory cell. Suppose, for example, that we want to store a 2-bit integer value in each cell. Every cell in the memory array must then hold one of four possible 2-bit values (00, 01, 10, 11), which means programming it to one of four Vt levels with sufficient spacing between them. The four I-V curves below correspond to these four states, and the cell's current response depends on the voltage applied to the CG.

 

 

 

Figure 5: Programming Vt voltage in ESF3 cell
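
As a toy illustration of that mapping (the voltage values are invented for the example, not device specifications), each 2-bit value simply selects one of four well-separated target Vt levels:

```python
# Hypothetical target Vt levels (volts) for the four 2-bit states; real values depend
# on the cell and on the spacing needed for a reliable read.
VT_LEVELS = {0b00: 1.0, 0b01: 1.8, 0b10: 2.6, 0b11: 3.4}

def target_vt(value_2bit: int) -> float:
    # Map a 2-bit integer (0..3) to the Vt level the cell should be programmed to.
    return VT_LEVELS[value_2bit & 0b11]

print([target_vt(v) for v in range(4)])   # [1.0, 1.8, 2.6, 3.4]
```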

 

The weights of the trained model are programmed into the floating-gate Vt of the memory cells, so all the weights of a given layer (for example, a fully connected layer) can be programmed onto a matrix-like memory array, as shown in Figure 6. For inference, digital inputs (for example, from a digital microphone) are first converted to analog signals with a digital-to-analog converter (DAC) and applied to the memory array. The array then performs thousands of MAC operations in parallel on the input vector, and the resulting outputs pass through the activation stage of the corresponding neurons before being converted back to digital signals with an analog-to-digital converter (ADC). The digital signals are then pooled before entering the next layer.

 

 

 

Figure 6: Weight matrix memory array for inference
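
The following sketch models that signal chain numerically (idealized DAC/ADC quantization, random weights, and arbitrary ranges chosen for the example; it is not a model of the actual silicon):

```python
import numpy as np

def quantize(x, bits, lo, hi):
    # Idealized DAC/ADC: clip to the converter range and snap to 2**bits levels.
    levels = 2 ** bits - 1
    x = np.clip(x, lo, hi)
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

rng = np.random.default_rng(2)
weights = rng.uniform(-1, 1, size=(32, 16))    # programmed into the array as cell states
digital_input = rng.uniform(0, 1, size=32)     # e.g. features from a digital microphone

analog_input   = quantize(digital_input, bits=8, lo=0.0, hi=1.0)    # DAC
analog_outputs = analog_input @ weights        # all MACs happen in the array at once
digital_out    = quantize(analog_outputs, bits=8, lo=-8.0, hi=8.0)  # ADC
activations    = np.maximum(digital_out, 0.0)  # activation stage (ReLU here)

print(activations.shape)                       # (16,) values handed to pooling / next layer
```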

 

This type of multi-level memory architecture is highly modular and flexible. Many memory slices can be combined into a large model that mixes weight matrices and neurons, as shown in Figure 7. In this example, an M x N configuration of slices is connected together through analog and digital interfaces between the slices.

 

 

Figure 7: Modular structure of memBrain™
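
Conceptually, this tiling splits a large weight matrix into slice-sized blocks and sums the partial results of slices that share the same output columns. The sketch below uses made-up slice dimensions purely to show the bookkeeping:

```python
import numpy as np

SLICE_ROWS, SLICE_COLS = 64, 64       # hypothetical capacity of one memory slice

def tiled_matvec(x, W):
    # Compute x @ W by splitting W into slice-sized tiles, the way a large layer
    # would be spread across an M x N grid of memory slices.
    out = np.zeros(W.shape[1])
    for r in range(0, W.shape[0], SLICE_ROWS):            # tiles along the input dimension
        for c in range(0, W.shape[1], SLICE_COLS):        # tiles along the output dimension
            tile = W[r:r + SLICE_ROWS, c:c + SLICE_COLS]  # weights held by one slice
            out[c:c + SLICE_COLS] += x[r:r + SLICE_ROWS] @ tile   # partial sums combine
    return out

rng = np.random.default_rng(3)
W = rng.standard_normal((200, 150))   # a layer larger than any single slice
x = rng.standard_normal(200)
assert np.allclose(tiled_matvec(x, W), x @ W)   # tiling reproduces the direct product
```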

 

So far we have mainly discussed the silicon implementation of this architecture. A software development kit (SDK) is provided to help develop the solution: in addition to supporting the chip, the SDK assists in developing the inference engine. The SDK flow is agnostic to the training framework: users can create floating-point neural network models in TensorFlow, PyTorch, or any other framework of their choice. Once the model is created, the SDK helps quantize the trained neural network model and map it onto the memory array, where vector-matrix multiplications can be performed with input vectors coming from a sensor or a computer.

 

 

Figure 8: memBrain™ SDK process
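
The article does not document the SDK's API, so the snippet below only mirrors the quantize-and-map step in spirit: a generic, framework-agnostic post-training quantization that maps floating-point weights onto the small number of discrete levels a multi-level cell can hold (the 4-bit choice and the scaling scheme are assumptions for illustration):

```python
import numpy as np

def quantize_weights(w, bits=4):
    # Map floating-point weights onto 2**bits discrete signed levels plus a scale factor.
    half_range = 2 ** (bits - 1)
    scale = np.abs(w).max() / (half_range - 1)            # symmetric range around zero
    codes = np.clip(np.round(w / scale), -half_range, half_range - 1)
    return codes.astype(np.int8), scale                   # integer codes + dequantization scale

rng = np.random.default_rng(4)
float_weights = rng.standard_normal((8, 8)) * 0.3
codes, scale = quantize_weights(float_weights, bits=4)
recovered = codes * scale                                 # what the array effectively computes with
print(float(np.max(np.abs(recovered - float_weights))))   # worst-case quantization error
```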

 

 

 The advantages of a multi-level memory approach combined with in-memory computing capabilities include:

 

1. Ultra-low power: The technology is designed for low-power applications. The first power advantage is that in-memory computing wastes no energy transferring data and weights from SRAM/DRAM during calculations. The second is that the flash cells operate at very low current in subthreshold mode, so active power consumption is very low. The third is that standby consumption is almost zero, because the non-volatile memory cells need no power to retain the data of an always-on device. The approach is also well suited to exploiting sparsity in the weights and input data: if an input value or a weight is zero, the corresponding bit cell does not activate (see the sketch after this list).

2. Reduced footprint: The technology uses a split-gate (1.5T) cell architecture, whereas the SRAM cell of a digital implementation is based on a 6T architecture, so each cell is much smaller than a 6T SRAM cell. Moreover, a single cell can store a full 4-bit integer value instead of the 4 × 6 = 24 transistors an SRAM-based design would need, which substantially reduces the on-chip footprint.

3. Reduced development cost: Because of the memory bottleneck and the limits of the von Neumann architecture, many dedicated devices (such as NVIDIA's Jetson or Google's TPU) tend to improve performance per watt by shrinking the process geometry, which is an expensive way to solve edge computing challenges. By combining analog in-memory computing with multi-level memory, computation is done on-chip inside the flash cells, which allows larger geometries to be used while reducing mask costs and shortening development cycles.
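
As a very rough software analogy of that sparsity benefit (the array exploits it implicitly in hardware; the loop below only makes the idea concrete with invented data), a MAC is simply skipped whenever either operand is zero:

```python
import numpy as np

def sparse_mac(x, w):
    # Accumulate only where both the input and the weight are non-zero;
    # zero operands would leave the corresponding bit cells inactive.
    acc, macs_done = 0.0, 0
    for xi, wi in zip(x, w):
        if xi == 0.0 or wi == 0.0:
            continue                     # no cell activity, no energy spent
        acc += xi * wi
        macs_done += 1
    return acc, macs_done

x = np.array([0.0, 0.7, 0.0, 0.2, 0.0, 0.9])
w = np.array([0.4, 0.0, 0.1, 0.5, 0.0, 0.3])
acc, macs = sparse_mac(x, w)
print(acc, f"({macs} of {len(x)} MACs actually performed)")   # ~0.37 (2 of 6)
```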

 

The promise of edge computing applications is huge, but power and cost challenges must be addressed before edge computing can take off. A memory approach that performs on-chip computation inside flash cells removes the main hurdles, and it builds on a production-proven, recognized standard of multi-level memory technology that has been optimized for machine learning applications.


