Information security is an active research area in computer science, and data encryption is one of its principal tools. With the rapid development of programmable logic and the continuing emergence of high-speed integrated circuits, implementing encryption algorithms on FPGAs has attracted growing attention [1][2]. Compared with traditional software encryption, hardware encryption offers: (1) better security, since it is harder to attack; (2) faster computation and higher efficiency; (3) low cost and reliable performance. An important performance indicator of an encryption system is data throughput, calculated as (data length M / number of clocks N) × clock frequency F. Improving throughput is the key to improving the performance of the encryption system and a central concern in the hardware implementation of encryption algorithms.
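As a small illustration of the throughput formula, the sketch below evaluates (M / N) × F for two operating points. The function name `throughput_bps` and the chosen operating points (a 128-bit block at 33 MHz, taking 10 or 5 clocks per block) are assumptions made for the example.

```python
def throughput_bps(block_bits, clocks_per_block, clock_hz):
    """Throughput = (M / N) * F, in bits per second."""
    if clocks_per_block <= 0:
        raise ValueError("clocks_per_block must be positive")
    return block_bits / clocks_per_block * clock_hz

# Illustrative operating points: one 128-bit block at a 33 MHz clock.
ten_clocks = throughput_bps(128, 10, 33e6)   # one block every 10 clocks
five_clocks = throughput_bps(128, 5, 33e6)   # one block every 5 clocks
print(f"{ten_clocks / 1e6:.1f} Mbit/s vs {five_clocks / 1e6:.1f} Mbit/s")
```

At a fixed clock frequency, halving the number of clocks per block doubles the throughput, which is the lever this design relies on.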
The AES algorithm is widely used as a replacement for DES. Its hardware implementation has been discussed extensively, mostly with the aim of raising the clock frequency of the algorithm to improve throughput. In a real system, however, the global clock frequency is usually kept low to ensure the stability of the whole encryption system, and it cannot reach the frequency the algorithm achieves in simulation. For example, the clock of a PCI interface circuit runs at only 33 MHz, so the actual throughput remains low. Based on the structure of the AES algorithm and the characteristics of the hardware system, this paper proposes a design method for a fast AES IP core. The method uses an optimized round function and pipelining to reduce the number of clocks per block operation, which yields higher throughput and faster transmission at a low system clock frequency.
1 AES Algorithm
AES is the Advanced Encryption Standard issued by the National Institute of Standards and Technology (NIST) of the United States [4]. In October 2000, NIST announced that the Rijndael algorithm, submitted by the Belgian cryptographers Joan Daemen and Vincent Rijmen, had been selected as the Advanced Encryption Standard. The algorithm is simple in design. Unlike public-key cryptographic algorithms, it involves no complex multiplication operations, is easy to implement, and is highly flexible. The good parallelism of the round function lends itself to hardware design and implementation. As proposed, the algorithm is an iterative block cipher whose block length and key length can be independently specified as 128, 192, or 256 bits. This paper mainly discusses the case where both the block length and the key length are 128 bits.
The AES algorithm arranges each input block of plaintext (or ciphertext) as 16 bytes and, after an initial Add Round Key transformation, performs 10 rounds of iteration. The first 9 rounds are identical: each applies Substitute Byte, Shift Row, Mix Column, and Add Round Key in turn. The last round omits Mix Column. Decryption is similar to encryption, but the transformations and their execution order differ, so the encryption and decryption processes must be implemented separately. Figure 1 shows the encryption and decryption process of the AES algorithm. For a detailed description of the AES algorithm, please refer to reference [4].
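The round structure just described can be written out as a short schedule. This is only a sketch of the ordering of transformations for the 128-bit case. The function name `encryption_schedule` is an assumption for illustration, and no actual encryption is performed.

```python
def encryption_schedule(rounds=10):
    """List the transformations applied at each step of 128-bit AES encryption."""
    schedule = [["Add Round Key"]]            # initial key addition
    for r in range(1, rounds + 1):
        ops = ["Substitute Byte", "Shift Row"]
        if r < rounds:                        # the last round skips Mix Column
            ops.append("Mix Column")
        ops.append("Add Round Key")
        schedule.append(ops)
    return schedule

for step, ops in enumerate(encryption_schedule()):
    print(step, " -> ".join(ops))
```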
2 Optimization Design of AES Algorithm
2.1 Hardware Selection
Cyclone is Altera's lowest-cost SRAM-based FPGA family, with capacities from 2910 to 20060 logic elements (LEs) and up to 288 Kbit of M4K memory blocks. Each LE contains a four-input lookup table (LUT), a programmable register, and a carry chain with carry-select capability. An LE can implement any function of four input variables, and the device can perform a large number of logic operations, which makes it well suited as a hardware platform for encryption algorithms. The design is described in Verilog HDL, developed with Quartus II 4.2, and targets the Cyclone EP1C12Q240C8 FPGA.
2.2 Optimization Design of Key Expansion Unit
Key expansion uses the initial key as the seed key and generates the subkeys for the 10 rounds of iteration through byte substitution, byte shift, round constant calculation, word XOR, and related steps. Some literature proposes performing key expansion in parallel with encryption, which saves FPGA memory resources. However, the author believes that in such a scheme the key expansion logic runs continuously while the algorithm operates, which increases the dynamic power consumption of the FPGA chip. In addition, AES decryption starts from the last round's subkey, so decryption cannot begin until all subkeys have been expanded, which constrains the implementation of the decryption process. The initial key usually does not change frequently, so the key expansion result can be shared when encrypting or decrypting multiple data blocks. Moreover, the Cyclone device has ample memory resources for storing the subkeys. This paper therefore adopts the more common approach: all subkeys are saved in RAM after expansion and read from RAM in sequence when needed. This method is not constrained by the encryption or decryption flow, is flexible, and is well suited to FPGA implementation of encryption algorithms.
Reading a subkey from RAM takes time. To avoid the delay this would add to the first Add Round Key step, the first subkey (the initial key) and the last subkey (the first subkey used in decryption) are stored in two sets of registers in addition to being written to RAM. As shown in Figure 2, the register contents are used directly to enter the first round of iteration during encryption/decryption, ensuring that the algorithm completes its 10 rounds of iteration within 10 clocks and reducing the time spent on Add Round Key.
Although holding these two subkeys costs about 256 additional registers, it makes pipelined operation of the algorithm easier to implement, which helps overall performance considerably.
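The storage scheme of this section can be modeled in software. The sketch below expands a 128-bit seed key into 11 round keys, keeps them in a list standing in for the subkey RAM, and copies the first and last entries into two variables standing in for the registers. All names (`gf_mul`, `build_sbox`, `expand_key`, `subkey_ram`, `first_key_reg`, `last_key_reg`) are assumptions for illustration. The seed key is the example key from the FIPS-197 specification, not a value from this paper, and the S-Box is derived arithmetically here only to keep the example self-contained.

```python
def gf_mul(x, y):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        if x & 0x100:
            x ^= 0x11B
        y >>= 1
    return r

def build_sbox():
    """Derive the AES S-Box: GF(2^8) inverse followed by the affine map."""
    table = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        y = inv
        for s in range(1, 5):             # XOR with four left rotations
            y ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
        table.append(y ^ 0x63)
    return table

SBOX = build_sbox()

def expand_key(key):
    """Expand a 16-byte seed key into 11 round keys of 16 bytes each."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]        # byte shift (rotate one byte)
            temp = [SBOX[b] for b in temp]    # byte substitution
            temp[0] ^= rcon                   # round constant
            rcon = gf_mul(rcon, 0x02)
        words.append([p ^ q for p, q in zip(words[i - 4], temp)])  # word XOR
    return [bytes(b for w in words[4 * r:4 * r + 4] for b in w)
            for r in range(11)]

seed = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # FIPS-197 example key
subkey_ram = expand_key(seed)      # stands in for the RAM holding all subkeys
first_key_reg = subkey_ram[0]      # register copy that starts encryption
last_key_reg = subkey_ram[10]      # register copy that starts decryption
print(last_key_reg.hex())
```

Encryption would read `subkey_ram` in ascending order and decryption in descending order, so both directions can start from a register without waiting for a RAM read.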
2.3 Optimization of Round Function Design
Optimizing the round function to reduce its delay is the key to raising the clock frequency of the algorithm. This paper does so by optimizing the Substitute Byte, Shift Row, and Mix Column transformations of the round function.
Substitute Byte (S-Box) in the round function is a nonlinear operation on a single byte. There are usually two ways to describe the S-Box in Verilog HDL: (1) a behavioral description using case statements, which occupies LE resources after synthesis; (2) using the memory resources of the FPGA. The AES S-Box is a lookup table with 8 inputs and 8 outputs. Implemented in LEs, each S-Box requires 208 LEs on the Cyclone chip. Parallel operation of the AES algorithm requires 32 S-Boxes, 6656 LEs in total, which not only occupies a large amount of hardware resources but also complicates the structure and increases the delay. Implementing the S-Box in memory requires no other hardware resources and reduces the delay, so it is the better choice. This paper adopts this method and makes full use of the device resources: each memory block in the Cyclone device can be configured as a 256×16-bit ROM, so the encryption and decryption S-Boxes are placed in the same ROM. The encryption S-Box content occupies the first 8 bits of each ROM word, and the decryption S-Box content occupies the last 8 bits. Compared with a design that uses a separate ROM for each S-Box, this halves the number of memory blocks and greatly improves resource utilization.
The hardware implementation of Shift Row is very simple, since it is only a wiring operation. To further reduce the delay caused by wiring, Substitute Byte and Shift Row are merged into one step, so that the delay of the two parts depends only on the S-Box ROM.
The Mix Column transformation is defined as a multiplication of four-term polynomials with coefficients in the finite field GF(2^8), which can be written as a matrix multiplication [4]. The input column vector is (X0, X1, X2, X3) and the output column vector is (Y0, Y1, Y2, Y3). The encryption process uses multiplication by 01, 02, and 03 over GF(2^8). The decryption process is more complex and uses multiplication by 09, 0E, 0B, and 0D over GF(2^8).
In order to simplify the design for FPGA implementation, the matrix multiplication can be expanded and organized to obtain the following results:
Encrypt Mix Column:
Decrypting Mix Column:
Here a is a transformation of a single byte b (multiplication by 02 over GF(2^8)), and its Verilog HDL description is:
a = {b[6:0], 1'b0} ^ (8'h1b & {8{b[7]}});
After this rearrangement, the implementation of Mix Column is simplified and hardware resources are saved. Add Round Key is a simple XOR and occupies few resources. After optimization, the maximum delay of the round function is only 8.6 ns, which makes it possible to raise the clock frequency of the whole design.
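The round function pieces described in this section can be checked with a small software model. The sketch below builds a 256×16-bit ROM image holding both S-Boxes, merges Substitute Byte and Shift Row into one indexed lookup, and expresses Mix Column using only XOR and the byte function a. Several points are assumptions rather than the paper's own design: the "first 8 bits" of a ROM word are taken to be the high byte, the state is stored column by column, and the particular grouping of terms in `mix_column` and `inv_mix_column` is one possible expansion, not necessarily the one used in the original equations. All Python names are hypothetical, and the test state is the round 1 example from the FIPS-197 specification.

```python
def gf_mul(x, y):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        if x & 0x100:
            x ^= 0x11B
        y >>= 1
    return r

def build_sbox():
    """Derive the AES S-Box: GF(2^8) inverse followed by the affine map."""
    table = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        y = inv
        for s in range(1, 5):             # XOR with four left rotations
            y ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
        table.append(y ^ 0x63)
    return table

SBOX = build_sbox()
INV_SBOX = [0] * 256
for i, v in enumerate(SBOX):
    INV_SBOX[v] = i

# 256 x 16-bit ROM image. Assumption: the "first 8 bits" are the high byte.
ROM = [(SBOX[i] << 8) | INV_SBOX[i] for i in range(256)]

def rom_lookup(addr, decrypt=False):
    """Read one ROM word and select the encryption or decryption half."""
    word = ROM[addr]
    return word & 0xFF if decrypt else word >> 8

def sub_shift(state):
    """Substitute Byte and Shift Row in one pass (encryption direction only).
    state holds 16 bytes, index = row + 4 * column."""
    return [rom_lookup(state[r + 4 * ((c + r) % 4)])
            for c in range(4) for r in range(4)]

def a(b):
    """Model of a = {b[6:0], 1'b0} ^ (8'h1b & {8{b[7]}})."""
    return ((b << 1) & 0xFF) ^ (0x1B if b & 0x80 else 0)

def mix_column(x):
    x0, x1, x2, x3 = x
    return [a(x0 ^ x1) ^ x1 ^ x2 ^ x3,
            a(x1 ^ x2) ^ x0 ^ x2 ^ x3,
            a(x2 ^ x3) ^ x0 ^ x1 ^ x3,
            a(x3 ^ x0) ^ x0 ^ x1 ^ x2]

def inv_mix_column(x):
    x0, x1, x2, x3 = x
    t = a(a(a(x0 ^ x1 ^ x2 ^ x3)))   # 08 times the column sum, shared by all rows
    u = a(a(x0 ^ x2))                # 04 * (x0 ^ x2), used by rows 0 and 2
    v = a(a(x1 ^ x3))                # 04 * (x1 ^ x3), used by rows 1 and 3
    fwd = mix_column(x)              # the remaining terms equal the forward mix
    return [t ^ u ^ fwd[0], t ^ v ^ fwd[1], t ^ u ^ fwd[2], t ^ v ^ fwd[3]]

state = list(bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808"))
shifted = sub_shift(state)
mixed = [b for c in range(4) for b in mix_column(shifted[4 * c:4 * c + 4])]
print(bytes(mixed).hex())
```

In this grouping the decryption column reuses the encryption terms and adds two shared correction terms, which is one way the extra cost of the 09, 0E, 0B, 0D multiplications can be kept small.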
3 Fast Implementation of AES Algorithm
3.1 Hardware Implementation of AES Algorithm
The round structure of the AES algorithm allows several styles of hardware implementation [3]: (1) Serial operation: the round function is implemented in combinational logic and the 10 rounds are connected directly, with the result of each round feeding the next. A block operation completes within 1 clock cycle, which gives the fewest clocks per block. (2) Basic iteration: in feedback mode, all iterations share a single round function, and a block operation completes in 10 clock cycles. (3) Intra-round pipelining: registers are inserted into the round function to divide one round into several operation segments, and one segment completes per clock. This method has been widely discussed and used. Its advantage is that it raises the clock frequency of the algorithm.
Of these methods, method (1) needs a large amount of register and combinational logic resources to support 10 round functions working at the same time, and it also increases the delay. Typical FPGA chips can hardly meet the capacity requirement, and the resulting clock frequency is very low, so this method is not suitable for hardware implementation of the encryption algorithm. Method (2) is simple to implement and occupies few resources, but each block takes relatively long to process and throughput remains low. In method (3), because each round depends on the result of the previous one, the pipeline stages within a round cannot all execute simultaneously, which increases the number of clocks the algorithm needs: the more pipeline stages within a round, the more clocks. Although the simulated clock frequency can be very high, throughput is not significantly improved because it is limited by the global clock of the hardware encryption system.
Having analyzed these implementation methods, this paper proposes a faster FPGA implementation scheme for the AES algorithm based on pipelining. The scheme achieves high throughput even when the global clock frequency is low.
3.2 Pipeline Design
The AES algorithm has a simple structure and requires only logic operations and table lookups. With the optimized round function, the clock frequency achievable in basic iteration mode is far higher than the 33 MHz clock of the PCI interface. With the clock frequency requirement met, this paper improves throughput by reducing the number of clocks needed to process each data block. Specifically, a two-stage outer-round pipeline is used: the 10-round iteration of the AES algorithm is divided into a front and a back operation segment, each serving as one pipeline stage. Within a segment, 5 basic iterations are performed in feedback (FB) mode. When the first segment finishes a block, the result is passed directly to the second segment, and the first segment begins processing the next block. The two segments do not interfere with each other and execute in parallel. Because the data bus in practical applications (such as the PCI bus) is usually 32 bits wide, the data width of the AES IP core is set to 32 bits, and one data block is input or output in 4 clocks. To match the 5-round iteration of each pipeline stage, an idle cycle is inserted at every fifth clock (clock 5n) of block input/output. As a result, four operations run concurrently: plaintext input, ciphertext output, the first pipeline stage, and the second pipeline stage, giving the pipeline process shown in Figure 3. After the first block result is obtained, a new result is produced every 5 clocks. Seen from outside, a block takes only 5 clocks to complete.
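The timing benefit of the two-stage pipeline can be seen with a simple clock-count model. The sketch below is an idealized assumption: it counts only the round iterations, starts at the clock where the first block enters the first stage, and ignores the input/output transfer clocks shown in Figure 3. The function names are hypothetical.

```python
def pipeline_finish_clocks(num_blocks, rounds_per_stage=5, stages=2):
    """Clock at which each block leaves the last pipeline stage."""
    return [(k + stages) * rounds_per_stage for k in range(num_blocks)]

def iterative_finish_clocks(num_blocks, rounds=10):
    """Clock at which each block finishes with a single shared round function."""
    return [(k + 1) * rounds for k in range(num_blocks)]

pipelined = pipeline_finish_clocks(4)
iterative = iterative_finish_clocks(4)
print("pipelined:", pipelined)
print("iterative:", iterative)
```

Both schemes need 10 clocks for the first block, but in the pipelined model each later block finishes 5 clocks after the previous one instead of 10.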
3.3 Experimental Results and Performance Analysis
The design was synthesized in Quartus II 4.2, and the highest simulated clock frequency was 78.38 MHz, which fully meets the requirement of the lower global clock frequency. The whole system uses a 33 MHz clock. Experimental tests show a throughput of 810 Mbps. If the global clock frequency is raised, the throughput can exceed 1 Gbps.
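As a rough cross-check of these figures, the sketch below compares the measured 810 Mbps with the ideal rate of one 128-bit block every 5 clocks at 33 MHz, and estimates the clock frequency at which the ideal rate reaches 1 Gbps. The ideal values are upper bounds from the throughput formula, not measurements, and the function name `min_clock_hz` is an assumption.

```python
def min_clock_hz(target_bps, block_bits=128, clocks_per_block=5):
    """Clock frequency at which the ideal throughput reaches target_bps."""
    return target_bps * clocks_per_block / block_bits

ideal_at_33mhz = 128 / 5 * 33e6          # ideal bits per second at 33 MHz
efficiency = 810e6 / ideal_at_33mhz      # measured rate relative to the ideal
clock_for_1gbps = min_clock_hz(1e9)

print(f"ideal at 33 MHz: {ideal_at_33mhz / 1e6:.1f} Mbps")
print(f"measured / ideal: {efficiency:.3f}")
print(f"ideal clock for 1 Gbps: {clock_for_1gbps / 1e6:.2f} MHz")
```

Under these assumptions the 1 Gbps point lies well below the 78.38 MHz simulated maximum, which is consistent with the statement that a higher global clock would push throughput past 1 Gbps.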
Based on the characteristics of the AES algorithm and of the hardware encryption system, this paper presents a fast hardware design scheme for an AES IP core. With pipelining and an optimized design, very high data throughput is obtained at a low clock frequency, so that the FPGA implementation of the encryption algorithm is no longer the bottleneck for transmission speed. The design is practical and runs stably. When the block length and key length are 192 or 256 bits, the number of rounds increases. Achieving pipelined operation with good resource utilization and throughput in those cases requires further optimization of the design, which is the direction of future research.
References
1 Standaert. Efficient implementation of Rijndael encryption in reconfigurable hardware: improvements and design tradeoffs. CHES 2003, LNCS 2779: 334~350
2 Saggese. An FPGA-based performance analysis of the unrolling, tiling and pipelining of the AES algorithm. FPL 2003, LNCS 2778: 292~302
3 Gaj K, Chodowiec P. Comparison of the hardware performance of the AES candidates using reconfigurable hardware
4 Daemen J, Rijmen V. AES Proposal: Rijndael. AES algorithm submission. AES home page: http://www.nist.gov/aes, 1999-09-03