Fast Hardware Design and Implementation of AES Algorithm

Information security is an active research area in computer science, and data encryption is one of its principal tools. With the rapid development of programmable logic and the continuing emergence of high-speed integrated circuits, FPGA implementations of encryption algorithms have attracted growing attention [1][2]. Compared with traditional software encryption, hardware encryption offers: (1) better security and greater resistance to attack; (2) higher computation speed and efficiency; (3) low cost and reliable performance. A key performance indicator for an encryption system is data throughput, calculated as throughput = (data length M / number of clocks N) × clock frequency F. Improving throughput is central to improving the performance of an encryption system and is an important part of the hardware implementation of encryption algorithms.

The AES algorithm is widely used as the replacement for DES. Its hardware implementation has been discussed extensively, mostly with the aim of raising throughput by raising the clock frequency of the algorithm core. In practice, however, the global clock of an encryption system is usually kept low to ensure stability and cannot reach the frequency the core achieves in simulation. The clock of a PCI interface circuit, for example, is only 33 MHz, so the actual throughput remains low. Based on the structure of the AES algorithm and the characteristics of the hardware system, this paper proposes a design method for a fast AES IP core. The method uses an optimized round function and pipelining to reduce the number of clocks per block operation, achieving higher throughput at a low system clock frequency.

1 AES Algorithm

AES is the Advanced Encryption Standard issued by the National Institute of Standards and Technology (NIST) of the United States [4]. In October 2000, NIST announced that the Rijndael algorithm, submitted by the Belgian cryptographers Joan Daemen and Vincent Rijmen, had been selected as the Advanced Encryption Standard. The algorithm is simple in design. Unlike public-key cryptographic algorithms, it involves no complex multiplication operations, is easy to implement, and is highly flexible. The good parallelism of the round function lends itself to hardware implementation. Rijndael is an iterative block cipher whose block length and key length can each be independently specified as 128, 192, or 256 bits; the AES standard itself fixes the block length at 128 bits. This paper discusses the case where both the block length and the key length are 128 bits.

The AES algorithm arranges the 128-bit input plaintext (or ciphertext) as 16 bytes and, after an initial Add Round Key transformation, performs 10 rounds of iteration. The first 9 rounds are identical, each consisting of byte substitution (Substitute Byte), row shifting (Shift Row), column mixing (Mix Column), and round key addition (Add Round Key). The last round omits Mix Column. Decryption is similar to encryption, but the order of operations differs and each transformation is replaced by its inverse. Encryption and decryption therefore have to be implemented separately. Figure 1 shows the encryption and decryption process of the AES algorithm. For a detailed description of the algorithm, see reference [4].

2 Optimization Design of AES Algorithm

2.1 Hardware Selection

The Cyclone family is Altera's lowest-cost SRAM-based FPGA, with 2,910 to 20,060 logic elements (LEs) and up to 288 Kbit of M4K memory blocks. Each LE contains a four-input lookup table (LUT), a programmable register, and a carry chain with carry-select capability. The LUT can implement any function of four input variables, so the device can carry out a large amount of logic and is well suited as a hardware platform for encryption algorithms. The design is written in Verilog HDL, developed with Quartus II 4.2, and targets the Cyclone EP1C12Q240C8.

2.2 Optimization Design of Key Extension Unit

Key expansion takes the initial key as the seed key and generates the sub-keys for the 10 rounds of iteration through byte substitution, byte rotation, round-constant addition, and word XOR. Some of the literature proposes running key expansion concurrently with encryption, which saves FPGA memory. The author's view, however, is that keeping the key expansion logic running throughout the operation increases the dynamic power consumption of the FPGA. In addition, AES decryption starts from the last round's sub-key, so decryption can begin only after all sub-keys have been expanded, which constrains the decryption implementation. The initial key usually changes infrequently, so one key expansion result can be shared across many data blocks. Cyclone devices also have ample memory for storing the sub-keys. This design therefore adopts the more common approach: all sub-keys are stored in RAM after expansion and read out in sequence when needed. The approach places no constraint on how encryption and decryption are implemented, is flexible, and suits FPGA implementation of encryption algorithms well.
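The store-then-read approach can be illustrated with a small software model. The following Python sketch is not the authors' Verilog design. It expands a 128-bit key once into 11 sub-keys (the initial key plus one per round), keeps them in a list that stands in for the sub-key RAM, and shows that decryption simply reads the same list in reverse. The names `gf_mul`, `build_sbox`, `expand_key`, and `subkey_ram` are illustrative, and the S-Box is computed rather than stored only to keep the listing short.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ((a << 1) & 0xFF) ^ (0x1B if a & 0x80 else 0)
        b >>= 1
    return r


def build_sbox():
    """Compute the AES S-Box: multiplicative inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the inverse of x in GF(2^8)
                inv = gf_mul(inv, x)
        s = inv
        for i in range(1, 5):  # XOR with the four left rotations of inv
            s ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key):
    """Expand a 16-byte key into 11 round sub-keys of 16 bytes each."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]        # byte rotation
            temp = [SBOX[b] for b in temp]    # byte substitution
            temp[0] ^= rcon                   # round constant
            rcon = gf_mul(rcon, 0x02)
        words.append([p ^ q for p, q in zip(words[i - 4], temp)])  # word XOR
    return [bytes(b for w in words[4 * r:4 * r + 4] for b in w) for r in range(11)]


# "RAM" model: expand once, then reuse for every block processed with this key.
subkey_ram = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
encrypt_order = subkey_ram          # sub-keys 0, 1, ..., 10
decrypt_order = subkey_ram[::-1]    # decryption starts from the last sub-key
```

The example key is the one used in the key expansion appendix of FIPS-197, so the resulting sub-keys can be compared against the published values.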

Reading a sub-key from RAM takes time. To avoid the delay this would add to the initial Add Round Key, the first sub-key (the initial key) and the last sub-key (the first sub-key used in decryption) are held in two sets of registers in addition to being written to RAM. As shown in Figure 2, encryption/decryption takes these values directly from the registers and proceeds to the first round of iteration, so the 10 rounds complete within 10 clocks and the initial Add Round Key costs no extra time.

Although holding these two sub-keys takes about 256 additional registers (two 128-bit sub-keys), it makes the algorithm easier to pipeline, which helps overall performance.

2.3 Optimization of round function design

Optimizing the round function design and reducing the delay of the round function are the key to improving the clock frequency of the algorithm. This paper improves the clock frequency of the algorithm by optimizing the Substitute Byte, Shift Row and Mix Column transformations in the round function.

Substitute Byte (S-Box) is a nonlinear operation on a single byte. There are usually two ways to describe an S-Box in Verilog HDL: (1) a behavioral description using a case statement, which consumes LEs after synthesis; (2) the memory resources of the FPGA. The AES S-Box is a lookup table with 8 inputs and 8 outputs. Implemented in LEs on a Cyclone device, each S-Box requires 208 LEs. Parallel operation of the AES algorithm requires 32 S-Boxes, 6,656 LEs in total, which not only occupies a large share of the hardware resources but also complicates the structure and increases delay. Implementing the S-Box in memory needs no other hardware resources and reduces delay, which makes it the better choice.

This design adopts the memory approach and makes full use of the device: each memory block in the Cyclone device can be configured as a 256×16-bit ROM, and the encryption and decryption S-Boxes are placed in the same ROM. The encryption S-Box content occupies the first 8 bits of each ROM word and the decryption S-Box content the last 8 bits. Compared with using separate ROMs, this halves the number of memory blocks and greatly improves resource utilization.

The hardware implementation of Shift Row is very simple, being only a wiring operation. To further reduce the delay caused by wiring, Substitute Byte and Shift Row are merged, so that the delay of the two steps depends only on the S-Box ROM.

Mix Column is defined as a multiplication of polynomials of degree less than 4 with coefficients in the finite field GF(2^8) [4], which can be written as a matrix multiplication. It transforms an input column vector (X0, X1, X2, X3) into an output column vector (Y0, Y1, Y2, Y3). Encryption involves multiplication by the constants 01, 02, and 03 over GF(2^8). Decryption is more complex, involving multiplication by 09, 0E, 0B, and 0D over GF(2^8). To simplify the FPGA implementation, the matrix multiplication can be expanded and rearranged to give the following results:

Encrypt Mix Column: [equations not preserved in this copy]

Decrypt Mix Column: [equations not preserved in this copy]

Here a is a byte transformation that multiplies its input byte b by 02 over GF(2^8). Its Verilog HDL description is:

// shift b left by one bit; XOR with 8'h1b when the bit shifted out (b[7]) is 1
a = {b[6:0], 1'b0} ^ (8'h1b & {8{b[7]}});
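As an illustration of how the byte function a can carry the whole Mix Column computation, here is a Python sketch of one common rearrangement that uses only XOR and a. It is an assumption made for illustration and not necessarily the expansion the authors used. The decryption version relies on the known identity that the inverse transform equals a pre-step with multiplication by 04 (two applications of a) followed by the encryption transform. The function names are hypothetical.

```python
def a(b):
    """Python model of the Verilog expression: multiply byte b by 02 in GF(2^8)."""
    return ((b << 1) & 0xFF) ^ (0x1B if b & 0x80 else 0x00)


def mix_column(x):
    """Encryption Mix Column on one 4-byte column, using only XOR and a()."""
    x0, x1, x2, x3 = x
    t = x0 ^ x1 ^ x2 ^ x3
    return [
        x0 ^ t ^ a(x0 ^ x1),   # 02*x0 ^ 03*x1 ^ x2 ^ x3
        x1 ^ t ^ a(x1 ^ x2),   # x0 ^ 02*x1 ^ 03*x2 ^ x3
        x2 ^ t ^ a(x2 ^ x3),   # x0 ^ x1 ^ 02*x2 ^ 03*x3
        x3 ^ t ^ a(x3 ^ x0),   # 03*x0 ^ x1 ^ x2 ^ 02*x3
    ]


def inv_mix_column(y):
    """Decryption Mix Column: pre-step with a(a(.)), then mix_column."""
    y0, y1, y2, y3 = y
    u = a(a(y0 ^ y2))          # 04 * (y0 ^ y2)
    v = a(a(y1 ^ y3))          # 04 * (y1 ^ y3)
    return mix_column([y0 ^ u, y1 ^ v, y2 ^ u, y3 ^ v])


column = [0xDB, 0x13, 0x53, 0x45]
mixed = mix_column(column)
restored = inv_mix_column(mixed)
```

Because a is linear over XOR, a(x0) ^ a(x1) can be computed as a(x0 ^ x1), which is what lets each output byte of the encryption transform use a single instance of a.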

After this rearrangement, Mix Column is simpler to implement and uses fewer hardware resources. Add Round Key is a simple XOR and occupies few resources. After optimization, the maximum delay of the round function is only 8.6 ns, which leaves headroom for raising the clock frequency of the whole design.
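The shared S-Box ROM described in this section can be modelled in a few lines. The Python sketch below builds 256 words of 16 bits, each holding the encryption S-Box byte and the decryption (inverse) S-Box byte for the same address. Placing the encryption byte in the upper half of the word is an assumption made for illustration, since the text only speaks of the "first" and "last" 8 bits. All function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = ((a << 1) & 0xFF) ^ (0x1B if a & 0x80 else 0)
        b >>= 1
    return r


def build_sbox():
    """Compute the AES S-Box: multiplicative inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the inverse of x in GF(2^8)
                inv = gf_mul(inv, x)
        s = inv
        for i in range(1, 5):
            s ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def build_shared_rom():
    """Return 256 16-bit words: encryption byte in bits 15..8, decryption byte in bits 7..0."""
    sbox = build_sbox()
    inv_sbox = [0] * 256
    for x, s in enumerate(sbox):
        inv_sbox[s] = x            # the inverse table undoes the forward table
    return [(sbox[addr] << 8) | inv_sbox[addr] for addr in range(256)]


def rom_lookup(rom, addr, decrypt):
    """Read one word and select the half that matches the current direction."""
    word = rom[addr]
    return word & 0xFF if decrypt else word >> 8


ROM = build_shared_rom()
```

One address therefore serves both directions: the encryption and decryption data paths read the same word and simply take different halves of it.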

3 Fast Implementation of AES Algorithm

3.1 Hardware Implementation of AES Algorithm

The round-based structure of the AES algorithm allows several hardware implementation styles [3]: (1) Serial operation: the round function is implemented in combinational logic and the 10 rounds are connected directly, with the result of each round feeding the next. A block operation completes within 1 clock cycle, so the throughput is in principle the best. (2) Basic iteration: in feedback mode, all iterations share a single round function, and a block operation completes in 10 clock cycles. (3) Intra-round pipeline: registers are inserted inside the round function to divide one round into several segments, with one segment completed per clock. This widely discussed method has the advantage of raising the clock frequency of the algorithm.

Of these methods, (1) needs a large amount of register and combinational logic resources to support 10 round functions working at once, and its long combinational path increases delay. General-purpose FPGAs can rarely meet the capacity requirement, and the achievable clock frequency is very low, so this method is not suitable for hardware implementation of encryption algorithms. Method (2) is simple and uses few resources, but each block takes a relatively long time and throughput remains low. In method (3), because each round depends on the result of the previous one, the pipeline stages inside a round cannot all work on the same block at once, which increases the number of clocks per block. The more intra-round pipeline stages there are, the more clocks are needed. The simulated clock frequency can be very high, but because the hardware encryption system is bound by its global clock, throughput does not improve significantly.

Having analyzed these implementation methods, this paper proposes a faster FPGA implementation of the AES algorithm based on pipelining. The scheme achieves high throughput even when the global clock frequency is low.

3.2 Pipeline Design

The AES algorithm has a simple structure and needs only logic operations and table lookups. With the optimized round function, the clock frequency achievable in basic iteration mode is much higher than the 33 MHz clock of the PCI interface. Since the algorithm clock requirement is already met, this paper raises throughput by reducing the time taken to process each data block. The approach is a two-stage outer-round pipeline: the 10 rounds of the AES algorithm are divided into a front and a back operation segment, each serving as one pipeline stage. Within each segment, 5 rounds are performed as basic iterations in feedback (FB) mode. When the first segment finishes, its result passes directly to the second segment, and the first segment begins processing the next block at the same time. The two segments do not affect each other and run in parallel.

Because the data bus in practical applications (such as the PCI bus) is usually 32 bits wide, the data width of the AES IP core is set to 32 bits, and one block is input or output in 4 clocks. To match the 5 rounds of each pipeline stage, an idle cycle is inserted on every fifth clock (clock 5n) of block input/output, so that four operations run simultaneously: plaintext input, ciphertext output, the first pipeline stage, and the second pipeline stage. This gives the pipeline process shown in Figure 3. After the first block result is obtained, a new result is produced every 5 clocks. Seen from outside, a block takes only 5 clocks to complete.
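The timing described above can be sketched as a simple schedule. The Python model below is an assumption-laden illustration, not a reconstruction of Figure 3: it treats input, the two 5-round stages, and output as four phases that each occupy one 5-clock slot, with block k entering one slot after block k-1. The names `SLOT`, `PHASES`, `schedule`, and `busy_phases` are hypothetical.

```python
SLOT = 5  # clocks per slot: 4 transfer clocks + 1 idle clock, or 5 round iterations
PHASES = ("input", "stage1_rounds_1_5", "stage2_rounds_6_10", "output")


def schedule(num_blocks):
    """Return {block: {phase: (first_clock, last_clock)}} for the 4-phase pipeline."""
    table = {}
    for k in range(num_blocks):
        table[k] = {
            phase: ((k + p) * SLOT, (k + p) * SLOT + SLOT - 1)
            for p, phase in enumerate(PHASES)
        }
    return table


def busy_phases(table, clock):
    """List the (phase, block) pairs that are active on a given clock."""
    return sorted(
        (phase, k)
        for k, phases in table.items()
        for phase, (first, last) in phases.items()
        if first <= clock <= last
    )


table = schedule(6)
finish_clocks = [table[k]["output"][1] for k in range(6)]
intervals = [b - c for c, b in zip(finish_clocks, finish_clocks[1:])]
active_at_17 = busy_phases(table, 17)
```

In this model the first block needs four slots to pass through, after which `intervals` shows one finished block per 5-clock slot, and on a steady-state clock such as 17 all four phases are busy with four different blocks.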

3.3 Experimental results and performance analysis

The design was synthesized in Quartus II 4.2, and the maximum clock frequency reported was 78.38 MHz, well above the low global clock frequency required. The complete system runs from a 33 MHz clock. Experimental tests show a throughput of 810 Mbps. If the global clock frequency is raised, the throughput can exceed 1 Gbps.
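These figures can be checked against the throughput formula from the introduction. The short Python sketch below assumes a 128-bit block, 5 clocks per block for the pipelined design, 10 clocks per block for basic iteration, and 1 Gbps taken as 1000 Mbps. The function names are illustrative.

```python
def throughput_mbps(block_bits, clocks_per_block, clock_mhz):
    """Throughput = (data length M / number of clocks N) x clock frequency F, in Mbit/s."""
    return block_bits / clocks_per_block * clock_mhz


def min_clock_mhz(target_mbps, block_bits, clocks_per_block):
    """Clock frequency needed to reach a target throughput."""
    return target_mbps * clocks_per_block / block_bits


pipelined_33 = throughput_mbps(128, 5, 33.0)     # proposed design at the PCI clock
iterative_33 = throughput_mbps(128, 10, 33.0)    # basic iteration at the same clock
clock_for_1gbps = min_clock_mhz(1000.0, 128, 5)  # clock needed to pass 1 Gbps
```

Under these assumptions the theoretical peak at 33 MHz is about 844.8 Mbps, twice that of basic iteration, so the measured 810 Mbps is close to the ceiling. A global clock of roughly 39.1 MHz would be enough to pass 1 Gbps, well within the 78.38 MHz the core supports.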

Based on the characteristics of the AES algorithm and of the hardware encryption system, this paper has presented a fast hardware design for an AES IP core. Through pipelining and design optimization, very high data throughput is obtained at a low clock frequency, so that the FPGA implementation of the encryption algorithm is no longer the bottleneck for transmission speed. The design is practical and runs stably. For block and key lengths of 192 bits and 256 bits, the number of rounds increases with the length. Achieving pipelined operation with good resource utilization and throughput in those cases requires further optimization of the design, which is the direction of future research.

References

1 Standaert. Efficient implementation of Rijndael encryption in reconfigurable hardware: improvements and design tradeoffs. CHES 2003, LNCS 2779: 334-350

2 Saggese. An FPGA-based performance analysis of the unrolling, tiling and pipelining of the AES algorithm. FPL 2003, LNCS 2778: 292-302

3 Gaj K, Chodowiec P. Comparison of the hardware performance of the AES candidates using reconfigurable hardware

4 Daemen J, Rijmen V. AES Proposal: Rijndael. AES algorithm submission. AES home page: http://www.nist.gov/aes, 1999-09-03
