At present, most secure communication equipment implements cryptographic operations in one of two ways: a general-purpose CPU controls the cryptographic chip, or a dedicated hardware circuit controls it. The former relies on a general-purpose processor (GPP), which offers high flexibility, easy maintenance, and convenient upgrades. However, the limitations of the general-purpose instruction set prevent the cryptographic chip from reaching its optimal performance, which seriously reduces the speed of secure communication. Using a dedicated hardware circuit to control the cryptographic chip directly can maximize the chip's performance, but because its functions depend solely on the cryptographic chip and its peripheral devices, it has poor flexibility and a relatively long development cycle.
Whichever method is adopted, the separation of operation and control for the dedicated cryptographic chip limits cryptographic data processing performance and restricts the overall speed of the system. To address these problems, this paper analyzes a variety of cryptographic algorithms and proposes a programmable cryptographic processor architecture based on the explicitly parallel instruction computing (EPIC) design concept, which achieves a compromise between speed and flexibility.
1 Cryptographic Algorithm Analysis
1.1 Typical cryptographic algorithms and their applications
Seven block cipher algorithms and two hash functions, namely DES, IDEA, Rijndael, RC6, Serpent, Twofish, Mars, MD5 and SHA, are analyzed.
A block cipher algorithm is a bijective function that maps an n-bit plaintext block to an n-bit ciphertext block, where n is the block length. Encryption and decryption use the same key, so it is also called a symmetric cipher algorithm. A hash function compresses a message of arbitrary length into a fixed-length message digest. It is mainly used in digital signatures, message integrity checking, and message origin authentication.
The DES algorithm (Data Encryption Standard) was the first publicly available, fully specified block cipher algorithm to gain worldwide recognition. It was originally designed by IBM, which held the patent. Over the following two decades, DES served as the typical block cipher algorithm and was widely used to protect commercial data (for example, in banking systems).
The IDEA algorithm (International Data Encryption Algorithm) was published under that name in 1992; it was previously known as IPES (Improved Proposed Encryption Standard). It is well known for its wide use in the email encryption and authentication software PGP.
Rijndael was announced in 1998 and won the AES selection hosted by NIST (National Institute of Standards and Technology) in 2000. Since then, the Rijndael algorithm has also been called the AES algorithm, and it has become the new encryption standard that is gradually replacing DES.
RC6, Serpent, Twofish and Mars algorithms are candidate algorithms for AES evaluated together with the Rijndael algorithm. They all embody the design principles of block cipher algorithms to varying degrees and have had a considerable impact on the development of applied cryptography.
The MD5 message digest function is a one-way hash function proposed by Rivest, one of the designers of the RSA algorithm. It does not rely on any hardness assumption or underlying cryptosystem; it is constructed directly and is very fast.
SHA is the secure hash algorithm specified in the Federal Information Processing Standard FIPS 180, published by NIST in 1993. Its revised version, commonly known as SHA-1, was issued in 1995.
1.2 Basic Operations in Cryptographic Algorithms
Based on the analysis of the above algorithms, the core operation types of each algorithm are extracted, and their basic operations are summarized into the following six categories: S-box operation, bit permutation operation, arithmetic operation, logical operation, shift operation, and finite field multiplication operation. Arithmetic operations include modular addition/subtraction and modular multiplication, and logical operations consist of AND, OR, NOT, and XOR. Table 1 lists their specific applications in the various algorithms in detail. For example, the DES algorithm mainly uses S-box operations, bit permutation, XOR, and shift operations.
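The six operation categories above can be illustrated in software. The following is a minimal Python sketch, not the processor's hardware implementation: the function names are hypothetical, the modular multiplication shown is the IDEA-style variant modulo 2^16 + 1, and the finite field multiplication assumes GF(2^8) with the Rijndael reduction polynomial 0x11B. Other algorithms use different moduli, widths, and polynomials.

```python
MASK32 = 0xFFFFFFFF  # 32-bit word mask


def sbox_lookup(table, x):
    """S-box operation: substitute x through a lookup table."""
    return table[x]


def permute_bits(x, perm, width):
    """Bit permutation: output bit i takes input bit perm[i] (bit 0 = LSB)."""
    out = 0
    for i in range(width):
        out |= ((x >> perm[i]) & 1) << i
    return out


def mod_add32(a, b):
    """Arithmetic operation: addition modulo 2^32."""
    return (a + b) & MASK32


def mod_mul_65537(a, b):
    """IDEA-style multiplication modulo 2^16 + 1, where 0 stands for 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    return (a * b % 0x10001) & 0xFFFF


def rotl32(x, r):
    """Shift operation: rotate a 32-bit word left by r bits."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32


def gf256_mul(a, b, poly=0x11B):
    """Finite field multiplication in GF(2^8), reduced by poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        a <<= 1
        if a & 0x100:
            a ^= poly        # reduce when the degree reaches 8
        b >>= 1
    return result
```

Logical operations (AND, OR, NOT, XOR) map directly onto Python's `&`, `|`, `~`, and `^` operators, so they are not wrapped here.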
2 Design of Programmable Cryptographic Processor Architecture
The programmable cryptographic processor adopts an EPIC structure, which exploits the concurrency available between scalar operations by increasing the number of functional units. Independent instructions are explicitly packed at compile time into one very long machine instruction word and issued to the pipeline, where each functional unit executes its part concurrently; the instruction-level parallelism is typically 4 to 8. Hardware control in this structure is relatively simple, the inherent parallelism of computationally intensive applications is exposed directly, and little branch prediction is needed, so programs running on it can reach a high degree of actual instruction-level parallelism. These characteristics match the requirements of cryptographic algorithms, which are computationally intensive and execute largely sequentially.
The hardware structure of the programmable cryptographic processor architecture is shown in Figure 1. The entire processor consists of three parts: the data path, the control unit, and the input/output interface circuit.
The data path is one of the key components of the processor. It includes 6 functional units, FU0~FU5, that can execute in parallel, 32 32-bit general registers, 4×32 32-bit key registers, and a write-back unit.
The functional units are the core of the processor's instruction execution, and each is composed of several cryptographic operation modules. FU0~FU3 have exactly the same internal composition and structure. Each takes three 32-bit operands, two from the general register file and one from the key register file, and produces a 32-bit result. Each of FU0~FU3 contains 7 operation modules: an S-box operation module, a modular addition/subtraction module, a modular multiplication module, a 32-bit shift module, a finite field multiplication module, a two-input logic module, and a three-input logic module. FU4 contains a 128-bit permutation module; its input is twelve 32-bit operands, 8 from the general register file and 4 from the key register file. FU5 contains a 128-bit shift module; its input is likewise twelve 32-bit operands, 8 from the general register file and 4 from the key register file.
The operation modules above are not fixed-function; they are reconfigurable. Table 2 shows the modes supported by the four reconfigurable operation modules.
In addition to the reconfigurable modes mentioned above, each operation module can optionally apply an XOR before the operation, after the operation, or both, as the algorithm requires. Because the delay of an XOR is very small, adding it does not lengthen the critical path of the operation, yet it removes the separate clock cycle that a standalone XOR would otherwise need, so the total number of clock cycles falls with no penalty to overall performance. Table 3 shows the round operation of the Rijndael algorithm: by applying the XOR immediately after the finite field multiplication, the number of clock cycles per round is reduced from 4 to 3, so 10 rounds save 10 clock cycles.
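The effect of attaching an XOR to an operation module can be sketched as follows. This is an illustrative Python model only: `fused_op` and `total_cycles` are hypothetical names, and the cycle model simply multiplies the per-round figures quoted above by the round count, ignoring any setup or final-round differences.

```python
def fused_op(op, x, pre_xor=None, post_xor=None):
    """Apply op to x with an optional XOR on the input and/or the output.

    Models one operation module whose XOR stages are part of the same step,
    rather than separate instructions.
    """
    if pre_xor is not None:
        x ^= pre_xor
    y = op(x)
    if post_xor is not None:
        y ^= post_xor
    return y


def total_cycles(rounds, cycles_per_round):
    """Simplified cycle count: every round costs the same number of cycles."""
    return rounds * cycles_per_round


ROUNDS = 10                    # round count used in the text's example
separate = total_cycles(ROUNDS, 4)   # XOR issued as its own step
fused = total_cycles(ROUNDS, 3)      # XOR folded into the multiplication
saved = separate - fused             # one cycle saved per round
```

With the figures from the text, `saved` evaluates to 10, matching the stated saving of one clock cycle per round over 10 rounds.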
The control unit handles instruction fetch, instruction decoding, and instruction memory address generation, and coordinates the correct execution of internal processor instructions and external user commands.
The input/output interface circuit includes 16 32-bit input registers, 16 32-bit output registers, 4 data length counters, one 32-bit command register, and related logic. It loads instructions and operation data from the 32-bit data bus into the instruction memory and input registers, and writes operation results from the internal general registers to the output registers.
3 Instruction System Design
The instruction system embodies both the algorithm elements and the characteristics of the cryptographic processor architecture. Its design must support the parallel execution capability of the hardware, that is, the exploitation of instruction-level parallelism (ILP). How well ILP is exploited is critical to making full use of the hardware of the cryptographic microprocessor and to improving program performance. ILP technology refers to a complete set of processor design and compilation techniques that accelerate program execution by executing independent machine operations (such as memory reads and writes, logical operations, and arithmetic operations) in parallel. ILP can be measured by the average number of instructions executed per cycle (IPC), or by the average number of cycles per instruction over the whole program (CPI), where CPI = 1/IPC. The programmable cryptographic processor architecture adopts an explicitly parallel instruction computing structure, and its instruction-level parallelism reaches 5.
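The relationship between IPC and CPI can be checked with a short calculation. The sketch below is illustrative: the function names are hypothetical, and the instruction and cycle counts are placeholder values chosen to correspond to a parallelism of 5, not measurements from the processor.

```python
def ipc(instructions, cycles):
    """Average number of instructions executed per cycle."""
    return instructions / cycles


def cpi(instructions, cycles):
    """Average number of cycles per instruction (the reciprocal of IPC)."""
    return cycles / instructions


# Placeholder workload: 500 instructions retired in 100 cycles.
example_ipc = ipc(500, 100)   # 5 instructions per cycle
example_cpi = cpi(500, 100)   # 0.2 cycles per instruction
```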
3.1 Instruction Classification
Instructions in programmable cryptographic processor architectures are classified into the following categories:
(1) Static configuration instructions. These configure control information that remains unchanged, or changes very rarely, during key generation and encryption/decryption. Once the algorithm is chosen, its S-box lookup tables, its finite field multiplication matrix and irreducible polynomial, and the control information for its permutations are fixed, and they do not change with the operation mode. Separating these configuration instructions from the encryption/decryption instructions greatly reduces redundant instruction encoding during cryptographic operations. This shortens the instruction word, increases the number of useful operations each operation instruction word can hold, improves encryption/decryption speed, and reduces the code size of the cryptographic program.
(2) Short instructions. They perform all cryptographic operations other than permutation and 128-bit shift, as well as data transfers between internal registers.
(3) Long instructions. They perform permutation and 128-bit shift operations.
(4) Super long instructions. They perform immediate value operations and multi-branch judgment operations.
(5) Control instructions. They perform control operations such as program jumps, subroutine calls and returns, and single-branch judgment.
3.2 Instruction form
In hardware, the multiple functional units support the parallel execution of multiple instructions. The rules that determine which instructions can execute in parallel, which cannot, and how several instructions are assembled into one instruction word are called instruction assembly rules. This design has the following instruction forms:
(1) Static configuration instructions.
(2) Super long instructions.
(3) Short instruction || short instruction || short instruction || short instruction || control instruction.
(4) Long instruction || control instruction.
A short instruction is 37 bits long, a control instruction is 32 bits long, and a long instruction is 148 bits long. In every form, the final instruction word is 192 bits long (including the instruction assembly identifiers). For example, four short instructions can be assembled with one control instruction into a single instruction word, and a long instruction can likewise be assembled with one control instruction, but static configuration instructions and super long instructions cannot be assembled with other instructions into a 192-bit instruction word.
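The bit lengths above are consistent with one another: 4 × 37 + 32 = 180 bits and 148 + 32 = 180 bits, which leaves 12 bits of the 192-bit word. The Python sketch below packs both assembled forms into a 192-bit integer. It is a hypothetical layout for illustration only: the source does not specify the field order or how the remaining 12 bits are divided, so the sketch assumes a single 12-bit identifier field (`tag`) in the most significant bits, followed by the instruction fields and then the control instruction.

```python
SHORT_BITS, CONTROL_BITS, LONG_BITS, WORD_BITS = 37, 32, 148, 192
PAYLOAD_BITS = 4 * SHORT_BITS + CONTROL_BITS   # 180 bits of instructions
TAG_BITS = WORD_BITS - PAYLOAD_BITS            # 12 bits (assumed identifier field)


def _check(value, bits, name):
    """Reject a field value that does not fit in its bit width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")


def pack_short_form(tag, shorts, control):
    """Pack four short instructions and one control instruction."""
    if len(shorts) != 4:
        raise ValueError("exactly four short instructions are required")
    _check(tag, TAG_BITS, "tag")
    _check(control, CONTROL_BITS, "control instruction")
    word = tag
    for s in shorts:
        _check(s, SHORT_BITS, "short instruction")
        word = (word << SHORT_BITS) | s
    return (word << CONTROL_BITS) | control


def pack_long_form(tag, long_instr, control):
    """Pack one long instruction and one control instruction."""
    _check(tag, TAG_BITS, "tag")
    _check(long_instr, LONG_BITS, "long instruction")
    _check(control, CONTROL_BITS, "control instruction")
    return (((tag << LONG_BITS) | long_instr) << CONTROL_BITS) | control
```

Because a long instruction occupies exactly the same 148 bits as four short instructions, both forms share one word layout in this sketch, which is one plausible reason for the lengths chosen.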
4 Performance Analysis
Since the programmable cryptographic processor architecture supports the parallel execution of 5 instructions, its data path is denoted 5CS (5 Combining-Strands). The data path without instruction combining is denoted NCS (No-Combining-Strands). These two cases are compared with the Alpha processor and the CryptoManiac cryptographic processor [9]. The number of clock cycles required for encryption/decryption on the four data paths is shown in Table 4. The comparison shows that the programmable cryptographic processor needs far fewer clock cycles, especially relative to the general-purpose Alpha processor: the number of clock cycles for encryption/decryption is reduced by 83% for the DES algorithm, 92% for the IDEA algorithm, 91% for the Rijndael algorithm, 69% for the RC6 algorithm, and 78% for the Twofish algorithm.
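The reduction percentages quoted above are relative to the baseline processor's cycle count. A minimal Python sketch of that calculation follows; the function name is hypothetical, and the cycle counts in the example are placeholders that are not taken from Table 4.

```python
def cycle_reduction_percent(baseline_cycles, new_cycles):
    """Percentage by which new_cycles is lower than baseline_cycles."""
    if baseline_cycles <= 0:
        raise ValueError("baseline_cycles must be positive")
    return 100 * (baseline_cycles - new_cycles) / baseline_cycles


# Placeholder values only, NOT data from Table 4:
# a baseline of 1000 cycles against 250 cycles is a 75% reduction.
example = cycle_reduction_percent(1000, 250)
```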
To verify the correctness of the data path and control path of the programmable cryptographic processor architecture, the Altera Stratix II EP2S180F1508C4 device is used as the FPGA target chip, and the Altera Quartus II 5.0 tool is used for synthesis. Mentor's ModelSim 5.8c is used for functional simulation before synthesis and timing simulation after synthesis, and the results are correct. The specific resource usage is shown in Table 5.
Flexibility and efficiency have always been the limiting factors in the use of cryptographic algorithms. General-purpose microprocessors offer better flexibility, but their performance on some algorithms cannot meet requirements; dedicated algorithm chips achieve high performance but lose flexibility. To resolve this conflict, this paper takes the EPIC microprocessor architecture as its starting point, systematically studies the key technologies of a general parallel block cipher processor model, including the cryptographic operation units and the instruction set, and implements the design, achieving a good compromise between performance and flexibility.