Some Discussions on the Very Long Instruction Word (VLIW) Architecture

Latest update time: 2021-08-31 22:25


Source: Compiled by Semiconductor Industry Observer (icbank) from "technews", with thanks.

In 2015, Russia's MCST (Moscow Center for SPARC Technologies) released the Elbrus-4C, a domestically developed quad-core processor. Two cores translated the x86 instruction set, and two cores were used for general purposes. Later, someone published screenshots of the 2002 game The Elder Scrolls III: Morrowind running on the Elbrus platform.

The Elbrus product line has since grown to the 16-core, 2 GHz Elbrus-16S. The high-performance processors of the former Soviet Union and Russia are a fascinating topic that deserves its own introduction, and I will share it with readers when I have the opportunity.

Excluding the header, a single Elbrus E2K instruction word can contain between 1 and 15 operations, and its encoded length ranges from 64 to 512 bits (RISC instructions are generally fixed at 32 bits). If this does not count as a "very long instruction", it is hard to find a longer example.

The Elbrus series of processors was developed by Boris Babaian, the "father of Soviet supercomputers". Technical details of its underlying E2K architecture were disclosed as early as 1999 (although at that time it existed only as a Verilog hardware description, with no real product yet). Elbrus E2K is also a processor that uses the Very Long Instruction Word (VLIW) architecture.

It follows the VLIW trend that was popular in the 1990s and ran from digital signal processors to general-purpose processors: Philips' TriMedia, the Slovak startup DanSoft, many digital signal processors (DSPs) from ADI, Lucent, Motorola and TI, the Intel i860, the Fujitsu FR-V, Transmeta's Crusoe (the company where Linus Torvalds once worked), Sun's MAJC, and Intel's Itanium. There are too many examples to list.

In particular, the VLIW pioneers of the 1980s, Multiflow (Trace 7/300) and Cydrome (Cydra 5), were successively absorbed by HP. This gave rise to a VLIW-based successor project to PA-RISC, which evolved into the well-known IA-64 instruction set and the Intel Itanium processor.

What is the magic of VLIW that attracted so many manufacturers? Why has it almost disappeared from today's mainstream general-purpose processors? Read on.

As superscalar processors become more complex

In the late 1980s, as superpipelining, superscalar issue, out-of-order execution, and speculative execution became widespread, the number of instructions a processor executed simultaneously rose sharply, and circuit complexity rose with it.

A superscalar architecture must fetch, decode, execute and write back two or more instructions per clock cycle, which inevitably leads to conflicts: data dependencies (an instruction needs the result of an earlier instruction), control dependencies (an instruction must wait for the outcome of a conditional test), and structural conflicts (two instructions need the same execution unit or register file at the same time). To deal with these problems, the processor microarchitecture only becomes more sophisticated and complex, and the probability of product bugs increases.
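The data-dependency case can be made concrete with a small sketch. The `Instr` tuple and the `classify_hazards` helper are hypothetical names introduced only for illustration. They model an instruction as one destination register plus its source registers and report the classic read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW) relationships that superscalar hardware has to detect on every cycle.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical three-address instruction: dest = op(srcs)."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def classify_hazards(first: Instr, second: Instr) -> Set[str]:
    """Return the data hazards `second` has with respect to the earlier `first`."""
    hazards = set()
    if first.dest in second.srcs:
        hazards.add("RAW")  # second reads what first writes (true dependency)
    if second.dest in first.srcs:
        hazards.add("WAR")  # second overwrites a register first still reads
    if second.dest == first.dest:
        hazards.add("WAW")  # both write the same register
    return hazards


add = Instr("add", "r1", ("r2", "r3"))
mul = Instr("mul", "r4", ("r1", "r5"))  # needs r1, which add produces
sub = Instr("sub", "r6", ("r7", "r8"))  # touches none of add's registers

print(classify_hazards(add, mul))  # {'RAW'}
print(classify_hazards(add, sub))  # set()
```

Only the pair with an empty result can safely be issued in the same cycle. In a superscalar design this check is done by hardware for every pair of in-flight instructions, which is where much of the complexity comes from.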

The rise of the RISC instruction set certainly eased this difficulty, but it did not eliminate the problem. Under commercial pressure to keep improving performance, modern high-performance RISC processors are still dinosaur-sized designs, not to mention the CISC x86 family.

The Intel Pentium floating-point division bug told the world a cruel truth: just like software, processors can have defects (especially with an instruction set as difficult as x86). Today's mainstream processors all come with a long list of errata, many of which are old problems that will never be fixed.

It is one thing that defects in shipped hardware are difficult to fix. But what if we transfer the complexity of the processor to software? Or even bind instruction-level parallelism directly into the instruction set architecture? This is the background to the short-lived popularity of VLIW in the 1990s.

The concept originated from the horizontal microcode of the mainframe

The concept of VLIW has very old origins. As early as 1946, Alan Turing (the protagonist of the movie "The Imitation Game"), the father of computer science and artificial intelligence, proposed the idea of horizontal microcode (each field of the wide microinstruction maps directly to a control action in the circuit, as opposed to vertical microcode, which requires additional decoding). The details were worked out by Maurice Wilkes, the creator of the term microcode, and the technique played a key role in the birth of the IBM S/360 mainframe, which pioneered the notion of "computer architecture" (meaning a backward-compatible instruction set architecture).

Before the advent of IBM's S/360, each computer's instruction set was tailored to a specific application, and machines were incompatible with each other. This is hard to imagine for those of us who are used to running the same operating system and applications on Intel, AMD, and VIA x86 processors.

Of course, microcode is not the same as instructions; it is a means of implementing them, for example by generating control signals through microprograms composed of microcode. But in the 1970s, some special-purpose computers whose control units were implemented in horizontal microcode, such as Floating Point Systems' products, introduced writable (no longer read-only) microcode memory, turning them into programmable VLIW computers. Then in the 1980s, smaller computer companies such as Multiflow, Cydrome and Culler attempted to build general-purpose VLIW minisupercomputers.

Bind instruction parallelism directly into the instruction set

The core idea of VLIW is to "bind instruction parallelism directly into the instruction set architecture". One instruction packages a group of operations of different kinds and feeds them to the processor at once. Taking a simple two-operand addition as an example, CISC, RISC and VLIW look as shown in the figure below, which is quite simple and easy to understand.

The ideal VLIW is "one carrot, one hole": each instruction field, with its own kind of computation, corresponds directly to a dedicated execution unit, so the hardware does not need to schedule or dispatch instructions; all of that is done by software. This also means that the practicality of VLIW depends on whether the compiler can pack well-scheduled operations tightly into the binary it generates. If the optimization is not good enough, the instructions are filled with NOPs (No Operations) that do nothing, which not only wastes execution units but also hurts performance.
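How NOPs creep in can be shown with a minimal sketch of compile-time packing. It describes no real compiler or machine. The three-slot layout in `SLOTS`, the `pack` function, and the sample operations are all assumptions made for illustration. Every operation is assumed to take one cycle, and every register is assumed to be written only once.

```python
from typing import Dict, List, Tuple

SLOTS = ("ALU", "MEM", "FPU")  # hypothetical machine: one slot per unit type

# (assembly text, unit it needs, destination register, source registers)
Op = Tuple[str, str, str, Tuple[str, ...]]


def pack(ops: List[Op]) -> List[Dict[str, str]]:
    """Greedy in-order packing of operations into fixed-format bundles."""
    bundles: List[Dict[str, str]] = []
    ready_at: Dict[str, int] = {}  # register -> first bundle that may read it
    for text, unit, dest, srcs in ops:
        # An operation cannot be placed before its inputs have been produced.
        i = max((ready_at.get(s, 0) for s in srcs), default=0)
        # Skip bundles whose slot for this unit is already taken.
        while i < len(bundles) and unit in bundles[i]:
            i += 1
        while len(bundles) <= i:
            bundles.append({})
        bundles[i][unit] = text
        ready_at[dest] = i + 1
    return bundles


ops: List[Op] = [
    ("ld r1,[a]", "MEM", "r1", ()),
    ("ld r2,[b]", "MEM", "r2", ()),
    ("add r3,r1,r2", "ALU", "r3", ("r1", "r2")),
    ("fmul f1,f2,f3", "FPU", "f1", ("f2", "f3")),
    ("add r4,r3,r3", "ALU", "r4", ("r3",)),
]

bundles = pack(ops)
for b in bundles:
    print(" | ".join(b.get(slot, "nop") for slot in SLOTS))

nops = sum(len(SLOTS) - len(b) for b in bundles)
print(nops, "of", len(SLOTS) * len(bundles), "slots are nop")  # 7 of 12 slots are nop
```

With this dependency chain, five useful operations occupy four bundles, so more than half of the encoded slots are padding. A better scheduler or a more parallel program would fill more of them, which is exactly the burden VLIW places on the compiler.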

Software backward compatibility is a big problem for VLIW

The first problem with VLIW is that the code is too large. To compare two instruction sets from the same vendor, Intel: IA-64 VLIW code is 3.7 to 4.8 times the size of x86 code. The cost shows up as a need for larger cache capacity and a more efficient memory subsystem.

Binary code compatibility is also a major limitation of VLIW. If future processors add more execution units to reduce instruction execution latency, the instruction format and the instruction schedule must change. Different binaries are then needed, so porting programs between VLIW processors of "different generations or different instruction dispatch widths" is much harder than on a superscalar architecture, where the hardware dispatches instructions.
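A toy sketch of why the dispatch width leaks into the binary follows. The `pack_for_width` function and the two "generations" are hypothetical, and the six operations are assumed to be mutually independent so that only the width matters.

```python
from typing import List


def pack_for_width(ops: List[str], width: int) -> List[List[str]]:
    """Statically pack independent operations into bundles of a fixed width."""
    bundles = []
    for i in range(0, len(ops), width):
        chunk = ops[i:i + width]
        bundles.append(chunk + ["nop"] * (width - len(chunk)))  # pad the tail
    return bundles


ops = [f"op{i}" for i in range(6)]    # six mutually independent operations
gen1_binary = pack_for_width(ops, 2)  # compiled for a 2-wide first generation
gen2_binary = pack_for_width(ops, 4)  # compiled for a 4-wide second generation

print(len(gen1_binary), len(gen2_binary))  # 3 2
print(gen2_binary[1])                      # ['op4', 'op5', 'nop', 'nop']

# If the 4-wide machine ran the old binary one 2-wide bundle per cycle,
# half of its slots would sit idle on every cycle.
idle_fraction = 1 - 2 / 4
print(idle_fraction)  # 0.5
```

The same source program yields two different bundle streams. The old binary cannot exploit the wider machine without recompilation, and the new binary does not fit the older, narrower format at all. A superscalar processor hides this difference behind its hardware dispatcher.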

This does not mean that superscalar architectures need no software optimization (for example, Intel's own compiler has options that target its own x86 processors, such as QxP, which is specific to the 90nm Pentium 4 "Prescott" and tries to use the newly added SSE3 instructions). But guaranteed compatibility with existing program code is at least an undeniable major advantage, and it is a key reason why superscalar architectures are still the mainstream of high-performance general-purpose processors today.

The last attempt to make VLIW general-purpose: Intel IA-64

But VLIW is not a monolithic concept that must "follow the old ways". The IA-64 instruction set jointly developed by Intel and HP attempted to create a more flexible VLIW instruction set that does not tie the instruction format to the execution pipeline.

IA-64 packages three 41-bit instructions and a 5-bit template field, which describes how the instructions inside are arranged, into a 128-bit instruction packet (3 × 41 + 5 = 128). Before the packet is sent to the execution units, the processor reads the template field, so it knows in advance "what to do next" and can allocate the required execution units. The instructions are then executed in order, and the packet is retired once all of it has been processed. This keeps most of the advantages of VLIW, although the hardware may be slightly more complicated than in a pure VLIW.
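The arithmetic of the packet format can be checked with a short sketch that packs and unpacks such a 128-bit word. The helper names `encode_bundle` and `decode_bundle` are made up for this example. Placing the template in the lowest bits is an assumption of the sketch, and neither the template values nor the contents of the 41-bit slots are interpreted.

```python
from typing import Sequence, Tuple

TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def encode_bundle(template: int, slots: Sequence[int]) -> int:
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def decode_bundle(word: int) -> Tuple[int, Tuple[int, ...]]:
    """Split a 128-bit bundle back into its template and three slots."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (word >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK for i in range(3)
    )
    return template, slots


# 0x10 is an arbitrary template value; its meaning is not modelled here.
word = encode_bundle(0x10, (1, 2, SLOT_MASK))
print(word.bit_length())            # 128
print(decode_bundle(word)[0] == 0x10)  # True
```

Because the template sits in a fixed place, a decoder can look at five bits and route all three slots before examining the instructions themselves, which is the "know in advance" property described above.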

However, VLIW, which bundles a group of operations together, is particularly susceptible to pipeline stalls caused by conditional branches. Reducing branch instructions at the instruction set level is therefore a goal of IA-64 (and of many VLIW instruction sets).

IA-64 provides 64 software-controlled predicate registers. Predicates control the execution flow and replace most simple conditional branches, so there is no need to gamble on branch prediction. If the processor has abundant computing resources, it can afford the luxury of "executing both sides at the same time and keeping only the required result" (I am curious whether real Itanium applications actually do this). Put more abstractly, predicated execution converts "control flow" into "data flow", or in other words, a "control dependency" into a "data dependency".
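The conversion of control flow into data flow can be sketched as follows. This is an illustrative model and not IA-64 code: the register names, the predicate names `p1` and `p2`, and the tiny interpreter loop are all hypothetical. The point is that both sides of the condition are computed in a straight line, and the predicate only decides which result is committed.

```python
def branchy_abs_diff(a: int, b: int) -> int:
    """Ordinary version: the result depends on which path the branch takes."""
    if a > b:
        return a - b
    return b - a


def predicated_abs_diff(a: int, b: int) -> int:
    """If-converted version: no branch in the modelled instruction stream."""
    regs = {"a": a, "b": b, "r": 0}
    preds = {}
    # A compare writes a predicate and its complement.
    preds["p1"] = regs["a"] > regs["b"]
    preds["p2"] = not preds["p1"]
    # Each operation is guarded by a predicate: (guard, destination, compute).
    program = [
        ("p1", "r", lambda r: r["a"] - r["b"]),
        ("p2", "r", lambda r: r["b"] - r["a"]),
    ]
    for guard, dest, compute in program:
        value = compute(regs)  # both sides are always executed
        if preds[guard]:       # models the commit gate, not a branch
            regs[dest] = value
    return regs["r"]


print(predicated_abs_diff(5, 3), predicated_abs_diff(3, 5))  # 2 2
```

In the predicated version the two subtractions depend only on the value of a predicate, so they can be scheduled like any other data-dependent operations, at the price of always spending execution resources on the side that is thrown away.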

If you find reading program code annoying, then this picture is the right choice.

Speaking of predicated execution: the SIMT (single instruction, multiple threads) architecture of GPGPUs, led by the nVidia G80 (Tesla), also relies on this same trick to control the execution flow of large numbers of threads. This is also its biggest difference from SIMD (single instruction, multiple data).
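A rough model of how a group of SIMT threads handles a divergent condition is sketched below. The function `simt_step` and its per-lane computation are invented for illustration and do not correspond to any real GPU instruction set. Each side of the condition is issued once for the whole group, and a per-lane mask decides which lanes keep the result.

```python
from typing import List, Tuple


def simt_step(values: List[int]) -> Tuple[List[int], int]:
    """Per lane: v // 2 if v is even, else 3 * v + 1. Returns (results, passes)."""
    mask = [v % 2 == 0 for v in values]  # one predicate bit per lane
    out = list(values)
    passes = 0
    # "Then" side: issued once for the group, written only where mask is set.
    if any(mask):
        out = [v // 2 if m else o for v, m, o in zip(values, mask, out)]
        passes += 1
    # "Else" side: issued once for the group, written only where mask is clear.
    if not all(mask):
        out = [3 * v + 1 if not m else o for v, m, o in zip(values, mask, out)]
        passes += 1
    return out, passes


print(simt_step([4, 7, 10, 3]))  # ([2, 22, 5, 10], 2)
print(simt_step([2, 4, 6, 8]))   # ([1, 2, 3, 4], 1)
```

When the lanes disagree, the group pays for both sides. When they all agree, one side is skipped entirely, which is why divergence within a group is what costs performance on such hardware.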

Itanium 9500 "Poulson" ends "general-purpose VLIW"

In this light, the "general-purpose VLIW" IA-64 looks attractive, combining the simplicity of VLIW with the compatibility of superscalar, but there is no free lunch. Intel's first three generations of Itanium cores (Merced, McKinley, Tukwila) all used simple static instruction scheduling. A single thread can execute up to two instruction packets (6 instructions) at the same time. To issue and retire instructions as quickly as possible, enough internal resources must be provided to cover every combination of 6 instructions. Although this simplifies the data path inside the processor, it also leaves a large number of execution units idle, which enlarged the Itanium core and left Itanium's single-thread performance unable to catch up with the x86 processors of the same period.

The later Itanium 9500 "Poulson", developed by the DEC Alpha team, took a step back and in effect repudiated the core value of VLIW by adopting dynamic instruction scheduling similar to superscalar. Instruction packets are broken up at the front end of the pipeline, each type of instruction has its own scheduling queue, and the number of instruction packets that can be dispatched at the same time was doubled to 4, achieving higher processor utilization and a 25% to 40% performance increase over the previous generation.

But this is also a disguised announcement that "General Purpose VLIW is Dead".

In 2000, when Intel and HP proudly released the Itanium processor and the IA-64 instruction set and declared that "Out-Of-Order is Out-Of-Date", Martin Hopkins, then an IBM Fellow and a designer of the early RISC research project IBM 801, publicly criticized Intel and HP. The textbook co-authored by the two RISC masters likewise described Itanium as a "mediocre integer processor". In retrospect, all of these criticisms proved correct, which is regrettable.

Will VLIW make a comeback?

From the emergence of CISC and the reaction of RISC to the fact that x86 still dominates servers and personal computers, one simple fact is clear: the boundary between hardware and software has never been fixed. Which tasks belong to hardware and which to software depends on the technical limits and application needs of each era, and there is no absolute standard. However, the concept of backward compatibility established by the IBM S/360 mainframe in 1964 is deeply rooted and remains the top priority. It is also the threshold that VLIW has been unable to cross.

With dynamically scheduled superscalar processors still the mainstream, VLIW has been almost invisible in general-purpose processors since Intel Itanium, surviving only in GPUs (AMD's third-generation graphics architecture, TeraScale), embedded applications and digital signal processing (DSP). Elbrus, an x86-compatible processor of Russian origin, has become an endangered rare beast. It is also the subject of the author's next article, so stay tuned.

*Disclaimer: This article is originally written by the author. The content of the article is the author's personal opinion. Semiconductor Industry Observer reprints it only to convey a different point of view. It does not mean that Semiconductor Industry Observer agrees or supports this point of view. If you have any objections, please contact Semiconductor Industry Observer.

