32/28nm node design solution based on Talus Vortex FX

Publisher: WhisperingWinds | Last updated: 2010-12-17

Preface

Current high-end ASIC/ASSP/SoC developers can be grouped into three categories: mainstream, early adopters, and technology leaders. At the time of this writing, mainstream developers are working at the 65nm technology node, early adopters are focusing on 45/40nm designs, and technology leaders are pushing toward the 32/28nm node and below. Given the accelerating pace of technology adoption, it will not be long before the next generation of early adopters transitions to the 32/28nm node, with their mainstream counterparts following close behind.

Many issues arise when designing at the 32/28nm node, including low-power design, crosstalk effects, process variation, and a significant increase in the number of operating modes and corners. This article first presents a high-level view of Magma's Talus® Vortex 1.2 physical implementation flow, then introduces the issues involved in 32/28nm node design and describes how Talus Vortex 1.2 addresses them.

In addition to these technical issues, the growing size and complexity of 32/28nm designs also raise concerns about engineering resources (achieving more without growing the team while maintaining or even shortening existing schedules), hardware resources (handling larger designs on existing equipment and servers without adding memory or purchasing new machines), and meeting increasingly tight development schedules. To address these concerns, this article also describes how Talus Vortex significantly improves capacity and performance through Talus Vortex FX's innovative Distributed Smart Sync™ technology. Talus Vortex FX is the first and only distributed place-and-route solution.

Introduction to the Talus Vortex 1.2 physical implementation flow

Figure 1 shows a high-level view of the standard Talus Vortex 1.2 physical flow. Note that the flow assumes the existence of a chip-level netlist, which may have been generated by Magma or a third-party design-entry and synthesis tool.

Figure 1. High-level view of the standard Talus Vortex 1.2 process

The first step is netlist preparation, which includes tasks such as determining the locations of the input/output pads (I/O pads) and all macro cells. The second step is standard-cell placement, performed concurrently with global routing, because routing affects cell placement and cell placement in turn affects routing.

After initial cell placement is complete, the third step is to synthesize the clock tree and add it to the design. Most clock tree synthesis tools do not perform true multi-mode multi-corner (MMMC) clock tree implementation; instead they divide the timing environment into best-case and worst-case corners. This approach is too pessimistic and leaves performance on the table. At the 32/28nm node, implementing a true MMMC clock tree is imperative (see also the "MMMC issues" section later in this article). The clock tree synthesis in Talus 1.2 therefore deploys full MMMC analysis, achieving a more robust clock system with, on average, 10% better latency and 10% less area, as shown in Figure 2.

Figure 2. Full MMMC clock tree synthesis enables a more advanced robust clock system

Once the clock tree is added, the fourth step is complex optimization, and the fifth step is detailed routing. The convergence of the Talus 1.2 flow ensures that timing at the end of detailed routing closely matches the timing seen earlier in the flow, even when crosstalk is taken into account (see also the "Crosstalk issues" section later in this article).

32/28nm low power consumption issues

Figure 3. Power consumption is the most important issue in chip design.

Engineers can deploy a variety of techniques to control the dynamic (switching) and leakage power of a device. These techniques include (but are not limited to) the use of multi-switching threshold (multi-Vt) transistors, multi-supply multi-voltage (MSMV), dynamic voltage and frequency scaling (DVFS), and power supply shutdown (PSO).

With multiple switching thresholds, cells on non-critical timing paths can be built from high-threshold (high-Vt) transistors, which leak less, consume less power, and switch more slowly, while cells on critical timing paths can be built from low-threshold (low-Vt) transistors, which leak more, consume more power, and switch significantly faster.
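As a rough illustration of this trade-off (not Magma's actual algorithm), Vt assignment can be sketched as a slack-driven rule: cells with comfortable timing slack get the low-leakage high-Vt flavor, while timing-critical cells keep the fast low-Vt flavor. The cell names and slack margin below are invented for illustration.

```python
# Toy slack-driven multi-Vt assignment (illustrative only; real EDA
# tools weigh leakage, timing, and placement together).

def assign_vt(cells, slack_margin_ps=50):
    """Map each cell to a Vt flavor based on its worst timing slack.

    cells: dict of cell name -> worst slack in picoseconds.
    Cells with comfortable positive slack get the low-leakage
    high-Vt flavor; timing-critical cells keep the fast low-Vt one.
    """
    assignment = {}
    for name, slack in cells.items():
        if slack >= slack_margin_ps:
            assignment[name] = "high-Vt"   # slower but low leakage
        else:
            assignment[name] = "low-Vt"    # fast but leaky
    return assignment

cells = {"U1": 120, "U2": 5, "U3": -10, "U4": 80}
print(assign_vt(cells))
```

In a real flow this decision is made incrementally during optimization, since swapping a cell's Vt changes the very slacks being measured.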

MSMV involves dividing the chip into regions (sometimes called "voltage islands" or "voltage domains") with different supply voltages. Functional blocks assigned to higher-voltage islands deliver higher performance at higher power consumption, while blocks assigned to lower-voltage islands deliver lower performance at lower power consumption.

Dynamic voltage and frequency scaling (DVFS) optimizes the trade-off between performance and power consumption by changing the voltage or frequency of one or more functional blocks at run time. For example, a nominal supply of 1.0V can be reduced to 0.8V to save power when a block's activity is low, or raised to 1.2V when higher performance is needed. Similarly, the nominal clock frequency can be halved when activity is low, or doubled to meet a short burst of high performance demand.
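The voltage figures above translate directly into power ratios, since dynamic power scales roughly as P = a·C·V²·f. A minimal sketch, assuming ideal CV²f scaling with activity and capacitance unchanged:

```python
# Dynamic (switching) power scales as P = a * C * V^2 * f.
# Relative power under DVFS, normalized to nominal conditions.

def relative_dynamic_power(v, f, v_nom=1.0, f_nom=1.0):
    """Dynamic power relative to nominal, assuming activity and
    capacitance are unchanged (ideal CV^2 f scaling)."""
    return (v / v_nom) ** 2 * (f / f_nom)

# Dropping 1.0 V to 0.8 V at the same clock cuts switching power
# by roughly 36% (0.8^2 = 0.64).
print(relative_dynamic_power(0.8, 1.0))
# Halving the clock at 0.8 V roughly halves it again.
print(relative_dynamic_power(0.8, 0.5))
```

Note that this only models dynamic power; leakage shrinks with voltage too, but by a different (and process-dependent) law.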

As the name implies, power shutdown (PSO) cuts off the power supply to selected functional blocks that are not currently in use. Although this technique is very effective at saving power, it raises a number of issues; for example, the power-up and power-down of related functional blocks must follow a specific sequence to avoid current surges.
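The sequencing constraint can be viewed as a dependency graph: a domain may only be powered up after the domains it relies on are already up. A minimal sketch of such a sequencer, with invented block names, using the standard-library topological sorter:

```python
# Toy power-up sequencer: a domain may only be enabled after the
# domains it depends on are already powered, which also staggers
# turn-on events to avoid rush current. Block names are invented.

from graphlib import TopologicalSorter

# block -> set of domains that must be powered up first
deps = {
    "cpu_core":  {"always_on"},
    "dsp":       {"always_on", "sram"},
    "sram":      {"always_on"},
    "always_on": set(),
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # "always_on" comes first, "dsp" last
```

Powering down would simply walk the same order in reverse; real designs add isolation cells and state retention on top of this ordering.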

Talus Vortex 1.2 provides a complete, integrated low-power solution, including an automated low-power synthesis method that works in concert with concurrent analysis and optimization across multiple voltage and frequency regions. Talus 1.2 places no limit on the number of transistor switching thresholds used, and supports an unlimited number of voltage, frequency, and power shutoff regions. In addition, Talus 1.2 fully supports the Common Power Format (CPF) and the Unified Power Format (UPF). These formats let design teams capture power intent up front and then use it to drive downstream planning, implementation, and verification strategies (see sidebar).

32/28nm crosstalk issues

Continually increasing clock frequencies and decreasing supply voltages mean increasing sensitivity to signal integrity (SI) effects such as crosstalk-induced delay variation and functional failures. At the 32/28nm node these effects are amplified by tighter spacing between adjacent tracks, by track cross-sections that can be taller than they are wide (as shown in Figure 4), which increases coupling capacitance between neighboring tracks, and by the relatively higher resistance of metal tracks and vias.

Figure 4. At the 32/28nm node, a track's height can exceed its width.

Talus 1.2 is known for sophisticated track-based optimization algorithms that let users address crosstalk during global routing, earlier in the flow. Talus 1.2 attacks crosstalk-related problems in several ways, most fundamentally through optimal layer assignment and by spreading routes across available resources; it manages this spreading so as to avoid significant negative impact on wire length or via count. In addition, the global router is multi-threaded to achieve very high performance.

To achieve high performance, all global routers make simplifying assumptions. For example, they assign wires to "buckets" stacked on top of one another without fixing how the tracks will eventually be ordered. In most flows, the actual ordering and placement of tracks is left to the detailed router downstream. Solving crosstalk problems that late in the flow takes an order of magnitude more effort, and the available fixes (for example, upsizing cells, with a corresponding increase in area and leakage power) may not be the best approach, or even an achievable one.


In fact, potential crosstalk effects can only be evaluated accurately once the track ordering and spatial relationships are known. Talus 1.2 therefore converts global track segments into spatially routable segments, which are used to evaluate potential crosstalk issues earlier in the flow; by reordering and assigning the wires at the global routing stage, crosstalk issues can be resolved early. The modifications made at the global routing stage then guide the detailed router downstream, yielding a better solution with far less computation.

32/28nm process variation issues

For silicon manufactured at 180nm and larger technology nodes, designers only needed to account for a small amount of wafer-to-wafer variation: differences in characteristics such as timing (performance) and power consumption between dies from different wafers. Such differences can be caused by process variations from one foundry to another and by small differences in equipment and operating environment, such as furnace temperature, doping levels, etch concentrations, and the photolithography masks used to pattern the wafers.

At these larger technology nodes, die-to-die process variations (differences between dies on the same wafer) and intra-die process variations (differences between regions of the same die) were relatively unimportant. (Die-to-die variations are also called "global" or "chip-to-chip" variations.) For example, if a chip's core voltage was 2.5V, in most cases the entire die could be assumed to sit at a consistent, stable 2.5V; similarly, the die could be assumed to have a uniform temperature.

As new technology nodes with ever-smaller geometries emerge, die-to-die and intra-die process variations become increasingly important. Some of these variations are systematic, meaning they shift in predictable ways. For example, some parameters of a chip manufactured near the center of a wafer may differ from those of a chip manufactured toward the edge; in this case, all affected parameters can be predicted to shift in a similar way. Other parameters fluctuate independently and randomly; such variation is said to be area-based (as opposed to distance-based).

Figure 5. At the 32/28nm node, inter-die and intra-die variations are extremely important.

Die-to-die and intra-die process variation, collectively referred to as on-chip variation (OCV), become extremely important at the 32/28nm node. With each new technology node it becomes harder to control key dimensions such as the width and thickness of transistor structures, tracks, and oxide layers, so the relative variation (as a percentage of some nominal value) grows with each node.

The traditional way to handle OCV is a first-order approach that applies a blanket derating tolerance across the entire chip. At the 32/28nm node, however, this approach is too pessimistic, leading to over-design, reduced performance, and longer timing closure cycles. Talus 1.2 therefore deploys a sophisticated advanced OCV (AOCV) algorithm that applies context-specific derating values based on the proximity of cells and tracks (for example, two adjacent cells have less variation relative to each other than two cells at opposite ends of a die). This more realistic model trims excess margin, reducing pessimistic timing violations and improving device performance.
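A minimal sketch of the difference between blanket and proximity-aware derating, with all constants invented for illustration (real AOCV tables are characterized per library and also depend on path depth):

```python
# Simplified AOCV-style derating: the further apart two cells are,
# the less their variations track each other, so a larger derate
# factor is applied. Constants are invented for illustration.

def aocv_derate(distance_um, base=1.02, slope=0.0001, cap=1.10):
    """Late-path derate factor growing with cell separation."""
    return min(base + slope * distance_um, cap)

def flat_ocv_delay(delays_ps, derate=1.10):
    """Traditional first-order OCV: one blanket derate everywhere."""
    return sum(d * derate for d in delays_ps)

def aocv_delay(stages):
    """stages: list of (delay_ps, distance_um from the launch point)."""
    return sum(d * aocv_derate(x) for d, x in stages)

stages = [(100, 10), (100, 50), (100, 200)]
print(flat_ocv_delay([d for d, _ in stages]))  # pessimistic bound
print(aocv_delay(stages))                      # tighter bound
```

The AOCV total is smaller because nearby stages receive almost no derate, which is exactly the pessimism reduction described above.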

32/28nm Multimode Multicorner (MMMC) Issues

In addition to the manufacturing-process variations discussed in the previous topic, we must also address potential variations in the environmental conditions (such as voltage and temperature) under which the chip operates. All of these variations can be classified under PVT (process, voltage, and temperature).

For devices built at earlier technology nodes, die-to-die and intra-die PVT differences were negligible. One could simplify the work by assuming consistent process characteristics across the entire die and stable environmental conditions such as core voltage and temperature. Under these assumptions, it is relatively easy to determine the best-case (minimum) delay of each path using a set of best-case conditions (highest allowed voltage, lowest allowed temperature, and so on); similarly, it is relatively easy to determine the worst-case (maximum) delay of each path using a set of worst-case conditions (lowest allowed voltage, highest allowed temperature, and so on).

Figure 6. The large number of modes and corners that need to be resolved at the 32/28nm node.

A specific set of conditions, such as worst-case or best-case PVT, is what we commonly call a "corner". At the 32/28nm technology node, die-to-die and intra-die PVT differences are pronounced, and a large number of modes and corners must be handled. The low-power techniques described above complicate this further. For example, with multi-supply multi-voltage (MSMV), one voltage island may sit at the bottom of its allowed voltage range, another at the top, and the rest anywhere in between. Similarly, some chips have operating modes in which one or more blocks in the middle of the die are powered off, which significantly increases the number of corner scenarios that must be analyzed.
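The combinatorial growth can be made concrete: every independent axis (process corner, temperature, operating mode, per-island voltage) multiplies the number of scenarios. The axis values below are invented for illustration:

```python
# Corner-count explosion: each independent analysis axis multiplies
# the number of mode/corner scenarios. Axis values are illustrative.

from itertools import product

process_corners = ["slow", "typical", "fast"]
temperatures    = ["-40C", "125C"]
modes           = ["run", "sleep", "test"]   # functional modes
island_voltages = ["vmin", "vnom", "vmax"]   # states per voltage island
n_islands       = 2

scenarios = list(product(process_corners, temperatures, modes,
                         *[island_voltages] * n_islands))
print(len(scenarios))  # 3 * 2 * 3 * 3^2 = 162
```

Adding one more voltage island would triple the count again, which is why tools that can only afford a hand-picked subset of corners risk missing the true worst case.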

The problem with many current tools is that, during implementation, the chip must be optimized under the full MMMC view. Many existing systems approach optimization by first assuming a worst-case scenario and then optimizing against it. Unfortunately, this leads to excessive pessimism and suboptimal performance. Worse, if the assumptions about the worst case are wrong, the result can be a completely non-functional chip. Talus 1.2 has MMMC handling built in, so no scenario is missed during optimization. Moreover, the speed and capacity of Talus 1.2 mean it can consider not just a small subset of scenarios but the entire set of sign-off scenarios the tool must handle. As a result, Talus 1.2 delivers better performance and shorter implementation cycles.

Enhancing Talus Vortex performance with Distributed Smart Sync technology

Each step of the physical implementation flow described above is computationally intensive, and the amount of computation required grows with each technology node. In addition, as more and more functionality is integrated into each device, design size and complexity grow with each node as well, further increasing the computational demands of physical implementation.

Another factor is that the size of functional blocks (the number of cells required to implement a block's functionality) continues to grow as more features are packed into each function. Some physical implementation teams prefer a hierarchical approach, while others prefer a "flat" approach because they feel they give up too much with hierarchy.

If tools could handle larger circuit blocks, productivity would improve immediately. Defining and fine-tuning inter-block constraints in a hierarchical flow, for example, is an extremely time-consuming and resource-intensive task; with larger blocks there would be no sub-blocks, and therefore no inter-block constraints to define.

The problem is that most place-and-route solutions are limited to a few million cells. This often forces physical implementation engineers to partition circuit blocks manually just to fit within tool limits, which in turn hurts productivity.

Without some form of enhancement, even the state-of-the-art Talus 1.2 place-and-route solution has a practical capacity of only 2-5 million cells, delivering a productivity of 1-1.5 million cells per day. The result is a capacity-driven productivity gap. For 32/28nm node designs, it is essential to implement flat circuit blocks containing more than 10 million cells, as shown in Figure 7 (see also sidebar).

Figure 7. The insatiable demand for flat capacity in physical implementation tools.

In the past, the capacity and performance of physical implementation tools have been enhanced by adding multi-threading. In some cases these capabilities were "bolted on" to legacy tools with limited effect. By contrast, all tools in Talus 1.2 have native multi-threading fully built in.


As noted above, multi-threading alone yields limited gains: by Amdahl's law, the benefit of adding threads (each running on its own core) diminishes. In simple terms, the speedup of any program is limited by its sequential portion, the part that cannot run in parallel with the rest, as shown in Figure 8.

Figure 8. Amdahl's Law reflects the limitations of multithreading.

For the physical implementation tools used to create ASIC/ASSP/SoC devices, the parallelizable fraction is about 50% to 75%. As Figure 8 shows, even at the "sweet spot" of 8-10 processing cores, and in the best case, only about a 3x speedup can be achieved.
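The ~3x figure follows directly from Amdahl's law: with parallelizable fraction p and n cores, speedup = 1 / ((1 − p) + p/n). A quick check of the numbers quoted above:

```python
# Amdahl's law: overall speedup is bounded by the sequential fraction.

def amdahl_speedup(parallel_fraction, n_cores):
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_cores)

# With 75% of the tool parallelizable, 8 cores give only ~2.9x,
# and even unlimited cores cannot exceed 1 / (1 - 0.75) = 4x.
print(round(amdahl_speedup(0.75, 8), 2))      # 2.91
print(round(amdahl_speedup(0.50, 8), 2))      # 1.78
```

At a 50% parallel fraction the ceiling drops to 2x regardless of core count, which is why distribution across machines, rather than more threads, is the lever discussed next.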

Fortunately, the limit defined by Amdahl's law can be overcome by distributing the physical implementation tasks across multiple machines. As shown in Figure 9, Talus Vortex FX enhances Talus 1.2 with the new Distributed Smart Sync technology, which provides unique distributed management combined with smart synchronization across all steps of the physical implementation flow (except clock tree synthesis, which this approach does not help much).

Figure 9. High-level view of the Talus Vortex FX flow enhanced with Distributed Smart Sync technology

The concept behind this technology is to intelligently partition a larger design or block, distribute the partitions across a network of servers for implementation, and then automatically resynchronize them at key stages of the main flow. Essentially, this lets designers work on much larger designs while achieving the same throughput (cells per day) previously achieved only on much smaller circuit blocks. Even with the same number of cores/threads, this distributed approach is 2-3 times faster than the best multi-threaded flat flow.
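The partition/distribute/resynchronize pattern can be sketched in a few lines. This toy only illustrates the shape of the flow; the actual Distributed Smart Sync partitioning and synchronization algorithms are proprietary, and the "optimization" here is a trivial stand-in:

```python
# Toy "partition, optimize in parallel, resynchronize" pattern.
# Illustrative only: a real distributed P&R flow ships partitions
# to separate machines and merges real placement/routing results.

from concurrent.futures import ThreadPoolExecutor

def optimize_partition(cells):
    """Stand-in for the per-partition implementation work."""
    return [c.upper() for c in cells]  # trivial 'optimization'

def partition(design, k):
    """Split a flat cell list into k roughly equal partitions."""
    return [design[i::k] for i in range(k)]

def distributed_flow(design, k=4):
    parts = partition(design, k)
    with ThreadPoolExecutor(max_workers=k) as pool:
        results = list(pool.map(optimize_partition, parts))
    merged = [c for part in results for c in part]  # resync step
    return sorted(merged)

design = [f"cell{i}" for i in range(12)]
print(distributed_flow(design))
```

The point of the sketch is the structure: the work in `optimize_partition` scales out across workers, while the merge ("resync") step remains a cheap, centralized operation.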

Figure 10. Multithreading only vs. multithreading + distributed processing

The productivity of physical implementation engineers is generally measured in cells per day. With the best conventional flows, the maximum achievable is generally around 1 million cells per day. By contrast, Talus Vortex FX's distributed processing raises this to 2-5 million cells per day across the entire flow (for gate placement alone, a metric some users track, productivity improves even more).

It is also worth noting that Talus Vortex FX gives physical implementation teams the ability to perform rapid what-if analysis early in the design cycle to reach the best trade-offs among area, speed, and power consumption. Equally important, Distributed Smart Sync technology builds entirely on existing Talus 1.2 technology, which eases adoption of the product.

As for preserving investment in existing hardware, Distributed Smart Sync technology lets users fully utilize existing machines with 32GB and 64GB of memory. Without it, moving to 32/28nm node designs would require upgrading machines to 128GB or 256GB of memory, which for large server farms could cost millions of dollars.

In addition to improving engineering productivity by shortening design cycles and enabling teams to use a flat approach (without adding resources), Talus Vortex FX also helps meet increasingly tight development schedules, reducing time to market (and time to profit).

Summary

Many issues arise when designing at 32/28nm and smaller technology nodes, including low-power design, crosstalk effects, process variation, and a significant increase in the number of operating modes and corners. Magma's Talus Vortex 1.2 physical implementation environment fully addresses all of these issues.

In addition, the growing size and complexity of 32/28nm node designs raise concerns about engineering resources (greater results without expanding the team), hardware resources (handling larger designs with existing equipment and server farms without upgrading motherboards, adding memory, or purchasing new machines), and meeting increasingly tight development schedules. To address these concerns, Talus Vortex significantly improves capacity and performance through Talus Vortex FX's innovative Distributed Smart Sync™ technology.
