Fast Startup for Xilinx FPGAs


In many modern applications, embedded systems must meet extremely demanding timing requirements. One of these requirements is boot time, which is the time it takes for an electronic system to enter an operational state after power is applied. Examples of electronic systems with stringent timing requirements are PCI Express® products or CAN-based electronic control units (ECUs) in automotive applications.

Just 100 milliseconds after power is applied to a standard PCI Express® (PCIe) system, the system's root complex begins scanning the bus to discover the topology and, in the process, initiates configuration transactions. If a PCIe device is not ready to respond to configuration requests, the root complex cannot find it and assumes it does not exist. The device then cannot join the PCIe bus system. [1]

The situation in automotive applications is similar. In a CAN-based network, ECUs can enter a sleep mode in which they stop running and are disconnected from power. Only a small portion of the circuitry remains active to detect a wake-up signal. When a wake-up event occurs, the ECU reconnects power and begins booting. Although some messages may be missed during the first 100 milliseconds after the wake-up event, every ECU on the CAN network must be fully operational once this time has elapsed.

Intensive R&D work between Xilinx Automotive, Xilinx Research Labs, and the Karlsruhe Institute of Technology in Germany is addressing this issue with a two-step configuration method for FPGAs.

Technology trends in the semiconductor industry have enabled FPGA vendors to significantly increase the resources in their devices. However, bitstream sizes have grown proportionally, and so has the time required to configure the device. As a result, even medium-sized FPGAs cannot meet stringent startup timing requirements when low-cost configuration schemes are used. Figure 1 shows the configuration times for different Xilinx® Spartan®-6 FPGA devices using the low-cost SPI/Quad-SPI configuration interface. Even with a fast configuration scheme (Quad-SPI running at a 40 MHz configuration clock), only small FPGA devices can meet the 100 ms startup requirement. The situation is even more challenging for Xilinx Virtex®-6 devices, which offer considerably more FPGA resources and therefore have larger bitstreams.
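The kind of calculation behind a plot such as Figure 1 can be approximated with simple arithmetic: the number of bitstream bits divided by the product of interface width and configuration clock. The following is a minimal sketch, not the authors' calculation. The function name is hypothetical, the 1,450 KB size is the full Spartan-6 bitstream size reported in Table 4, and reading 1 KB as 1,024 bytes is an assumption. The estimate covers the data transfer only and ignores power-on housekeeping, so real configuration times are longer.

```python
def config_time_ms(bitstream_bytes: int, bus_width_bits: int, clock_hz: float) -> float:
    """Estimate the pure transfer time in milliseconds.

    Assumption: one bit is transferred per data line per clock cycle.
    Power-on housekeeping and flash command overhead are ignored.
    """
    if bus_width_bits <= 0 or clock_hz <= 0:
        raise ValueError("bus width and clock must be positive")
    return bitstream_bytes * 8 / (bus_width_bits * clock_hz) * 1000.0


# Assumed example size: 1,450 KB, with 1 KB taken as 1,024 bytes.
FULL_BITSTREAM_BYTES = 1450 * 1024

for width in (1, 2, 4):  # SPI x1, SPI x2, Quad-SPI
    estimate = config_time_ms(FULL_BITSTREAM_BYTES, width, 40e6)
    print(f"SPI x{width} at 40 MHz: about {estimate:.1f} ms transfer time")
```

Under these assumptions, a bitstream of this size fits into a 100 ms budget only on the four-bit interface, and that is before any fixed startup overhead is added.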

To overcome this challenge, Fast Startup configures the FPGA device in two steps instead of a single step (whole chip) full device configuration. Following this novel approach, our strategy is to load only timing-critical modules at power-up using the highest priority bitstream, followed by non-timing-critical modules. This approach minimizes the initial configuration data, thereby minimizing the boot time of the FPGA device for timing-critical designs.

Fast Startup vs. Partial Reconfiguration
Fast Startup allows the FPGA design to start up its critical modules as quickly as possible, much faster than the standard full configuration method [2]. Although Fast Startup is essentially based on partial reconfiguration, it differs from the traditional use of that technique. Partial reconfiguration was originally intended to load the complete design as the initial configuration and then modify it at run time. In contrast, Fast Startup uses a partial bitstream for the very first configuration, so that only a specific (small) area of the FPGA device is configured at power-up. This first configuration contains only those parts of the complete FPGA design that must be configured and running quickly. The rest is configured later, at run time, using partial reconfiguration. Figure 2 illustrates this sequential concept.
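The sequential concept can be illustrated with a small timeline model that compares when the timing-critical modules become available under a single full configuration and under a two-step configuration. This is a simplified sketch: all names, sizes and the throughput value are hypothetical, and housekeeping time as well as the software overhead of loading the second bitstream are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeline:
    critical_ready_ms: float    # when the timing-critical modules can run
    fully_configured_ms: float  # when the complete design is configured


def transfer_ms(size_bytes: int, bytes_per_second: float) -> float:
    """Time to move size_bytes over the configuration interface."""
    return size_bytes / bytes_per_second * 1000.0


def full_configuration(total_bytes: int, bytes_per_second: float) -> Timeline:
    """Single-step configuration: nothing runs until everything is loaded."""
    t = transfer_ms(total_bytes, bytes_per_second)
    return Timeline(critical_ready_ms=t, fully_configured_ms=t)


def fast_startup(initial_bytes: int, remaining_bytes: int,
                 bytes_per_second: float) -> Timeline:
    """Two-step configuration: critical part first, the rest afterwards."""
    first = transfer_ms(initial_bytes, bytes_per_second)
    second = transfer_ms(remaining_bytes, bytes_per_second)
    return Timeline(critical_ready_ms=first, fully_configured_ms=first + second)


# Hypothetical values: 20 MB/s corresponds to 4 data lines at 40 MHz.
RATE = 20_000_000
full = full_configuration(1_500_000, RATE)
fast = fast_startup(300_000, 1_200_000, RATE)
print(f"full configuration: critical modules ready after {full.critical_ready_ms:.1f} ms")
print(f"fast startup:       critical modules ready after {fast.critical_ready_ms:.1f} ms")
```

In this model the total configuration time is unchanged. What improves is the point at which the timing-critical modules start to run.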

Tool Flow Overview
The Fast Startup tool flow relies on a design preservation flow to create partial bitstreams for both timing-critical and non-timing-critical subsystems.

The design preservation flow partitions the FPGA design into logical modules (called “partitions”). Partitions form hierarchical boundaries that isolate the internal modules from other components in the design. Once a partition is implemented (i.e., placement and routing is complete), it can be imported by other implementation runs to implement the partitioned modules in exactly the same way in each instance [3].

Therefore, the first step using the Fast Startup methodology is to partition the complete FPGA design into two parts: a high-priority partition containing the timing-critical subsystems and a low-priority partition for the remaining components.

Figure 1 - Logarithmic representation of calculated Spartan-6 configuration time (worst case calculation)

Figure 2 – Fast Startup concept: sequential configuration

Implementation of High Priority Partitions
There are some general design considerations to get the smallest possible partial bitstream for a high priority partition. First, the partition must contain only components that are either timing critical or that are needed by the system to perform partial reconfiguration of a low priority portion (such as an ICAP). The key to getting a small initial partial bitstream is to implement the high priority partition using the smallest possible area. That is, you must confine the partition to an appropriate region in the FPGA.

This region should provide the resources required by the design, and it should sit at a suitable physical location in the FPGA. Accessing resources outside of this region is possible but not encouraged, although it is generally unavoidable for I/O pins. When choosing a region, also keep in mind that it may block resources needed by the non-timing-critical portion of the FPGA design.

Once you have partitioned the FPGA and have found appropriate regions for these partitions, the next step is to implement the high priority partition using an empty (black box) low priority partition. The resulting bitstream contains many configuration frames for unused resources. You can remove these frames to get a valid partial bitstream for the initial configuration of the high priority partition. [4]
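The idea of dropping frames for unused resources can be sketched with a toy model in which a bitstream is a mapping from frame address to frame payload. This model is purely illustrative: it does not reflect the real bitstream format or any Xilinx tool, and every name in it is hypothetical. In practice the set of kept addresses would also have to include any frames outside the region that the partition needs, for example for I/O pins.

```python
from typing import Dict, Set

# Hypothetical model: frame address -> frame payload.
Frames = Dict[int, bytes]


def strip_unused_frames(full: Frames, needed_addresses: Set[int]) -> Frames:
    """Keep only the frames the high-priority partition actually needs."""
    return {addr: data for addr, data in full.items() if addr in needed_addresses}


def size_bytes(frames: Frames) -> int:
    """Total payload size of a set of frames."""
    return sum(len(data) for data in frames.values())


# Toy device with 8 frames of 4 bytes each; the high-priority
# partition is assumed to occupy frames 0 to 2.
toy_full = {addr: bytes([addr] * 4) for addr in range(8)}
toy_partial = strip_unused_frames(toy_full, {0, 1, 2})
print(f"{size_bytes(toy_full)} bytes before, {size_bytes(toy_partial)} bytes after")
```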

Implementation of Low Priority Partition
To create the partial bitstream for the low priority partition, first, you create an implementation of the complete FPGA design with both partitions, the high priority partition and the low priority partition. Import the high priority partition from the previous implementation so that its implementation is the same as the original one.

For Virtex-6 devices, the partial reconfiguration (PR) flow can be used for all of the implementations above. It automatically generates the partial bitstream for the low priority partition. Since the Spartan-6 device family does not support the PR flow, we used the BitGen option for difference-based partial reconfiguration to obtain the partial bitstream for the low priority partition when implementing Fast Startup for Spartan-6 designs. [5] Figure 3 gives a high-level overview of the tool flow.
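Conceptually, a difference-based partial bitstream contains only the configuration frames that differ between two implementations. The sketch below shows that idea on a toy model (frame address mapped to payload). It is an assumption-laden illustration, not a description of how BitGen works internally, and all names are hypothetical.

```python
from typing import Dict

# Hypothetical model: frame address -> frame payload.
Frames = Dict[int, bytes]


def difference_frames(initial: Frames, complete: Frames) -> Frames:
    """Return the frames of `complete` that are missing from or differ in `initial`."""
    return {
        addr: data
        for addr, data in complete.items()
        if initial.get(addr) != data
    }


# Toy example: the initial configuration holds only the high-priority
# frames; the complete design adds the low-priority frames.
initial_cfg = {0: b"HP0", 1: b"HP1"}
complete_cfg = {0: b"HP0", 1: b"HP1", 2: b"LP0", 3: b"LP1"}
second_stage = difference_frames(initial_cfg, complete_cfg)
print(sorted(second_stage))
```

Because the high priority partition is imported unchanged, its frames are identical in both implementations and drop out of the difference, leaving only the low priority partition.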

Figure 3 – Fast Startup tool flow

Experiments and Results
To verify the Fast Startup configuration method in hardware, our research group implemented this method on a Virtex-6 ML605 board and a Spartan-6 SP605 board.

The Virtex-6 implementation is based on a video application. When users switch on a video system, they expect to see it respond immediately rather than after several seconds. Therefore, in the system shown in Figure 4, a high-priority subsystem equipped with a TFT controller brings up the TFT screen quickly. The low-priority partition, which is loaded second, provides control of and access to the Ethernet core, UART, and hardware timers for other applications.

Figure 4 – Basic block diagram of Virtex-6 and Spartan-6 demonstration (Virtex-6 includes TFT module, Spartan-6 only includes CAN module)

For this demonstration, we used an external flash memory with BPI as the configuration interface. Once the initial high-priority bitstream has configured the processor subsystem, software running from BRAM initializes the TFT controller and writes data to the frame buffer in DDR memory. This ensures that an image appears on the TFT screen quickly during boot. Afterwards, a second bitstream is read from the BPI flash memory and configures the low-priority partition, allowing the processor subsystem to run other applications such as a web server. To facilitate expansion and cleanly isolate the two partitions, an AXI-to-AXI bridge was used. This also minimized the number of nets that cross the boundary between the two design partitions. The low-priority partition shares the system clock with the high-priority partition.

Table 1 shows the FPGA resource utilization. Table 2 shows the configuration times for three methods: the traditional boot method, booting with a compressed bitstream that contains only the high-priority partition [6], and the Fast Startup configuration method. Each method uses the BPIx16 configuration interface with configuration rate (CR) settings of 2 MHz and 10 MHz; the CR option determines the target configuration clock frequency. We measured these times with an oscilloscope by capturing the FPGA’s “init” and “done” signals. The “Compressed” column in Table 2 refers to the compressed bitstream of the high priority partition only. For comparison, the compressed bitstream for the complete FPGA design with both partitions would be 3.1 Mbytes.

Resource type	High priority partition	%	Low priority partition	%
Flip-flops	8,849	2.9	1,968	0.7
LUTs	7,039	4.7	2,197	1.5
I/O	135	22.5	20	3.3
RAMB36s	34	8.2	2	0.5

Table 1 - Occupied FPGA resources (for XC6VLX240T)

Configuration interface (XC6VLX240T)	Traditional (8.9 MB)	Compressed (2.0 MB)	Fast Startup (1.4 MB)
BPIx16 CR2	1,740 ms	389 ms	278 ms
BPIx16 CR10	450 ms	112 ms	84.4 ms

Table 2 - Measured Configuration Times (Virtex-6 Video Design)
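The relative savings in Table 2 can be checked with a few lines of arithmetic. The sketch below uses only the measured values from the table; the helper name is hypothetical.

```python
# Measured configuration times from Table 2, in milliseconds.
TABLE_2_MS = {
    "BPIx16 CR2": {"traditional": 1740.0, "compressed": 389.0, "fast_startup": 278.0},
    "BPIx16 CR10": {"traditional": 450.0, "compressed": 112.0, "fast_startup": 84.4},
}


def reduction_percent(baseline_ms: float, improved_ms: float) -> float:
    """Percentage by which improved_ms is shorter than baseline_ms."""
    return (1.0 - improved_ms / baseline_ms) * 100.0


for interface, row in TABLE_2_MS.items():
    for method in ("compressed", "fast_startup"):
        pct = reduction_percent(row["traditional"], row[method])
        print(f"{interface:12s} {method:12s} {pct:5.1f}% shorter than traditional")
```

For the CR2 setting, Fast Startup is about 84% shorter than the traditional configuration. For CR10 it is about 81% shorter, and it is the only method in the table that stays below 100 ms.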

Spartan-6 Automotive ECU Design
To validate the Fast Startup approach for Spartan-6, we chose an ECU application scenario in the automotive field. Whenever you see an FPGA device in an automotive electronic control unit, it is generally used only by the main application processing unit of the ECU (see Figure 5). Our goal was to implement a design that puts the system processor into the FPGA. This way we can avoid the need for an external processor, thereby reducing the cost, complexity, space and power consumption of the entire system.

Figure 5 – FPGA applications in modern automotive ECUs, with the processor integrated into the FPGA (dashed line)

System Partitioning
For this scenario, system partitioning is obvious. We split our ECU design into a system processor part as a high priority partition and an application processing part as a low priority partition.

This design has many similarities to the Virtex-6 design. The differences are that we use SPI instead of BPI as the interface to the external flash memory, and that the TFT controller is replaced by a CAN controller. After power-up, the system controller has only a limited time to boot and be ready to handle the first communication data. Since the ECU uses the CAN bus for communication, this boot time is typically limited to 100 milliseconds. With traditional configuration methods, it is difficult to meet such a tight timing requirement using a large Spartan-6 device with a low-cost configuration interface such as SPI or Quad-SPI. Using a faster and more expensive configuration interface is not acceptable in the cost-sensitive automotive field.

Measurement Setup
For the SP605 automotive ECU demonstration, we performed measurements in the lab with the setup shown in Figure 6. On the left side of the figure is a Spartan-3-based X1500 automotive platform that implements a network packet generator for the CAN bus (the network sender). The generator can send and receive CAN messages and uses a hardware timer to measure the time between CAN messages. On the right is the target platform, which is not directly connected to the CAN bus but uses a CAN transceiver on an additional custom board. In addition to providing the CAN PHY, this custom board also controls the power supply of the target board.

Figure 6 – Measurement setup for automotive ECU

The procedure for measuring the configuration time starts with the network sender in the idle state, the CAN transceiver on the CAN PHY board in sleep mode, and the SP605 disconnected from the power supply. Next, the network sender starts a hardware timer and sends a CAN message. After recognizing the event on the CAN bus, the CAN PHY wakes up and reconnects the power to the SP605. The FPGA then starts loading the initial bitstream from the SPI flash memory.

Since no receiver acknowledges the message sent by the network sender, the message is immediately repeated until the FPGA has completed its configuration and the software has configured the CAN core with a valid baud rate. Once the CAN core of the Spartan-6 design acknowledges the message, the CAN core of the network sender triggers an interrupt, which stops the hardware timer. This timer now holds the boot time of the SP605 design. Measurements with an additional hardware timer inside the SP605 design show that the software startup time is negligible when the software that configures the CAN core is executed from internal BRAM.
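The measurement principle can be modeled in a few lines: the sender keeps repeating the same frame, and the timer stops at the end of the first frame that the target is able to acknowledge. The sketch below is a simplified model under stated assumptions, not the authors' test software. It assumes that frames are repeated back to back at a fixed interval and that the target can only acknowledge a frame that starts once it is ready; the numeric values are hypothetical.

```python
def measured_boot_time_us(ready_us: int, frame_us: int) -> int:
    """Timer value at the end of the first acknowledged frame.

    Assumptions (not from the article): frames are retransmitted back
    to back every frame_us microseconds, and the receiver can only
    acknowledge a frame that starts at or after ready_us.
    """
    if ready_us < 0 or frame_us <= 0:
        raise ValueError("ready_us must be >= 0 and frame_us must be > 0")
    first_acked_index = -(-ready_us // frame_us)  # ceiling division
    return (first_acked_index + 1) * frame_us


READY_US = 45_000  # hypothetical: FPGA configured and CAN core set up
FRAME_US = 260     # hypothetical: duration of one frame slot on the bus
measured = measured_boot_time_us(READY_US, FRAME_US)
print(f"measured {measured} us, {measured - READY_US} us later than the ready time")
```

In this model the timer reading exceeds the true ready time by at least one and less than two frame durations, which is small compared with boot times of tens of milliseconds.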

Table 3 shows the FPGA resource consumption for each partition. The percentages are relative to the total resources available in the XC6SLX45T device used.

Resource type	High priority partition	%	Low priority partition	%
Flip-flops	3,480	6%	1,941	4%
LUTs	3,507	13%	1,843	7%
I/O	58	20%	20	7%
RAMB	12	10%	2	2%

Table 3 – FPGA resources used in Spartan-6 designs

Configuration interface	Traditional (1,450 KB)	Compressed (920 KB)	Fast Startup (314 KB)
SPIx1 CR2	5,297 ms	3,382 ms	1,157 ms
SPIx1 CR26	292 ms	196 ms	85 ms
SPIx2 CR2	2,671 ms	1,699 ms	596 ms
SPIx2 CR26	161 ms	113 ms	58 ms
SPIx4 CR2	1,348 ms	872 ms	311 ms
SPIx4 CR26	97 ms	73 ms	45 ms

Table 4 – Measured Spartan-6 configuration times
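The relationship between bitstream size and configuration time in Table 4 can be examined by comparing the size ratio with the time ratio for each interface setting. The sketch below uses only the values from the table; the helper names are hypothetical.

```python
# Bitstream sizes from Table 4, in KB.
SIZES_KB = {"traditional": 1450, "compressed": 920, "fast_startup": 314}

# Measured times from Table 4 in ms: (traditional, compressed, fast_startup).
TABLE_4_MS = {
    "SPIx1 CR2": (5297, 3382, 1157),
    "SPIx1 CR26": (292, 196, 85),
    "SPIx2 CR2": (2671, 1699, 596),
    "SPIx2 CR26": (161, 113, 58),
    "SPIx4 CR2": (1348, 872, 311),
    "SPIx4 CR26": (97, 73, 45),
}


def ratio_percent(part: float, whole: float) -> float:
    """part as a percentage of whole."""
    return part / whole * 100.0


def settings_within_budget(column: int, budget_ms: float = 100.0) -> list:
    """Interface settings whose time in the given column meets the budget."""
    return [name for name, times in TABLE_4_MS.items() if times[column] <= budget_ms]


size_ratio = ratio_percent(SIZES_KB["fast_startup"], SIZES_KB["traditional"])
print(f"size ratio Fast Startup / Traditional: {size_ratio:.1f}%")
for name, (traditional, _compressed, fast) in TABLE_4_MS.items():
    print(f"{name:11s} time ratio: {ratio_percent(fast, traditional):.1f}%")
print("Fast Startup within 100 ms:", settings_within_budget(2))
```

At CR2 the time ratio stays close to the size ratio of roughly 22%. At CR26 it rises to between about 29% and 46%, which is consistent with a fixed startup overhead that does not shrink with the bitstream.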

Table 4 shows the configuration time measurements. For these results, we implemented and compared a standard bitstream and a compressed bitstream of the complete design with the Fast Startup method, which uses a partial initial bitstream. The table lists the configuration times for different SPI bus widths and different configuration rate (CR) settings. As expected, the configuration time is roughly proportional to the bitstream size. Because a faster configuration clock does not shorten the housekeeping process, the ratio (in percent) changes at high CR settings.

Validation in Hardware
The advanced configuration method we developed can be called prioritized FPGA startup because it configures the device in two steps. This method is not only essential to address the challenge of increasing configuration time in modern FPGAs, but it can also be used in many modern applications such as PCI Express or CAN-based automotive systems.

In addition to proposing a high-priority initial configuration method, we also validated this method in hardware. We used and tested the tool flow and methodology for Fast Startup to implement a CAN-based automotive ECU on a Spartan-6 evaluation board (SP605) and a video design on a Virtex-6 prototype board (ML605). By using this novel approach, we reduced the initial bitstream size and shortened configuration time by up to 84% compared with the standard full configuration scheme.

Xilinx will support the Fast Startup concept for PCI Express applications in software for 7 series FPGAs and simplify its use with an optimized implementation. In 7 series devices, the new two-stage bitstream approach is the simplest and lowest-cost approach to implement. When designing for the FPGA, the user can enable a two-stage bitstream with a simple software switch. The first stage of this bitstream contains only the configuration frames required to configure the timing-critical blocks. Once this first stage is configured, an FPGA startup sequence takes place and the critical blocks become active, so the 100 ms timing requirement is easily met. While the timing-critical blocks are running (for example, while the PCI Express enumeration and configuration process is in progress), the rest of the FPGA configuration is loaded. The two-stage bitstream approach makes it possible to use inexpensive flash devices to store the bitstream.
