Design of a new FPGA platform with scalable dynamic reconfiguration-EEWORLD


The new FPGA platform is flexible, scalable, and highly integrated, and it can host a complete heterogeneous dynamic computing system on one or two chips.

Adaptive hardware is very useful in applications such as missile electronics and software radio, where power consumption and system size are limited and the system is highly sensitive to its environment. Dynamic reconfiguration makes it possible to implement a dedicated architecture for each application mode without increasing system power consumption or board size. Traditional solutions rely on centralized control, which no longer seems able to cope with the growing number and heterogeneity of execution units. Only a distributed solution that is both flexible and scalable can provide a future-oriented architecture.

Despite the potential of this technology, dynamic reconfiguration is still a challenge for the industry. Engineers need a clear design approach that fully exploits the advantages of dynamic reconfiguration without affecting the application description and, most importantly, without increasing the development cost. To combine dynamic behavior with high performance, we propose to abstract heterogeneity by adopting a multithreaded execution model. Developers program an application as a collection of threads, regardless of whether the threads are executed on standard processors or on dedicated hardware. In this model, dynamic reconfiguration plays the role of thread preemption and context switching. The FOSFOR (Flexible Operating System FOr Reconfigurable platforms) project, sponsored by the French National Research Agency (ANR), is dedicated to developing this new generation of embedded, distributed real-time operating systems.

1 FOSFOR Architecture Basics

Our goal is to design an architecture that supports new types of system partitioning, in which software and hardware components follow the same execution model. This requires a highly flexible and extensible operating system that provides similar interfaces to both the software and hardware domains. Unlike traditional approaches, this operating system is fully distributed, and the entire platform is homogeneous from the application perspective. This means that application threads can be deployed either statically or dynamically in software (processors) or hardware (reconfigurable units), with uniform access to distributed services.

To achieve high efficiency, we implement OS services in hardware right next to the reconfigurable region. We also implement a communication layer between the heterogeneous OS kernels so that services look homogeneous from the application perspective. Deploying the OS as a large number of modules and execution units across the architecture therefore takes full advantage of virtualization mechanisms, allowing application threads to run and communicate without knowing in advance where each task is placed.

From a programmer's perspective, the application is just a collection of threads. We leverage the dynamic reconfiguration capabilities of Xilinx FPGAs to introduce the concept of hardware threads, which are handled in the same way as software threads while still benefiting from the performance of dedicated compute IP blocks.
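The idea of a single thread model for both domains can be illustrated with a minimal sketch. The class names (`Thread`, `SoftwareThread`, `HardwareThread`), the `bitstream_id` field, and `run_pipeline` are hypothetical and are not part of FOSFOR; the hardware IP block is modelled here by an ordinary Python callable.

```python
from abc import ABC, abstractmethod


class Thread(ABC):
    """Common interface seen by the application, whatever the execution domain."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def run(self, data):
        """Execute the thread body on its execution unit."""


class SoftwareThread(Thread):
    """Hypothetical thread executed as code on a processor."""

    def __init__(self, name, func):
        super().__init__(name)
        self.func = func

    def run(self, data):
        return self.func(data)


class HardwareThread(Thread):
    """Hypothetical thread backed by an IP block in a reconfigurable region.

    The IP block is modelled by a plain callable (an assumption); a real
    platform would load a partial bitstream and drive the block through
    the hardware operating system.
    """

    def __init__(self, name, ip_model, bitstream_id):
        super().__init__(name)
        self.ip_model = ip_model
        self.bitstream_id = bitstream_id

    def run(self, data):
        return self.ip_model(data)


def run_pipeline(threads, data):
    """Run threads in order; the caller never checks where a thread executes."""
    for thread in threads:
        data = thread.run(data)
    return data
```

The application code in `run_pipeline` is identical whether a given stage is a software or a hardware thread, which is the property the execution model aims for.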

Besides considering the execution units in a multiprocessor SoC, the memory structure must also meet several requirements: data storage required by application threads, storage of the execution context of each thread, and data exchange between threads. For the storage of the execution context, we consider several possibilities. One way is to store the execution context centrally, thus providing a medium for distributing it to different execution units. We can identify three communication flows within the platform: application data, control signals, and reconfiguration/execution context. For high-bandwidth data paths between hardware threads, we use a dedicated network-on-chip (NoC).
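Central storage of execution contexts can be sketched as follows. This is only an illustration under simple assumptions: a context is modelled as a dictionary, and `ContextStore` and `migrate` are hypothetical names rather than FOSFOR components.

```python
class ContextStore:
    """Hypothetical central store holding one saved context per thread."""

    def __init__(self):
        self._contexts = {}

    def save(self, thread_id, context):
        # Copy so later changes on the execution unit do not alter the saved state.
        self._contexts[thread_id] = dict(context)

    def restore(self, thread_id):
        # A thread that was never saved starts from an empty context.
        return dict(self._contexts.get(thread_id, {}))


def migrate(store, thread_id, context):
    """Preempt a thread on one unit and resume it on another via the store."""
    store.save(thread_id, context)
    return store.restore(thread_id)
```

Because every execution unit reads and writes the same store, a preempted thread can be resumed on a different unit than the one it was running on.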

Figure 1 General FOSFOR architecture

Figure labels: flexible OS; software threads; application; middleware (virtualization, distribution, flexibility); OS 1 (X services) to OS n (Y services); hardware abstraction layer (HAL); software communication unit; hardware communication unit; software node (GPP); hardware node (reconfigurable region); network on chip; shared memory.

2 Global Architecture

The global architecture, shown in Figure 1, consists of:

A set of general-purpose processors (GPPs). The GPPs support the execution of software threads, as well as a range of operating system services including thread scheduling. GPPs do not have to be homogeneous in their instruction set architecture or in the number of services they provide.

A set of dynamically reconfigurable partitions, also called reconfigurable regions (RRs). Each RR executes a set of hardware threads in parallel or serially. Like GPPs, RRs also support the execution of operating system services, thanks to a hardware operating system (HwOS). These regions correspond to fine-grained (FPGA) or coarse-grained (reconfigurable processor) architectures.

Virtual communication channels that share one or more physical communication channels for control, data, and configuration. The control channel is responsible for distributing communication between operating system services to execution units (GPP and RR). The data channel is responsible for transmitting information related to the environment (devices, sensors) and information exchange between threads. The configuration channel is responsible for transmitting the configuration of software threads (binary code) and hardware threads (partial bitstreams) between the configuration memory and the execution units.

Each processor has its own local memory, which stores local data and, where applicable, software code. Shared memory connected to the data channel allows data sharing between threads on different processors. Each execution unit can access the data and the program code for software execution resources stored in shared memory. Each resource can also access the configuration memory to save and restore its execution context. With this structure, any thread or service can be implemented on any execution resource.
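The way control, data, and configuration channels can share one physical channel is sketched below. This is a simplified model under stated assumptions: each frame carries a channel tag, and the names `Channel` and `PhysicalLink` are hypothetical, not taken from the FOSFOR implementation.

```python
from collections import deque
from enum import Enum


class Channel(Enum):
    CONTROL = 0
    DATA = 1
    CONFIG = 2


class PhysicalLink:
    """Hypothetical shared link: every frame carries its virtual channel tag."""

    def __init__(self):
        self._frames = deque()

    def send(self, channel, payload):
        self._frames.append((channel, payload))

    def deliver(self):
        """Drain the link and sort payloads into per-channel lists."""
        queues = {channel: [] for channel in Channel}
        while self._frames:
            channel, payload = self._frames.popleft()
            queues[channel].append(payload)
        return queues
```

Frames from different virtual channels interleave on the link, but each channel still receives its own payloads in the order they were sent.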

Inside the RR, only hardware tasks need to be dynamically reconfigured. The dynamic region (DR) that hosts the tasks is surrounded by the static region (SR), which contains the hardware implementation of the operating system services and provides the communication medium inside and outside the RR. Internal data flow communication relies on a dedicated on-chip network. The interface between the DR and the SR uses bus macros and has a fixed location. To satisfy this constraint and abstract the heterogeneity of the communication medium, we use a middleware solution to provide virtual access to the reconfigurable partitions. The RR is built according to the model defined in Figure 2. The FOSFOR prototype platform consists of dynamically reconfigurable FPGA devices that can directly support this architectural model. We chose the Virtex-5 device because it can reconfigure rectangular regions.

We define a scheduling/placement algorithm based on the pre-calculated resource requirements of application threads to ensure efficient utilization of FPGA elements (LUTs, registers, distributed memory, I/O) in each RR.
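The article does not detail the scheduling/placement algorithm, so the sketch below uses plain first-fit placement as a stand-in to show how pre-calculated resource requirements can drive the choice of RR. The resource keys, the function names, and the first-fit policy itself are assumptions for illustration only.

```python
RESOURCES = ("luts", "registers", "dist_mem", "io")


def fits(need, free):
    """True if every pre-calculated requirement fits in the free resources."""
    return all(need[r] <= free[r] for r in RESOURCES)


def place(threads, regions):
    """First-fit placement (an assumed policy, not the project's algorithm).

    threads maps thread name -> resource needs; regions maps region name ->
    capacity. Returns thread name -> region name, or None if nothing fits.
    """
    free = {name: dict(cap) for name, cap in regions.items()}
    placement = {}
    for thread, need in threads.items():
        placement[thread] = None
        for region, avail in free.items():
            if fits(need, avail):
                # Reserve the resources so later threads see what is left.
                for r in RESOURCES:
                    avail[r] -= need[r]
                placement[thread] = region
                break
    return placement
```

A thread that fits in no region is reported as unplaced, which in a real system would mean waiting for a region to be freed or falling back to a software implementation.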

Figure 2 Reconfigurable region structure

Figure labels: static region; dynamic region; reconfigurable region; hardware operating system; control; context (bitstream); data; thread data; network on chip; hardware partitioning.

3 Operating Systems, Networks on Chip, and Middleware

To provide flexibility, the FOSFOR architecture uses at least two operating system instances: a software operating system that runs on each processor and handles software threads, and a hardware operating system that manages hardware threads. To achieve the best balance between performance, development time, and standardization, we use an existing software operating system and a new hardware operating system.

The requirements for the software operating system are real-time behavior, support for multiple processors, and basic inter-process communication services. We chose RTEMS, a free and open source operating system. For compatibility reasons, we chose the LEON SPARC soft-core processor, which is also free and open source, as the software node.

The hardware operating system (HwOS) uses the dynamic partial reconfiguration function of Xilinx FPGA to schedule hardware threads as flexibly as traditional operating systems schedule software threads. The hardware thread consists of two parts: dynamic and static. The dynamic part contains an IP module for executing thread functions and a finite state machine for synchronizing the service call order with the hardware operating system. The static part contains a control interface connected to the hardware operating system and a network interface for exchanging data with other hardware and software tasks.
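The finite state machine that orders service calls between a hardware thread and the hardware operating system might look like the sketch below. The states and event names are hypothetical; the article does not list them.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    WAIT_SERVICE = auto()
    DONE = auto()


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "start"): State.RUNNING,
    (State.RUNNING, "call_service"): State.WAIT_SERVICE,
    (State.WAIT_SERVICE, "service_ack"): State.RUNNING,
    (State.RUNNING, "finish"): State.DONE,
}


def step(state, event):
    """Return the next state, or raise if the event is illegal in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def run(events, state=State.IDLE):
    """Feed a sequence of events to the state machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

The thread blocks in `WAIT_SERVICE` until the operating system acknowledges the call, so service requests from one thread are always handled one at a time and in order.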

To support the varied data transfer needs between threads, we developed a flexible on-chip network called DRAFT. The communication services of traditional operating systems are sufficient for communication between software threads. In our design, however, the operating system must also support communication between hardware threads, and DRAFT was designed for this purpose. We synthesize hardware threads one by one for one or more DRs, and statically define each DR interface.

The static definition of the communication interface allows us to define a static on-chip network. Generally, hardware threads require high bandwidth and low latency, so the on-chip network must provide high performance. The topology we chose for DRAFT is an extension of the fat tree topology. The main goal of our design is to limit resource overhead while achieving high performance inter-thread communication.
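To give a feel for tree-based routing distance, the sketch below computes hop counts in a plain binary tree with leaves numbered from 0. This is a generic illustration only: DRAFT extends the fat-tree topology in ways the article does not describe, and the function names are hypothetical.

```python
def levels_to_common_switch(src, dst):
    """Levels to climb in a binary tree before src and dst share a switch."""
    # Leaf indices that differ only in low bits meet at a low-level switch.
    return (src ^ dst).bit_length()


def hop_count(src, dst):
    """Links crossed going up to the common switch and back down."""
    return 2 * levels_to_common_switch(src, dst)
```

With eight leaves, for example, neighbouring leaves 0 and 1 meet one level up, while leaves 0 and 7 must climb three levels to the root. A fat tree keeps this structure but provides more link capacity toward the root, so upper levels do not become a bottleneck.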

The heterogeneity of hardware platforms is a major complexity barrier that designers face when deploying applications. In the FOSFOR project, this heterogeneity comes not only from different embedded processors in the software domain, but also from the integration of software and hardware computing models on a single platform.

This problem can be solved by using middleware to establish an abstraction layer between hardware and software and to provide a homogeneous programming model. The middleware implements a set of virtual channels that allow threads to communicate without having to know where the other threads are implemented. These services are distributed across the platform and provide a flexible and extensible abstraction layer that completes the FOSFOR concept.
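Location-transparent communication can be sketched with named channels: a thread sends to a channel name and never refers to the processor or reconfigurable region hosting the receiver. The `Middleware` class and its methods are hypothetical and stand in for the real service set, which the article does not specify.

```python
from collections import deque


class Middleware:
    """Hypothetical registry of named virtual channels shared by all threads."""

    def __init__(self):
        self._channels = {}

    def open(self, name):
        # Create the channel on first use; later calls return the same queue.
        return self._channels.setdefault(name, deque())

    def send(self, name, message):
        self.open(name).append(message)

    def receive(self, name):
        """Return the oldest pending message, or None if the channel is empty."""
        queue = self.open(name)
        return queue.popleft() if queue else None
```

Moving the receiving thread from software to hardware would change only the transport behind the channel, not the calls made by the sender.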

4 Performance Acceleration

The main reasons for building a hardware OS are performance and flexibility. The OS could have been pure software or pure hardware. Since every call to an OS primitive involves overhead, i.e. thread wait time, the faster the OS, the less time is wasted. To evaluate this overhead, we compared the timings of the hardware OS with those of the original software OS, RTEMS.

Local hardware operations take only tens of cycles, while global hardware operations take hundreds of cycles because they access shared memory. According to our evaluation, local create and delete operations are about 60 times faster than in the software operating system, and other operations are about 50 times faster.

The resource usage of the hardware operating system (Table 1) varies greatly depending on which services are activated and how they are configured, for example the number of objects (semaphores, threads, etc.) chosen for each service. We implemented the system on a Xilinx Virtex-5 FX100T. The remaining resources can be used to implement other system components and the hardware threads themselves.

Table 1 Resource usage of the hardware operating system (Virtex-5 FX100T)

Regarding network performance, in a configuration where DRAFT connects eight components with a 32-bit word width, a buffer depth of four words, and a frequency of 100 MHz, the on-chip network achieves a maximum data rate of 1,040 Mbps for each connected component. The network topology and routing protocol ensure that there is no contention or congestion. At least one communication path is always maintained between two interconnected components. The average latency of data passing through DRAFT is close to 45 clock cycles (450 nanoseconds), which meets the requirements of many applications.
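The latency figure above can be checked with a short calculation. The sketch below converts clock cycles to time at a given frequency. It also computes the raw rate of a 32-bit link moving one word per cycle, which is an assumption used only as a reference point and is not a measured figure from the evaluation. The function names are illustrative.

```python
def cycles_to_ns(cycles, freq_hz):
    """Duration of a number of clock cycles, in nanoseconds."""
    return cycles * 1e9 / freq_hz


def raw_link_rate_mbps(width_bits, freq_hz):
    """Rate of a link that moves one word per cycle (assumed ideal case)."""
    return width_bits * freq_hz / 1e6


latency_ns = cycles_to_ns(45, 100e6)         # 45 cycles at 100 MHz
ideal_mbps = raw_link_rate_mbps(32, 100e6)   # 32-bit words at 100 MHz
```

At 100 MHz one cycle lasts 10 ns, so 45 cycles correspond to the 450 ns quoted in the text.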

5 Conclusion

We propose an innovative operating system that provides a homogeneous, multithreaded execution model on a heterogeneous multicore architecture consisting of multiple processors and dynamically reconfigurable hardware IP blocks. The hardware operating system manages hardware threads, in particular thread creation and deletion, as well as information flow and message queue services. For communication, we propose an on-chip network based on an improved fat-tree topology for data exchange, a dedicated bus for hardware thread management, and a communication layer for synchronization between operating systems.
