How much do you know about SoC?
SoC is the abbreviation of System on Chip, usually rendered as "system on a chip" or "chip-level system". Because it centers on the chip, SoC also reflects both the connection and the difference between "integrated circuit" and "chip"; the related work spans integrated circuit design, system integration, chip design, production, packaging, testing, and more. Like the definition of "chip", embedded SoC emphasizes the whole. In the embedded field it is defined as: a system or product formed by combining multiple integrated circuits with specific functions on one module, comprising a complete hardware system and the embedded software it carries.
Memory is a key module in SoC (System on Chip) designs for resource-scarce systems, and it is the part with the largest share of SoC cost. The hardware and software design of memory management is therefore an important part of SoC software architecture design. Architects must strike a balance between cost and performance: save memory while preserving the performance of the whole system. System memory requirement assessment is the most basic duty of an embedded software architect and one of the most important skills. Generally, when a SoC project is established, the architect must complete the system memory requirement assessment.
Embedded SoC has two notable characteristics: first, the core hardware design is difficult; second, software accounts for a large proportion of the system, which calls for software and hardware co-design. An analogy helps here: the advantages of cities over rural areas are obvious, namely complete supporting facilities, convenient transportation, and high efficiency.
Embedded SoCs have similar traits. More supporting circuits are integrated into a single module, saving integrated-circuit area and thus cost, much as a city uses energy more efficiently. On-chip interconnect is the city's expressway: high-speed and low-power. Information exchange between devices that were once scattered across a circuit board is concentrated in the same module; a destination that used to require a long-distance bus can now be reached by subway or BRT, which is obviously much faster. A city's service sector is well developed and highly competitive; the software on an embedded SoC is that service sector, so good hardware must be matched by good software. And the same hardware can do one job today and another tomorrow, just as a city improves the allocation, scheduling, and utilization of society's resources.
It can be seen that embedded SoC has obvious advantages in performance, cost, power consumption, reliability, life cycle and scope of application.
Embedded SoC Design Concept
In integrated embedded system design, the system as a whole is often very complex and demands learning a great deal of related technology. Worse, any change in requirements or design forces the work to start over, affecting the entire system.
Modular design is the basic design concept of embedded SoC: the whole application electronic system is built from modules. Users only need to select and replace modules and embedded structures according to their needs, without spending time mastering the underlying circuit development techniques. The advantage of modular design is that the system comes closer to the ideal and the design goals are easier to reach.
Embedded SoC Design Reuse Technology
Embedded system applications are becoming more and more complex, and the underlying hardware drivers bear directly on the stability of the whole system. If you start from raw register operations and build the entire development platform step by step, you must invest heavily in money, people, and time to make the system reliable. For complex embedded products it is therefore better not to start from scratch, but to build at a higher level and reuse proven module technology. Only then can the design be completed faster, succeed reliably, and reach the market in time.
Embedded SoC design reuse is based on core modules (CORE): verified, complex embedded hardware and software systems are reused in subsequent designs. An embedded SoC usually consists of two parts. One part is the hardware module: a complex, high-performance minimal embedded processing system with specific functions, verified for stability, which a new design can call directly as a functional module.
The other part is the firmware (firm core); development proceeds on top of this firmware, as shown in Figure 5-1. Development engineers leave the register-level development mode behind and need not understand the ARM hardware internals. They only call the firmware's API functions, such as the underlying hardware drivers, OS, GUI, FAT file management system, TCP/IP protocol stack, and CAN-bus high-level protocol, to quickly develop a stable and reliable product. This is the goal of embedded SoC design reuse technology.
Embedded SoC Memory Assessment Process
The following uses the SoC design of low-end multimedia electronics (such as boomboxes, point-and-read learning machines, voice recorders, etc.) as a prototype to illustrate the general assessment process.
1. Decompose the functions and performance of each application scenario according to product specifications
Product specifications generally describe application functional scenarios and performance. Architects need to decompose the functions and performance of each scenario and analyze the relationship between each scenario in terms of memory usage. This includes:
1) List all application scenarios and clarify the life cycle of each application, when it starts and when it ends.
2) Whether the system needs to support multiple applications (multiple processes) at the same time. For example, browsing pictures while listening to music means the two applications use memory at the same time, so their application memory cannot be time-shared.
3) Does the system need to support multiple media at the same time? For example, accessing card devices and flash memory devices at the same time. Generally, in a single process, only a single storage device is accessed, unless data replication is implemented. However, in a multi-process environment, it is normal for different processes to access different storage devices. Accessing different storage devices at the same time means that the two drivers are using memory at the same time.
4) Whether the system needs to support multiple file systems at the same time. Different storage devices may carry different file systems, which raises the same simultaneous-memory-use problem as in 3).
5) Clarify the codec formats supported by the system, which is reflected in the algorithm memory requirements. Different codec formats have different memory requirements. For the same algorithm, different rates also lead to different memory requirements.
6) System performance requirements, such as LCD screen refresh: a larger framebuffer naturally gives better performance.
2. Layer the system software and clarify the composition of each layer module
1) The system is divided into layers: startup, driver, operating system, file system, middleware (algorithms, UI), application framework, application, and so on. Typical consumer electronics, such as multimedia devices and game consoles, are layered this way, and each layer consists of multiple modules. For example, drivers split into character device drivers and block device drivers: buttons usually belong to character devices, while storage devices usually belong to block devices and may support NAND flash, SD/MMC cards, Uhost, etc. The file system layer may offer FAT32, exFAT, and so on, and the application layer of course contains many applications.
2) Clarify the software layers each application requires. Some applications need many layers; music, for example, runs from the application through the application framework (UI + buttons), API, middleware (decoding), operating system, and driver layers, while the settings application needs no decoding middleware.
3. Identify the memory time-sharing modules in each software layer and find the modules with the largest memory requirements
As discussed in the article "Software Design Techniques for Saving Memory in Resource-Constrained Embedded Systems", applications, drivers, middleware, and data segments all have reuse potential. Distinguish the component modules within each software layer from step 2 above and clarify whether each module can be time-shared. Among those that can, find the module with the largest memory requirement. For example, the NAND flash driver is more complex than the card driver, so its memory requirement is naturally higher; likewise, the music application is more complex than the settings or FM application, so its requirement is naturally greater.
4. Analyze the code of the module with the largest memory demand, and roughly identify its resident memory code and bank management code
The resident code segment generally holds frequently called, performance-critical code, such as interrupt management and message management. In a typical application, a large amount of code can instead be loaded and executed on demand, such as music sound-effect management and volume setting. Such code does not need high execution performance and can be loaded into memory in a time-shared manner, achieving memory time-sharing multiplexing.
5. Determine the resident code space and time-sharing memory space for each software layer
Reducing the resident code space as far as cost demands will reduce code execution performance, because bank code must be loaded before execution, usually read from NAND flash or card; shrinking the memory reused by bank code as far as possible likewise causes frequent bank switching and hurts performance. So we cannot simply minimize memory; we must carefully analyze each sub-module's functional and performance requirements for memory. For example, if two sub-modules are 8 KB and 4 KB, we can consider a 2 KB reuse space, dividing the former into 4 banks and the latter into 2, and check whether the performance target is still met; a 4 KB reuse space is more efficient but costs more; a 1 KB space gives the former 8 banks and too many switches. A toy calculation of this trade-off follows.
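As a sanity check, the trade-off can be put into a toy calculation. All sizes below are hypothetical illustrations, not values from a real design:

```c
/* Toy bank-count calculation for the 8 KB / 4 KB example above.
 * Smaller banks save RAM but mean more load-on-demand switches. */
#include <stdio.h>

int main(void) {
    const unsigned module_a = 8 * 1024;   /* code size of sub-module A */
    const unsigned module_b = 4 * 1024;   /* code size of sub-module B */
    const unsigned bank_sizes[] = { 1024, 2048, 4096 };

    for (unsigned i = 0; i < 3; i++) {
        unsigned bank = bank_sizes[i];
        printf("bank %uK: A needs %u banks, B needs %u banks\n",
               bank / 1024, module_a / bank, module_b / bank);
    }
    return 0;
}
```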
6. Identify the code space that can be solidified
An application's resident code cannot be solidified, because each application needs different resident code, i.e., it changes; whereas operating-system code such as interrupt management, time management, and task scheduling is generally unchanging and can be solidified into ROM, saving memory.
7. Consider other special needs
Steps 1-6 give us a rough total of the system's memory requirements. Now we must consider the memory requirements of special scenarios and check whether the budget already covers them, for example the memory distribution during the startup phase, or the requirements while the OS boots and initializes. These are not in the product specification, but the architect must consider them all the same.
The figure obtained in step 6 is therefore usually re-evaluated and slightly adjusted.
Hardware/software partitioning has always been the core technique of SoC chip design. The system software architect and the chip system designer jointly evaluate and design each module of the SoC, weighing performance, cost, software programming flexibility, software scalability, and so on to decide which parts of a module should be hardened from software into hardware, and which should be softened from hardware into software.
Most program code needs to be loaded into memory only when it is about to execute, and can be discarded or overwritten by other code afterwards. Many applications run simultaneously on a PC, and the virtual address space available to each is almost the entire linear address space (minus a part reserved for the operating system and other uses). Each application can be considered to own the whole virtual address space (a 32-bit CPU has a 4 GB virtual address space), yet physical memory is only 1 GB or 2 GB. Multiple applications thus compete for the same physical memory, which inevitably means that at any instant only certain fragments of each program are resident and executing: all program code and data time-share the physical memory space. This is the core role of the memory management unit (MMU).
Processor-class chips (such as x86, ARM7 and above, MIPS) generally have an MMU, which implements virtual memory management together with the operating system; an MMU is also a hardware prerequisite for operating systems such as Linux and WinCE. Controller-class chips (for the low-end control field: ARM1/ARM2, the MIPS M series, 80251, etc.), however, generally have no MMU, or only a single linear mapping mechanism.
1. Working mechanism of memory management unit (MMU)
Before explaining memory management in the controller field, we should first introduce the virtual memory management mechanism of the processor field, since the former largely borrows the latter's core mechanism. Several modules cooperate to implement virtual memory management: the CPU, the MMU, the operating system, and physical memory, as shown in the figure (assuming a chip series without cache):
Let's walk through a CPU memory access based on the figure above. Assume the address is 0x10000008 and the page size is 4K (12 bits). The virtual address divides into two parts: the page number (upper 20 bits, 0x10000) and the page offset (lower 12 bits, 0x8). The CPU sends the address (0x10000008) over the bus to the MMU, and the MMU takes the 20-bit page number to the TLB for matching.
What is the TLB? The Translation Lookaside Buffer. The literal renderings of the name that circulate online suggest even the translators did not know what it does. Its function is to cache the page table; I like to call it the page-table cache. Its structure is as follows:
As you might imagine, the TLB is an array of index entries, each containing a virtual page address and a physical page address, implemented as registers inside the chip. Registers are generally 32 bits wide, and each page address in the TLB is likewise held in a 32-bit register, but only the upper 20 bits take part in the index comparison. The lower 12 bits serve other purposes: one bit, for instance, can mark an entry as resident, meaning the index is always valid and cannot be replaced; this is designed for codec algorithms with especially high performance requirements. Non-resident entries are replaced as needed (for example, when a new page address is accessed while the TLB is full).
1) If the upper 20 bits of 0x10000008 hit the Mth entry in the TLB, physical memory has already been allocated to that virtual page and recorded in the page table. Whether the code page behind the virtual address has actually been loaded from external storage (flash, card, hard disk) into memory needs a separate mark: a designated bit among the lower 12 TLB bits (call it K). K = 1 means the code/data has been loaded into memory; K = 0 means it has not. If K is 1, the physical page address in entry M becomes the upper 20 bits, the page offset 0x8 the lower 12 bits, and the combined physical address is sent to memory; the access completes.
2) If K is 0, the code/data has not been loaded into memory. The MMU then signals the interrupt management module, an interrupt takes the CPU into kernel state, and the operating system loads the corresponding code page into memory, setting the K bit of the corresponding page table entry and TLB entry to 1.
3) If the upper 20 bits of 0x10000008 hit no TLB entry, the MMU likewise signals the interrupt management module and an interrupt takes the CPU into kernel state. The operating system shifts 0x10000008 right by 12 bits (i.e., divides by 4K) to index the page table and fetch the physical page value. If the value is non-zero and valid, the code is already in memory, and the page table entry is filled into a free TLB entry. If the physical page value is 0, physical memory has not yet been allocated to this virtual page; the OS allocates it, writes the corresponding page table entry (K is 0 at this point), and finally writes the entry into one of the TLB slots. A minimal C model of these three cases follows.
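To make the three cases concrete, here is a minimal C model of the lookup flow. The entry layout, the position of the K bit, and all names are illustrative assumptions, not any real chip's TLB:

```c
/* Simplified C model of the TLB flow in 1)-3) above. */
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12
#define K_BIT       (1u << 0)          /* 1 = page contents loaded in RAM */

typedef struct {
    uint32_t vpage;                    /* virtual page number (upper 20 bits) */
    uint32_t ppage_flags;              /* physical page number + low-12-bit flags */
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true and fills *pa on a hit with K = 1; otherwise the hardware
 * would raise an interrupt and the OS would load the page / refill the TLB. */
bool tlb_translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_SHIFT;   /* e.g. 0x10000008 -> 0x10000 */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpage == vpn) {
            if (!(tlb[i].ppage_flags & K_BIT))
                return false;          /* case 2): mapped but not loaded */
            *pa = (tlb[i].ppage_flags & ~0xFFFu) | (va & 0xFFFu);
            return true;               /* case 1): hit, access completes */
        }
    }
    return false;                      /* case 3): TLB miss */
}
```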
Steps 2) and 3) both run in the interrupt (kernel) state, so why not combine them? Mainly because an interrupt should not do too much work; that increases interrupt latency and hurts system performance. Of course, if a chip handles both in one interrupt, that is understandable too. Now consider the structure of the page table. It could of course be an index array like the TLB, but that has two drawbacks:
1) The page table must map every virtual page, so maintaining it in memory costs a lot of space: with 4K pages there are 4G/4K = 1M index entries, and each entry takes 4*2 = 8 bytes, i.e. 8 MB of memory.
2) With a TLB-style structure, index matching is a looping comparison, which in software is very inefficient; remember that this would run in the interrupt state.
Therefore the page table is generally designed as a one-dimensional array: the entire linear virtual address space, in page units, serves as the array subscript. The first word (4 bytes) of the page table maps the lowest 4K of the virtual address space, the second word the next 4K, and so on; the Nth word maps the Nth 4K region, i.e. the address range (N-1)*4K to N*4K. The page table then occupies 1M * 4 = 4 MB, and index matching becomes a mere offset calculation, which is very fast, as the sketch below shows.
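A flat page-table walk then reduces to one shift and one array load. A small sketch, with names and flag layout assumed for illustration:

```c
/* Flat one-dimensional page table as described above: the virtual page
 * number is simply an array index, so lookup needs no loop. */
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  0xFFFu

extern uint32_t page_table[1u << 20];  /* 1M entries x 4 bytes = 4 MB */

uint32_t walk_page_table(uint32_t va)
{
    uint32_t entry = page_table[va >> PAGE_SHIFT]; /* O(1) offset, no loop */
    return (entry & ~PAGE_MASK) | (va & PAGE_MASK);
}
```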
Before introducing the second part, let’s first clarify two concepts:
1. Bank means code block, which is similar to the concept of page mentioned above.
2. Different code time-sharing the same memory: "different code" means code at different virtual addresses (all addresses after linking are virtual addresses), while the memory is physical memory. That is, code blocks of a certain size, at different virtual addresses, run in the same physical memory region of that size at different times. Each such block of code is a different code bank.
2. Software and Hardware Co-design of the SoC Memory Management Unit in the Controller Field
Here we refer specifically to SoC designs without a memory management unit. To cut cost, when performance allows, a 16-bit or 24-bit word-length CPU is generally preferred over a 32-bit one, unless compute performance demands it or the 32-bit CPU's license happens to be cheaper (rare). As long as memory can be managed efficiently and physical memory can be time-shared, the design counts as successful and effective. So before presenting the actual memory management unit hardware design, we first briefly introduce a mechanism that achieves memory time-sharing using only the tool chain.
1. CODE BANKING mechanism implemented by tool chain
The CODE BANKING mechanism is the memory management solution in Keil's C51 toolchain for the 8051 microcontroller family, whose purpose is to expand the accessible address space. An 8051 can address 64 KB; what if program and data exceed 64 KB (quite likely once the system and application grow slightly complex)? Keil's solution uses the P1 port as additional address lines, supporting at most 2 MB, i.e. 32 x 64 KB, which needs 5 P1 lines. The decoding (the high/low levels driven on the P1 lines) is generated automatically by Keil's compiler at build time. You must, of course, mark in the code and link script which functions are bank code; otherwise every function call would get decoding code inserted before it and the program would not run. There is also a common code area, holding e.g. interrupt handling and the operating system's resident segment, which keeps a copy in every 64 KB bank.
One may ask: the CODE BANKING mechanism above only expands the accessible address space and seems to have nothing to do with memory time-sharing. In fact, the same mechanism can be used to time-share memory among different code.
Suppose one of the P1 extension address lines is not wired to the physical memory. Then whether that line outputs 1 or 0, the physical location accessed is the same. Take the address 0x10008: if line P1.0 is not connected to the memory, the 1 in the 17th address bit has no decoding effect, i.e. 0x10008 accesses the memory at 0x00008. In other words, two different addresses correspond to the same physical cell. We can therefore compile and link different functions at different virtual addresses, yet have them all loaded into the same physical memory for execution, achieving memory time-sharing among different code, as the short demonstration below shows.
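The aliasing arithmetic fits in a few lines of C. The 16-line wired mask follows the example above; everything else is illustrative:

```c
/* If P1.0 (address bit 16) is not wired to the memory, bit 16 never
 * reaches the decoder, so 0x10008 and 0x00008 select the same cell. */
#include <stdio.h>

int main(void) {
    unsigned wired_mask = 0x0FFFF;   /* only 16 address lines decoded */
    printf("0x10008 -> 0x%05X\n", 0x10008 & wired_mask); /* 0x00008 */
    printf("0x00008 -> 0x%05X\n", 0x00008 & wired_mask); /* 0x00008 */
    return 0;
}
```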
However, this method also has certain drawbacks:
1) There are at most 32 banks, no more, which a typical multimedia consumer product system can hardly live within.
2) Developers must constantly track the calling flow and whether the needed bank code has been swapped in, and debugging is more troublesome: the 16-bit address must be combined with the P1 port state to determine the actual physical address.
Of course, the original intention of the CODE BANKING mechanism is to expand the accessible memory address space, not for us to use in this way.
2. Software and hardware collaborative design
Finally, we are going to get to the point. In fact, with the above introduction, you will be able to easily understand the following explanation. We will imitate the real MMU to design our memory management unit hardware and consider the relevant details.
Bear in mind that SoC designs using controller-field CPUs (such as 251, the MIPS M series, etc.) are generally highly integrated, with kilobyte-scale memory integrated on-chip as SRAM. The SoC memory management unit structure is as follows:
Comparing this diagram with the MMU structure diagram, we can find the following differences:
1) There is no page table in SRAM.
2) The bank number register group can be understood as the counterpart of the TLB's index array.
3) The core circuit of the bank memory management circuit matches that of the MMU's management circuit and is of course very simple: both implement a cyclic matching process, which in hardware is just a selector circuit.
It can be said that the bank memory management circuit is a simplified MMU, except that there is no page table in memory. Let's look at its working mechanism and see how mapping works without a page table. It is done jointly by the operating system, the application, and the bank memory management unit, and orchestrating it is the architect's responsibility. This must be a textbook classic case of SoC hardware/software co-design!
1) The SRAM is divided into a resident code/data area and several bank areas. Program modules that use memory at different times share the same bank space. For example, two application-layer programs (music and FM) agree to time-share one bank, while two storage drivers (such as the card driver and the flash driver) agree to time-share another bank space. All virtual bank numbers that share one bank space form a bank group; the system thus holds several bank groups, and the physical address range each group uses is agreed in advance.
2) A virtual address consists of two parts: the bank number and the offset within the bank. If the CPU word length is 24 bits and 8 bits encode the bank number, the in-bank offset is 16 bits, i.e. a 64K virtual bank. A physical address also consists of two parts: the bank group number (selecting the physical memory block) and the offset within the bank.
3) The actual physical bank is smaller than the virtual bank; it can be 1K, 2K, and so on. The low 16 bits form the offset within the bank, where virtual and physical memory correspond one-to-one. The physical bank size is determined by the operating system, and applications comply with it when writing module functions.
4) To divide the whole virtual address space into bank groups, the range of bank numbers for each group must be set. Assuming 8 bank groups, each group gets 256/8 = 32 bank numbers.
5) The bank number register group corresponds one-to-one to the bank group set above. At a certain moment, only one bank number of each bank group is written into the corresponding bank number register.
6) With the conventions above, any virtual address tells us which bank group it belongs to, and the bank group plus the in-bank offset determines its exact location in memory. The conventions thus take over the role the page table plays in an MMU, as shown in the following figure:
Given some address, let's trace the memory management unit's working process:
1) The high N bits of the address are matched against the bank number register group. A hit proves that the bank code containing this address is already in memory, and the memory management unit translates the address directly into a physical one for access.
2) A miss means the corresponding bank code is not in memory. An exception is triggered and an interrupt entered. The operating system determines which bank group the address belongs to, locates the corresponding code in the file, reads it from the external storage device into memory, and replaces the corresponding bank number register. When the access is retried after the interrupt returns, it hits and completes without another interrupt.
3) About triggering the exception: the bank memory management circuit is our own design, not an original CPU module, so how does it raise an exception? One way is for the memory management unit to return an unknown instruction on a miss. Every CPU has an unknown-instruction exception, so the CPU faults on executing it and enters an interrupt. Because this instruction is injected artificially, the PC has already advanced past it, so the interrupt handler must wind the saved PC back to before the unknown instruction. The sketch below models the lookup/miss flow in C.
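Here is a rough C model of that flow, using the 24-bit address split from the list above (8-bit bank number, 16-bit offset, 8 bank groups of 32 bank numbers each). All names, table layouts, and the load routine are hypothetical, and the handling of physical banks smaller than the virtual bank is elided:

```c
#include <stdint.h>
#include <stdbool.h>

#define BANK_GROUPS  8
#define BANK_SHIFT   16                       /* 64K virtual bank */
#define NUMS_PER_GRP 32                       /* 256 bank numbers / 8 groups */

static uint8_t  bank_reg[BANK_GROUPS];        /* current bank number per group */
static uint32_t group_base[BANK_GROUPS];      /* agreed physical base per group */

extern void load_bank_from_storage(uint8_t bank_no, uint32_t phys_base);

/* Hardware side: a pure match against the bank number registers. */
bool bank_hit(uint32_t va, uint32_t *pa)
{
    uint8_t bank_no = (uint8_t)(va >> BANK_SHIFT);
    uint8_t group   = bank_no / NUMS_PER_GRP; /* which bank group */
    if (bank_reg[group] == bank_no) {
        *pa = group_base[group] + (va & 0xFFFFu);  /* hit: translate */
        return true;
    }
    return false;   /* miss: hardware raises the unknown-instruction trick */
}

/* OS side, in the exception handler: fill the bank, then retry the access. */
void bank_miss_handler(uint32_t va)
{
    uint8_t bank_no = (uint8_t)(va >> BANK_SHIFT);
    uint8_t group   = bank_no / NUMS_PER_GRP;
    load_bank_from_storage(bank_no, group_base[group]);
    bank_reg[group] = bank_no;                /* replace the bank register */
}
```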
In addition, the architect should also customize the link script according to the rules defined in the above mechanism, so that application and driver developers do not need to worry about the implementation of low-level memory management when programming. The bottom layer is transparent to the upper layer, which is an excellent design!
1. Embedded system software layering
The system software layers include: startup, driver, operating system, file system, libc, middleware, application framework, application and other layers.
- Drivers, the file system, and operating-system services such as time management and interrupt management are generally called through APIs.
- libc, the middleware, and the application framework may be called through APIs or linked directly into the application as static libraries.
- When libc, middleware, and the application framework are used as static libraries, the space occupied by the API layer shrinks (API code is usually resident; there is no reason to load API code from external storage into memory on every call, which would be far too slow). Removing the API layer also speeds up calls, at the cost of extra library code in each application. If the library functions can be linked to run in bank memory, that extra code space can be ignored, because bank memory is reused; this is a further advantage. For example, determining which decoding format a file uses can be implemented as middleware linked into the application's bank space, because it is preprocessing that runs before music decoding and can reuse the same bank space as the control flow that runs during decoding.
- When libc, middleware, and the application framework are called as APIs, a resident memory requirement for the API arises, and only one real copy of the code exists in memory for all modules to call. Application developers need not care about the interface implementation, and they are not allowed to modify it.
Each module should decide the form in which it is called by the upper layer based on actual conditions.
For code paging (block, Bank) design, please refer to: SoC Embedded Software Architecture Design 2: Design Method for Implementing Virtual Memory Management in CPU without MMU and SoC Embedded Software Architecture Design 3: Code Block (Bank) Design Principles.
2. Program segment composition
Here, program segments are the segment names that appear in the executable file, such as the default .CODE, .DATA, and .BSS, plus custom segment names. The figure below shows the correspondence between a program with a bank code segment and the executable file's segment names. With the GNU toolchain, the compiler's output segment names can be specified in the link script, and a function or variable can also be assigned to a segment at the source level: for example, defining a data variable with the attribute __attribute__((section("bank_data"))) relocates that variable into the bank_data segment, as the short illustration below shows.
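A minimal (hypothetical) illustration of such placement, in GNU C; the section names follow the text's example and must match those in the link script:

```c
/* Steer data and code into custom sections for bank placement. */
__attribute__((section("bank_data")))
static int eq_table[32];              /* relocated into the bank_data segment */

__attribute__((section("bank_code")))
void music_effect_process(void)       /* hypothetical bank-resident function */
{
    /* loaded on demand into the bank region before execution */
}
```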
3. SoC built-in memory planning
Generally, if the SRAM built into the SoC exceeds 32 KB, the digital designers will split the built-in memory into blocks, first to reduce circuit delay, and second to use the memory more efficiently. A given memory block may hold code at one moment and data at another (on a Harvard architecture this requires switching the memory's addressing/decoding circuit from code space to data space), and sometimes serve as a special decoding buffer. Some decoding buffers are organized in 24-bit units; if all the memory were designed as one block, such requirements clearly could not be met. The following figure is a common SRAM schematic:
4. Program Memory Space Allocation
Combining the software layering and the program segment composition, the physical memory is generally divided by layer first, and then each layer's program segments are placed within it. The following principles apply:
- The resident segments (code and data) of each layer should be allocated compactly, and each layer's bank space and resident space blocks should also be laid out compactly.
- The start address of a bank space should be aligned to the sector unit for the best code-loading performance.
- Allocate memory for the common scenarios first, then consider the needs of special scenarios and check whether they can reuse the memory of the common scenarios.
- Buffer division should also take scenario reuse into account, or too much is wasted. For example, the decode buffer can double as the buffer for checking the validity of media files during pre-decode preprocessing.
- Sometimes two sets of bank spaces can be merged and used as a single set in another scenario. For example, decoding involves relatively deep software layering (application middleware plus algorithm middleware), while the file-browsing application has fewer layers, so the banks of the two middleware layers can be merged into one set of banks for it.
- A module's code should not straddle two physical memory blocks, or access performance degrades.
- Maximize memory utilization and avoid memory fragmentation.
- The details of memory allocation must appear in a public link file, with meaningfully named symbols defining the start address and length of each segment; no one but the architect may modify this file. A sketch of such a file follows.
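A sketch of what such a public layout file might look like, here as a C header; every address and size is invented purely for illustration:

```c
/* memory_map.h -- public memory layout, owned by the architect only.
 * All addresses and sizes below are hypothetical examples. */
#define SRAM_BASE          0x00000000
#define OS_RESIDENT_BASE   SRAM_BASE                         /* OS resident code + data */
#define OS_RESIDENT_SIZE   0x4000                            /* 16 KB */
#define DRV_BANK_BASE      (OS_RESIDENT_BASE + OS_RESIDENT_SIZE)
#define DRV_BANK_SIZE      0x0800                            /* 2 KB shared by storage drivers */
#define APP_BANK_BASE      (DRV_BANK_BASE + DRV_BANK_SIZE)
#define APP_BANK_SIZE      0x1000                            /* 4 KB shared by applications */
```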
1. Basic principles
Drivers are organized by module in the system, e.g. the KEY driver, LCD driver, file system, card driver, I2C driver. Each module has multiple interfaces: the LCD driver offers cursor positioning, pixel drawing, line drawing, and so on, while the file system offers fread, fwrite, fseek, fopen, etc. The example below uses the file system's fopen.
2. Design and Implementation Methods
1. Driver interface declaration: extern FILE *fopen(const char *path, const char *mode); located in fs.h.
2. Driver interface definition: FILE *open(const char *path, const char *mode) {...}; located in fs.c. Note that the real implementation is named open; fopen is only the API stub below.
3. Driver interface API stub, located in api.S as assembly code (MIPS as the example):
fopen:
    li      v1, FILE_OPEN
    syscall
4. Driver interface function pointer array: struct file_operations fs_fops = { open, read, write, seek }; located in fs.c.
5. When the file system is loaded, its interface function pointer array fs_fops is registered into the system's API management array.
6. The system manages drivers by category: a global array records the base address of each driver's interface function pointer array, with the order agreed in advance. For example, the first element of the array is key_fops for the key driver, the second lcd_fops for the LCD driver, and so on. When a driver is loaded, it records its fops into the corresponding slot of the array through the API management interface.
This convention usually lives in api.h, e.g. #define KEY 0 // the key driver occupies the first slot of the array, and #define FS 2 // FS occupies the third slot.
7. FILE_OPEN definition:
#define FILE_OPEN ((FS << N) + 0)
Located in fs.h, it indicates that fopen is the first interface provided by the file system. The constant packs two pieces of information: the file system's index in the API management array, and the interface's index within the driver's own interface table.
3. Calling process
When the application calls fopen, parameters such as path and mode are passed in registers or pushed on the stack (MIPS uses both the stack and the argument registers a0-a3, while ARM fills registers first and spills to the stack when they run out). Control then enters the fopen API stub, which loads the FILE_OPEN constant into v1 and traps into kernel state via syscall. The API management code quickly finds the address of open from the two fields packed in FILE_OPEN, and on exception return jumps to open, performing the actual interface call. The whole process is then complete. A C sketch of this dispatch follows.
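A rough C sketch of the trap-side dispatch just described; N, MAX_DRIVERS, and all names are assumptions for illustration:

```c
#define N            8                    /* bits for the interface index */
#define MAX_DRIVERS  16

typedef long (*api_fn_t)();               /* generic driver interface */

static api_fn_t *api_table[MAX_DRIVERS];  /* one fops base per driver slot */

/* Steps 5/6 above: a driver registers its fops at its agreed slot. */
void api_register(unsigned drv, api_fn_t *fops) { api_table[drv] = fops; }

/* Run in the syscall handler: split v1 (e.g. FILE_OPEN = (FS << N) + 0)
 * into driver slot and interface index, and return the target address. */
api_fn_t api_lookup(unsigned v1)
{
    unsigned drv = v1 >> N;               /* which driver (e.g. FS) */
    unsigned idx = v1 & ((1u << N) - 1);  /* which interface (0 -> open) */
    return api_table[drv][idx];           /* exception return jumps here */
}
```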
4. API parameter transmission efficiency
An API call transfers two kinds of information: the interface's own parameters, such as fopen's file name and mode, and the API index number, such as FILE_OPEN. Since the dispatch code after the trap ultimately jumps to the target interface, the interface's own parameters should come first (and fit in registers where possible, hence at most 4 parameters), with the index last (the MIPS assembly here places it in the return-value register v1). That way the dispatch code after the trap does not have to shuffle the parameters. This is one of the issues to consider in mixed assembly/C programming.
Resource-scarce embedded systems are generally driven by microcontroller-class chips, with memory resources on the order of kilobytes. Such products are cost-sensitive, and cost is proportional to memory capacity; this is especially true in SoC chip design, where it is a focus of firmware development. Before mass production, the size of the on-chip RAM must be fixed, and the smaller the better, as long as functional requirements are met. The conventional way to save memory is simply to use it efficiently, but there are further software design techniques, which naturally demand system software design ability and programming skill from the developer.
Here we take low-end multimedia electronics as the example. They usually run a custom, streamlined operating system with driver, middleware, and application modules; as the saying goes, the sparrow is small, but it has all the vital organs.
1. Memory time-sharing multiplexing
Time-sharing reuse means managing the code in blocks (Bank). Its design requirements come from:
1. Many electronic products do not run multiple applications simultaneously the way today's Android phones do; at most they browse pictures while playing music, and the same holds for feature phones. Yet such products carry many applications: settings, e-books, phone book, recording, and so on. It is therefore natural for applications that run at different times to occupy the same memory space.
2. Driver space. Many drivers are not used at the same time. For example, when listening to FM, the FM driver is used, and when listening to music, the decoder is used. Therefore, many drivers can also occupy the same space.
3. Middleware reuse. For example, re-encapsulations of the UI and of hardware drivers, called directly by the corresponding applications, generally also have reuse potential.
4. Reuse of data segments. Applications and drivers both have data and also have reuse scenarios.
In theory, applications and drivers could also share space with each other, but there are too many details to consider and the result does not scale, so applications generally do not reuse space with drivers. The software system roughly divides into startup, driver, operating system, middleware, and application layers. Startup runs once, so its space needs little reuse planning. Operating systems generally need resident memory for interrupt management, time management, scheduling management, module code management, the virtual file system, and so on; but some OS functions need not be resident, chiefly infrequently called interfaces such as driver loading/unloading and application initialization. Those non-resident OS interfaces can even share space with drivers.
2. Memory block size
This tests the architect's skill. Too large a block wastes memory; too small a block causes frequent code switching and inefficiency. Since everything is RAM, data can sometimes be placed in the same block as the code segment, avoiding a separate data block. These details must be evaluated comprehensively and designed carefully; in cost-sensitive electronic products, such techniques are worth exploring and exploiting.
3. ROM instead of RAM
This saves memory purely from a cost perspective. Some code must be resident but never changes across versions, such as the operating-system scheduling management mentioned in the previous section; such code can be solidified into ROM. In theory, most resident operating-system code can be solidified. RAM of a given size costs roughly 6 times as much as ROM of the same size, so using ROM can also greatly reduce cost.
4. System porting and tailoring
This is something that operating system designers must consider. Each product has unique functions, but the underlying operating system is universal. In a resource-constrained system, it is very wise to cut off unnecessary modules.
5. Refactoring the executable program
The systems described here are generally closed, so as long as functions are realized efficiently, we may change any code in the system. Take executable ELF files: if the operating system parses the ELF along the standard path before loading, it needs a lot of memory for metadata and is inefficient. What loading and execution really need from the ELF is the .CODE, .DATA, .CONST, .BSS and similar segment information, so we can extract it offline into a new, simple, custom file format that the operating system parses trivially. This saves memory and external storage space alike, and loading becomes efficient. A sketch of such a format follows.
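A minimal sketch of what such a simplified image header might hold, with the segment information pre-extracted offline; the field layout is entirely hypothetical:

```c
#include <stdint.h>

/* Custom image header: the loader reads this instead of parsing ELF. */
typedef struct {
    uint32_t magic;                             /* identifies the format */
    uint32_t entry;                             /* entry point address */
    uint32_t code_off, code_size, code_vaddr;   /* .CODE */
    uint32_t data_off, data_size, data_vaddr;   /* .DATA (+ .CONST) */
    uint32_t bss_size,  bss_vaddr;              /* .BSS: zero-fill only */
} simple_img_hdr_t;
```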
6. Programming skills
This comes from daily accumulation. Take variable layout as an example: we all know the compiler pads for alignment, so the first declaration order below clearly needs more memory than the second (see the struct comparison after the two declarations).
1) char a; int b; char c;
2) char a; char c; int b;
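Wrapped in structs, the difference is easy to measure. Assuming a typical ABI that aligns int to 4 bytes (exact sizes are implementation-defined):

```c
#include <stdio.h>

struct order1 { char a; int b; char c; };   /* a + 3 pad, b, c + 3 pad */
struct order2 { char a; char c; int b; };   /* a, c + 2 pad, b */

int main(void) {
    printf("%zu %zu\n", sizeof(struct order1), sizeof(struct order2));
    return 0;                               /* typically prints: 12 8 */
}
```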
7. Algorithm Design
A good algorithm is generally lightweight and efficient.
8. Compilation Optimization
When compiling, choose a high optimization level, preferably one oriented to size (such as GCC's -Os), so that the generated code shrinks substantially.
9. Instruction compilation mode
For example, choosing Thumb instructions on ARM or MIPS16e on MIPS. This depends on the architecture: these 16-bit instruction modes exist precisely to save code space on CPUs whose word length is typically 32 bits. This approach can reduce code size by roughly 30%.
10. Stack Space Planning
Each thread has its own stack, and each stack should be sized according to that thread's call depth. uC/OS, for example, provides stack-usage checking; we can borrow the idea to observe a thread's worst-case stack depth, as in the sketch below.
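A generic sketch of that stack-watermark idea: fill the stack with a known pattern before the task first runs, then count untouched words. The names are illustrative, not the uC/OS API itself:

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_FILL 0xA5A5A5A5u

/* Paint the whole stack before the task first runs. */
void stack_paint(uint32_t *stk, size_t words)
{
    for (size_t i = 0; i < words; i++)
        stk[i] = STACK_FILL;
}

/* Assuming the stack grows down from stk[words-1], words still holding
 * the pattern at the low end were never touched: the high-water margin. */
size_t stack_unused(const uint32_t *stk, size_t words)
{
    size_t n = 0;
    while (n < words && stk[n] == STACK_FILL)
        n++;
    return n;
}
```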
Setting an independent interrupt stack can avoid each task stack having to reserve stack space for interrupts.
A flat function-call structure generally uses less stack than a deeply nested one. In embedded development, for the sake of efficiency and resources, code should not be split too finely: a profusion of functions increases both code size and stack usage, and reduces running efficiency.
11. Customize the link script and plan the memory space reasonably
For example, we usually separate the code segment and the data segment when planning space, but the code segment's region may not be fully used. In that case, some variables can be placed right after the code segment, making full use of the memory without leaving "fragments".