[Playing with Xianji SPI peripherals, series 3] Summarizing SPI host performance, using reads and writes of a serial NOR flash as the example
1. An aside
This article uses NOR flash as the slave device to verify the performance of the SPI master; the approach itself is not limited to driving flash devices.
For those who really need to drive NOR flash devices, Xianji provides a more convenient and better-performing peripheral, the XPI. It supports a variety of external memories as well as 1/2/4/8 data-line modes, and, most importantly, you only need to call the ROM API; there is no need to write your own driver!
This article focuses on the SPI peripheral; the XPI peripheral is outside its scope.
2. Purpose
In the first three articles of this series we introduced how to maximize the transmit and receive performance of the SPI peripheral, as well as the serial_nor component of hpm_sdk and SFDP. SFDP lets us obtain the relevant parameters of a flash device, such as its erase, read and write commands. Since the SPI peripheral of the HPM6000 series transmits or receives at most 512 bytes per transfer (reportedly later series will not be limited to 512 bytes), a flash read starting at a given address may well exceed 512 bytes, while a flash write can program at most one page (256 or 512 bytes) at a time. In practical applications, therefore, SPI transfers will not stay within the 512-byte limit.
If you want to transfer more than 512 bytes, then from a performance point of view, whether in 1-wire or 4-wire mode, DMA is definitely the best choice. That is the focus and purpose of this article.
3. Theoretical Reading and Writing Performance
Before verifying the performance of the SPI master, we need to know the performance of the slave device, such as the time to program one 256-byte page and to erase a sector or block.
The device used in this article is a Winbond W25Q64JV. For page programming, the typical time to program one page is 0.3 ms, and the typical 4 KB sector erase time is 45 ms. Assuming pure page programming, the write speed works out to roughly 625 KB/s.
For reading, the limit is mainly the SPI frequency supported by the flash. According to the datasheet, read commands other than the normal 03h read (such as the EBh quad read) can run at up to 133 MHz, so four-wire SPI can theoretically reach 66.5 MB/s. Single-wire SPI is limited to the 03h command and therefore to 50 MHz, for a maximum of about 6.25 MB/s.
Because this article connects the flash with Dupont wires, the four-wire QSPI speed is also limited to 50 MHz. With a proper PCB connection, the 80 MHz supported by the SPI peripheral should be no problem.
Therefore, at a four-wire QSPI frequency of 50 MHz, for the flash device used in this article, the theoretical programming speed should be about 625 KB/s and the theoretical read speed should be 25 MB/s.
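As a quick sanity check of the arithmetic above, here is a minimal throughput sketch in plain C (not part of hpm_sdk). It assumes roughly 0.4 ms per 256-byte page, i.e. the typical program time plus the busy-polling overhead used later in this article, which is what the ~625 KB/s figure corresponds to:

/* Back-of-the-envelope throughput check (plain C, not part of hpm_sdk). */
#include <stdio.h>

int main(void)
{
    /* Write: one 256-byte page per ~0.4 ms (program time plus busy polling, assumed). */
    double page_bytes  = 256.0;
    double t_page_s    = 0.4e-3;
    double write_kb_s  = page_bytes / t_page_s / 1024.0;   /* ~625 KB/s  */

    /* Read: throughput = SCLK * data_lines / 8 bits per byte. */
    double sclk_hz     = 50e6;                              /* Dupont-wire limited SCLK */
    double quad_mb_s   = sclk_hz * 4.0 / 8.0 / 1e6;         /* ~25 MB/s   */
    double single_mb_s = sclk_hz * 1.0 / 8.0 / 1e6;         /* ~6.25 MB/s */

    printf("write ~%.0f KB/s, quad read ~%.1f MB/s, single read ~%.2f MB/s\n",
           write_kb_s, quad_mb_s, single_mb_s);
    return 0;
}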
4. Verifying the theoretical read and write performance
(I) Overview of the DMA of the HPM6000
The DMA of Xianji's HPM6000 series consists of two controllers: an XDMA with a 64-bit bus width connected to the AXI bus, and an HDMA with a 32-bit bus width connected to the AHB peripheral bus. The DMA request router provides 16 channels in total, eight for each controller.
The DMA supports both non-chained and chained transfers. A chained transfer can carry out several transfer tasks with different configurations back to back without CPU intervention: each link of the chain stores a descriptor holding one transfer configuration, and after each link completes, as long as the next link's address is valid, the transfer continues. Note, however, that the descriptor area is limited to 4 KB of memory, and each descriptor occupies 32 bytes.
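As a mental model only, a 32-byte descriptor can be pictured as eight 32-bit words; a 4 KB descriptor area therefore holds at most 4096 / 32 = 128 of them. The field names below are illustrative and are not the exact hpm_sdk definition:

#include <stdint.h>

/* Illustrative 32-byte linked-descriptor layout (8 x 32-bit words).
 * The real definition lives in hpm_sdk; the field names here only sketch
 * what such a descriptor has to hold. */
typedef struct {
    uint32_t ctrl;            /* transfer width, burst size, handshake options */
    uint32_t trans_size;      /* number of transfers                           */
    uint32_t src_addr;        /* source address (low 32 bits)                  */
    uint32_t src_addr_high;   /* source address (upper bits, if used)          */
    uint32_t dst_addr;        /* destination address (low 32 bits)             */
    uint32_t dst_addr_high;   /* destination address (upper bits, if used)     */
    uint32_t linked_ptr;      /* address of the next descriptor, 0 = stop      */
    uint32_t linked_ptr_high;
} spi_demo_dma_descriptor_t;  /* sizeof == 32 bytes */

/* A 4 KB area holds at most 4096 / 32 = 128 such descriptors. */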
This article uses both non-chained DMA and chained DMA to illustrate how to optimize the transfer performance.
(II) The SPI transfer start process
After the SPI-related initialization is done, such as pin initialization, SPI mode and clock configuration, the next step is the transfer itself.
A transfer requires the following three steps:
1. Configure the TRANSCTRL register. This register mainly controls whether the address and command phases are enabled, the data phase format (1/2/4 lines), the transfer mode (such as simultaneous read/write, read-only, or dummy-then-read), the write length, the read length, and so on.
2. Configure the CMD register. When the SPI works as a master, this register must be written regardless of whether the command phase is enabled, because the write itself marks the start of the transfer and generates the corresponding number of SCLK clocks for the configured write or read length.
3. Read or write the corresponding data through the DATA register. In polling mode the CPU blocks on the TX/RX FIFO status and services the FIFOs directly; in DMA mode the DMA source address, destination address and other parameters must be configured instead.
hpm_sdk already provides corresponding API interfaces for this transfer flow:
For the polling method, please refer to the earlier article "The SPI peripheral of the HPM6000 uses four-wire mode to read and write Winbond flash"; this article will not elaborate on it.
For DMA mode, the API provided by the SDK is spi_setup_dma_transfer. After calling it, developers configure the DMA parameters themselves and then call the dma_setup_channel API to start the transfer. This flow works for both transmit and receive. For details, see the DMA example in the sample/drivers/spi folder of hpm_sdk.
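The flow can be summarized in a short sketch. The two prototypes below are hypothetical stand-ins for the SDK calls named above (spi_setup_dma_transfer / dma_setup_channel); their real signatures should be taken from the hpm_sdk headers:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the SDK calls named above. */
bool spi_dma_prepare(void *spi_base, uint8_t cmd, uint32_t addr, size_t read_len);       /* steps 1 + 2 */
bool spi_dma_start_rx(void *dma_base, uint32_t channel, void *rx_buf, size_t read_len);  /* step 3      */

/* One read transaction of up to 512 bytes, following the three steps above. */
bool spi_read_packet(void *spi_base, void *dma_base, uint32_t channel,
                     uint8_t read_cmd, uint32_t flash_addr,
                     void *rx_buf, size_t read_len)
{
    /* Step 1: TRANSCTRL - command/address phase enables, data width (1/2/4
     *         lines), transfer mode (e.g. dummy-then-read) and read length.
     * Step 2: CMD - the write to this register is what actually starts the
     *         transfer and clocks SCLK for the configured length.
     * Both are wrapped in spi_dma_prepare() in this sketch. */
    if (!spi_dma_prepare(spi_base, read_cmd, flash_addr, read_len)) {
        return false;
    }
    /* Step 3: the DATA register is serviced by DMA: set source (SPI RX FIFO),
     *         destination (rx_buf) and length, then start the channel. */
    return spi_dma_start_rx(dma_base, channel, rx_buf, read_len);
}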
(III) Non-chained DMA transfer
Since flash read and write operations sit behind an API, from the user's point of view the interface only needs to be given the buffer address and the transfer length; the caller does not care about the internal logic. This article therefore also wraps two interfaces: a program API and a read API.
In the program API, the requested programming length has to be split internally, and the flash address and buffer address are offset accordingly for each split. The appropriate program command is then obtained via SFDP according to the flash address mode (24-bit or 32-bit). In addition, the DMA parameters must be configured, so this part also costs a certain amount of instruction execution time. When programming pages in a DMA loop, it is also necessary to wait for each page program to complete, which adds a wait of about 0.4 ms per page.
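A minimal sketch of that splitting logic, assuming a 256-byte page; the helper names are hypothetical and stand in for the SPI+DMA page program and the WIP (busy) polling described above:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLASH_PAGE_SIZE 256u   /* from SFDP; 512 bytes on some parts */

/* Hypothetical helpers standing in for the SPI+DMA page program and the
 * flash busy polling described above. */
bool flash_page_program_dma(uint32_t flash_addr, const uint8_t *src, size_t len);
bool flash_wait_not_busy(void);                    /* ~0.4 ms per page */

/* Program API: split an arbitrary length into page-bounded chunks. */
bool flash_program(uint32_t flash_addr, const uint8_t *src, size_t len)
{
    while (len > 0) {
        /* Never cross a page boundary within a single program operation. */
        size_t room  = FLASH_PAGE_SIZE - (flash_addr % FLASH_PAGE_SIZE);
        size_t chunk = (len < room) ? len : room;

        if (!flash_page_program_dma(flash_addr, src, chunk)) {
            return false;
        }
        if (!flash_wait_not_busy()) {              /* wait for the page to finish */
            return false;
        }
        flash_addr += chunk;
        src        += chunk;
        len        -= chunk;
    }
    return true;
}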
In the read API, since the maximum transfer length of the HPM6000 is 512 bytes, the read length also has to be split internally, and the appropriate read command is obtained via SFDP according to the flash address mode (24-bit or 32-bit). The DMA parameters must again be configured, costing some instruction time. During reading, however, there is no need to wait for the flash busy state: after one 512-byte frame has been transferred, the next 512-byte frame is sent immediately, and so on in a loop.
To minimize this instruction execution time, with HPM's DMA it is only necessary, after each transfer, to reconfigure the transfer length and the source and destination addresses and then restart the DMA. HPM also provides these individual configuration steps as independent API interfaces, as sketched below.
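A sketch of the read side under the same assumptions; the helpers are hypothetical and only mark where the per-packet TRANSCTRL/CMD setup and the lightweight DMA re-arm (length, source, destination, re-enable) would go:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPI_MAX_TRANSFER 512u   /* per-transfer limit of the HPM6000 SPI peripheral */

/* Hypothetical helpers; in a real driver these map onto the per-packet
 * TRANSCTRL/CMD setup and the lightweight DMA re-arm described above. */
bool spi_flash_issue_read(uint32_t flash_addr, size_t len);  /* TRANSCTRL + CMD for one packet       */
bool dma_rearm_rx(uint8_t *dst, size_t len);                 /* update size/addresses, re-enable DMA */
bool dma_wait_done(void);                                    /* wait for DMA completion              */

/* Read API: no flash-busy polling between packets, only the DMA re-arm. */
bool flash_read(uint32_t flash_addr, uint8_t *dst, size_t len)
{
    while (len > 0) {
        size_t chunk = (len < SPI_MAX_TRANSFER) ? len : SPI_MAX_TRANSFER;

        if (!dma_rearm_rx(dst, chunk)) {           /* ~1.8 us of CPU work per packet */
            return false;
        }
        if (!spi_flash_issue_read(flash_addr, chunk)) {
            return false;
        }
        if (!dma_wait_done()) {
            return false;
        }
        flash_addr += chunk;
        dst        += chunk;
        len        -= chunk;
    }
    return true;
}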
To verify the DMA read/write performance with a large amount of data, this article uses an SPI frequency of 50 MHz and 15 KB of data to read and write the flash, with the highest optimization level -O3 enabled. In addition, a GPIO is toggled to measure the execution time of the above APIs.
The speed of the 15 KB transfer is calculated from the time difference, as shown below. The programming performance of the flash device itself is about 630 KB/s, and that is clearly reached here. For reading, however, the theoretical speed is 25 MB/s while the measured speed is 22 MB/s.
Capturing SCLK, CS and the IO test pin with a logic analyzer shows the following:
When the read API is entered, the transfer does not start immediately; there is some configuration logic processing time, plus a DMA configuration time for each packet.
1. Testing the read API receiving 15 KB of data, the total time is 692.742 us, i.e. 22.17 MB/s, which is consistent with the printed result.
2. The overhead consists mainly of the entry splitting and transfer configuration, which takes 11.666 us, plus the per-packet DMA configuration for the 512-byte packets, which totals 56.5 us; together that is 68.166 us.
3. The actual data transfer time is therefore 692.742 - 68.166 = 624.576 us, i.e. 15360 / 624.576 ≈ 24.6 MB/s. This portion still includes the transmission of addresses and commands, so the actual DMA transfer essentially reaches the theoretical 25 MB/s. The accounting is redone in the short snippet below.
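For reference, the same accounting in plain C, using the measured values quoted above:

#include <stdio.h>

int main(void)
{
    double total_us    = 692.742;                 /* measured: whole 15 KB read          */
    double entry_us    = 11.666;                  /* unpacking + transfer configuration  */
    double per_pkt_us  = 56.5;                    /* sum of per-packet DMA configuration */
    double overhead_us = entry_us + per_pkt_us;   /* 68.166 us                           */
    double data_us     = total_us - overhead_us;  /* 624.576 us                          */

    printf("overall  : %.2f MB/s\n", 15360.0 / total_us);  /* ~22.2 MB/s */
    printf("data only: %.2f MB/s\n", 15360.0 / data_us);   /* ~24.6 MB/s */
    return 0;
}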
(IV) Chained DMA transfer
In non-chained DMA transfer, the main overhead is reconfiguring the DMA after each 512-byte packet. Although each packet only costs about 1.8 us, the overhead accumulates over the total number of packets and still amounts to a noticeable loss. So is there a way to eliminate the gap between packets? From the description above, with chained DMA the whole SPI transfer sequence only needs to be placed into DMA descriptors before the transfer starts; no CPU intervention is needed afterwards, and the DMA executes it by itself. Let's verify this idea.
1. Each SPI transfer needs a TRANSCTRL configuration, a CMD configuration and a DATA<->buffer DMA configuration, so each SPI transfer in the chain needs three DMA descriptors. Each descriptor occupies 32 bytes and the descriptor area is bounded at 4 KB, so the maximum chained transfer is 4096 / (32 × 3) = 42 descriptor groups, i.e. 42 × 512 = 21504 bytes, roughly a 21 KB transfer length. Writing the descriptors also has a corresponding API interface, dma_config_linked_descriptor (see the sketch after this list).
2. When enabling the first DMA transfer, you only need to point the linked_ptr address at the start of the descriptor area that was written.
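A sketch of how such a chain could be assembled. The descriptor layout and the fill/link helpers below are hypothetical placeholders; in hpm_sdk each descriptor would be written with dma_config_linked_descriptor and chained through its linked_ptr field:

#include <stddef.h>
#include <stdint.h>

#define SPI_MAX_TRANSFER  512u
#define DESC_POOL_BYTES   4096u                  /* descriptor area must stay within 4 KB */
#define DESC_SIZE         32u                    /* one descriptor = 32 bytes             */
#define DESC_PER_PACKET   3u                     /* TRANSCTRL, CMD, DATA<->buffer         */
#define MAX_PACKETS       (DESC_POOL_BYTES / (DESC_SIZE * DESC_PER_PACKET))  /* 42 packets */

typedef struct { uint32_t word[8]; } demo_desc_t;   /* 32-byte placeholder descriptor */

/* Hypothetical fill/link helpers; in hpm_sdk each descriptor would be written
 * with dma_config_linked_descriptor() and chained through its linked_ptr. */
void fill_transctrl_desc(demo_desc_t *d, size_t len);
void fill_cmd_desc(demo_desc_t *d, uint8_t cmd, uint32_t flash_addr);
void fill_data_desc(demo_desc_t *d, uint8_t *dst, size_t len);
void link_desc(demo_desc_t *from, demo_desc_t *to);   /* set linked_ptr; NULL terminates */

static demo_desc_t desc_pool[MAX_PACKETS * DESC_PER_PACKET];

/* Build the whole chain up front; afterwards one DMA start runs every packet
 * back to back with no CPU intervention. */
size_t build_read_chain(uint8_t read_cmd, uint32_t flash_addr, uint8_t *dst, size_t len)
{
    size_t pkt = 0;
    while (len > 0 && pkt < MAX_PACKETS) {
        size_t chunk   = (len < SPI_MAX_TRANSFER) ? len : SPI_MAX_TRANSFER;
        demo_desc_t *d = &desc_pool[pkt * DESC_PER_PACKET];

        fill_transctrl_desc(&d[0], chunk);            /* write SPI TRANSCTRL        */
        fill_cmd_desc(&d[1], read_cmd, flash_addr);   /* write SPI CMD (starts SPI) */
        fill_data_desc(&d[2], dst, chunk);            /* DATA <-> buffer transfer   */
        link_desc(&d[0], &d[1]);
        link_desc(&d[1], &d[2]);
        link_desc(&d[2], (len > chunk && pkt + 1 < MAX_PACKETS) ? &d[3] : NULL);

        flash_addr += chunk;
        dst        += chunk;
        len        -= chunk;
        pkt++;
    }
    return pkt;   /* point the first transfer's linked_ptr at &desc_pool[0] */
}

With three 32-byte descriptors per 512-byte packet, the 4 KB area allows 42 packets, which matches the roughly 21 KB chained transfer length above.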
Verifying again, we can see that chained DMA does not improve much over non-chained DMA, and is in fact slightly slower.
SCLK, CS and the IO pin are again captured with the logic analyzer.
Why is the chained DMA transfer slower than the non-chained one? From the analyzer capture, there are several reasons:
1. Configuring the DMA descriptors takes a long time, 49.088 us. Because the data address and data length are not known in advance, filling in the descriptors inside the interface at run time is unavoidable.
2. There is a gap between links in the chain, 1.266 us in the figure. This is the time taken to execute the descriptors that write the TRANSCTRL and CMD registers, with the CMD write kicking off the transfer. This time is reasonable and unavoidable.
3. Each SPI transfer carries two register writes in its DMA descriptors. DMA is good at reducing the time of large data transfers, but for single register assignments the HPM6000 core, running at several hundred MHz, can do the job in just a few instructions. As a result, the per-packet DMA reconfiguration gap of non-chained DMA ends up similar to the inter-link gap of chained DMA, and letting the CPU do the register writes can even be faster.
4. The actual data transfer time is therefore about 613 us (650 - 1.266 × 30), i.e. 15360 / 613 ≈ 25 MB/s. This portion again includes the transmission of addresses and commands, so the actual DMA transfer clearly reaches the theoretical 25 MB/s and is even slightly faster than non-chained DMA.
5. Conclusion
1. Since the transfer length of the HPM6000 SPI peripheral is limited to 512 bytes, large transfers must be split into packets, either as packetized non-chained DMA transfers or as chained DMA transfers. With some instruction-level optimization, the ideal speed can still be reached.
2. For the SPI peripheral, chained DMA transfer spends part of its execution time filling in the DMA descriptors, but once they are filled the data transfer needs no CPU intervention, so that phase is faster than packetized non-chained DMA transfer.
3. With non-chained DMA, the DMA has to be reconfigured for each packet, which adds a gap between packets that the developer needs to optimize. However, since HPM's core frequency is high enough, the gap generally does not exceed about 1.28 us.
4. Chained DMA transfer shows its advantage most clearly when a large block of data is moved in one go. For moving a single register value by DMA, however, the CPU, thanks to HPM's frequency advantage, performs the assignment faster than the DMA.
5. From the perspective of the SPI peripheral as a master, the benefit of chained DMA is not as obvious as that of packetized non-chained DMA, because every SPI transfer in the chain also has to move two register values. For ease of use, the packetized non-chained DMA scheme is the preferred solution.
6. This concludes the discussion of the SPI peripheral master function of HPM.