Accelerating Packet Processing with Xilinx FPGAs

Publisher: 数据之翼 | Latest update: 2011-02-18

By Andy Norton

Distinguished Engineer, Office of the CTO, CloudShield Technologies, SAIC
Email: ANorton@cloudShield.com

As 10Gb Ethernet matures and the industry looks ahead to 40GbE and 100GbE, the next generation of network infrastructure is emerging. Converged networks present new challenges for scalable, open traffic-processing platforms. Next-generation converged infrastructure backplanes typically combine high-performance terabit switching fabrics with programmable content processors capable of handling 10 Gbps of application-layer traffic across a constantly emerging and increasingly complex variety of applications.

CloudShield has created a new family of programmable packet processors that can inspect, classify, modify, and replicate packets while interacting dynamically with the application layer. Our Flow Acceleration Subsystem (FAST) uses Xilinx® Virtex®-class FPGAs to perform packet pre-processing for the CloudShield Deep Packet Processing and Modification blades. These FPGAs contain 10Gb Ethernet MACs and are equipped with an ingress processor for classification and key extraction, an egress processor for packet modification, a packet queue built on quad-data-rate (QDR) SRAM, a Xilinx Aurora-based messaging channel, and a search engine based on ternary content-addressable memory (TCAM). Our FPGA chipset buffers and processes packets with minimal CPU involvement, achieving processing rates of up to 40 Gbps. Using Layer 2-7 field lookups, it performs packet modifications flexibly and deterministically at line rate, based on dynamically reconfigurable rules.

Core Functions of FAST Packet Processors

The deep packet processing blades we currently deploy use two blade access controller FPGAs and one packet-switch FPGA, all implemented in Virtex-5 LX110T devices. Each blade access controller provides data-plane connectivity through two Xilinx 10GbE MAC/PHY cores, an inter-chip interface based on Xilinx ChipSync™ technology, and packet processing built from Xilinx IP cores. The packet-switch FPGA uses the standard Xilinx SPI-4.2 IP core to interface with our network processor (NPU) and with our search engine IP core.

To keep the SoC design focused on packet processing, we used standard Xilinx IP cores wherever possible. We chose the Xilinx 10Gb Ethernet MAC core with dual GTP transceivers to implement the 4 x 3.125-Gbps XAUI physical-layer interface. For the NPU interface, we used the Xilinx SPI-4 Phase 2 core with dynamic phase alignment and ChipSync technology, supporting up to 1 Gbps per LVDS differential pair. Our main packet processing IP cores are as follows:

• FAST Packet Processor: The FPP's Ingress Packet Processor (FIPP) is responsible for Layer 2-4 packet parsing, hash generation of keys and flow IDs, and per-port Layer 3-4 checksum validation. The FPP's Egress Packet Processor (FEPP) performs egress packet modifications and recalculates the Layer 3-4 checksums.

• FAST Search Engine: Our FSE maintains a flow database in TCAM and QDR SRAM that determines what processing should be applied to ingress packets. The FSE accepts a key message from the FIPP for each port, decides how the packet should be handled, and returns a result message to the queue that originally sent the key.

• FAST Data Queue: Our Data Queue (FDQ) stores incoming packets in an "out-of-order" holding buffer. When an ingress packet is written to the QDR SRAM, the queue sends the key message from the FIPP to the FAST Search Engine. The FSE uses this key to decide how to handle the packet and then returns a result message to the FDQ. Based on the result message, the queue can forward, copy, or drop each buffered packet. In addition, the queue can independently modify packets that are forwarded or copied.
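The article does not define the formats of these key and result messages, so the following C sketch is only a rough illustration of the exchange between the FDQ and the FSE; every field name, width, and layout here is an assumption rather than the real FAST message format.

#include <stdint.h>

/* Hypothetical key message sent from the FDQ to the FAST Search Engine
 * once an ingress packet has been written to QDR SRAM. */
struct fast_key_msg {
    uint16_t packet_id;    /* ID the FDQ assigned to the buffered packet */
    uint8_t  ingress_port; /* 10GbE port the packet arrived on */
    uint8_t  key_len;      /* valid bytes in key[] */
    uint8_t  key[40];      /* extracted lookup key, e.g. the 5-tuple */
};

/* Hypothetical result message returned by the FSE to the FDQ. */
struct fast_result_msg {
    uint16_t packet_id;    /* echoes the key message's packet ID */
    uint8_t  action;       /* drop, forward, forward-to-NPU, copy, ... */
    uint8_t  dest_port;    /* output queue / destination port */
    uint16_t modify_rule;  /* index into the Flow Modification Table, if any */
};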
Data Flow Input and Output

Figure 1 shows the data flow through our flow acceleration subsystem. The core FPGA functions are shown in green, the packet data flow in yellow, the control messages in blue, and the external devices in gray.

The customer data flow begins with packets received on the 10GbE network ports. Packets on each port enter the FAST Ingress Packet Processor for parsing and analysis (number 1 in the figure). After classifying the protocol and packet, the FIPP locates the Layer 2, 3, and 4 header offsets. Next come flow hashing and key extraction (the flow-selection lookup rule, such as the 5-tuple of source IP address, destination IP address, source and destination ports, and protocol). At this point, our queue manager buffers received packets into free memory pages in the external QDR SRAM. Packets at this stage are considered out of order; we hold them in the external QDR SRAM while they await FAST scheduling.

The FAST Data Queue (number 2 in the figure) assigns a packet ID and dispatches a key message to the FAST Search Engine (number 3 in the figure). The FAST Search Engine uses the key to identify the flow. The matching flow entry in the external TCAM provides an index into the Flow Action Table in the associated SRAM; the matching flow action is determined by the application subscriptions configured by the customer. The FAST Search Engine replies with a result message to the FDQ (number 4 in the figure), and the action scheduler assigns the packet to an output queue based on its assigned action. We then dequeue the packet from the packet queue to the designated destination output port (number 5 in the figure), where our FAST Egress Packet Processor (number 6 in the figure) performs any packet modification required by the assigned action, following the rules in the Flow Modification Table.
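To make the 5-tuple extraction and flow hashing step concrete, here is a minimal C sketch that pulls the key fields out of an Ethernet Type II IPv4 packet carrying TCP or UDP and folds them into a flow ID. The header offsets follow the standard formats; the hash function and key layout are illustrative assumptions rather than the FIPP's actual implementation.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Standard 5-tuple flow key, as shown in Figure 2. */
struct five_tuple {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  protocol;
};

/* Extract the 5-tuple from an Ethernet Type II IPv4 packet.
 * Returns 0 on success, -1 if the packet is not IPv4 carrying TCP or UDP. */
static int extract_five_tuple(const uint8_t *pkt, size_t len, struct five_tuple *t)
{
    if (len < 14 + 20)
        return -1;
    if (pkt[12] != 0x08 || pkt[13] != 0x00)       /* EtherType 0x0800 = IPv4 */
        return -1;

    const uint8_t *ip = pkt + 14;                 /* Layer 3 header offset */
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;      /* IPv4 header length in bytes */
    t->protocol = ip[9];
    memcpy(&t->src_ip, ip + 12, 4);               /* addresses kept in network order */
    memcpy(&t->dst_ip, ip + 16, 4);

    if (t->protocol != 6 && t->protocol != 17)    /* only TCP (6) and UDP (17) carry ports */
        return -1;
    if (len < 14 + ihl + 4)
        return -1;
    const uint8_t *l4 = ip + ihl;                 /* Layer 4 header offset */
    t->src_port = (uint16_t)((l4[0] << 8) | l4[1]);
    t->dst_port = (uint16_t)((l4[2] << 8) | l4[3]);
    return 0;
}

/* Toy hash that folds the key into a flow ID (illustrative only; the
 * hardware's hash function is not described in the article). */
static uint32_t flow_hash(const struct five_tuple *t)
{
    uint32_t h = t->src_ip ^ t->dst_ip ^ t->protocol;
    h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
    h ^= h >> 16;
    return h;
}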

If our FAST Search Engine matches the customer flow, it performs the specified action; if not, it applies the default rule (drop, or send to the NPU). The basic actions we allow are: drop the packet, forward the packet directly to a network port, forward the packet to the NPU for exception processing, or copy the packet and forward each copy according to independent rules. Our extended actions include packet collapse (deleting part of the packet), packet expansion/write (inserting a range of bytes into the packet), packet overwrite (modifying a range of bytes), and combinations of these. For example, a packet overwrite rule could modify the MAC source or destination address, modify the inner or outer VLAN tag, or change Layer 4 header flags. Insertion/deletion examples can be as simple as removing an existing EtherType and inserting an MPLS label or VLAN Q-in-Q tag, or as complex as inserting an IP header as a GRE delivery header followed by a GRE header (Generic Routing Encapsulation is a tunneling protocol; see Internet RFC 1702 for details).
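As a summary of these options, the C enumerations below name the basic actions and the extended modification operations. The names and encodings are illustrative only; the article does not specify how FAST encodes them in hardware.

/* Basic actions a matched (or unmatched) flow can trigger. */
enum fast_action {
    ACT_DROP,            /* discard the packet */
    ACT_FORWARD,         /* forward directly to a network port */
    ACT_FORWARD_TO_NPU,  /* hand off to the NPU for exception processing */
    ACT_COPY_FORWARD     /* replicate the packet; each copy follows its own rule */
};

/* Extended modification operations applied at egress, possibly in combination. */
enum fast_mod_op {
    MOD_COLLAPSE,        /* delete part of the packet */
    MOD_EXPAND_WRITE,    /* insert a range of bytes into the packet */
    MOD_OVERWRITE        /* modify a range of bytes in place */
};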


Figure 1 – Data flow in the flow acceleration subsystem

Figure 2 – 5-tuple key extraction for Type II Ethernet TCP/IP packets

FAST Packet Processor




The FAST ingress packet processor decodes every packet to determine its Layer 2, Layer 3, and Layer 4 content (where present). After the initial Ethernet Layer 2 decode is complete, the packet may require further Layer 2 processing. We then move to Layer 3 to process either an IPv4 or IPv6 packet. If one of these Layer 3 types is present, we proceed to Layer 4 processing. While the packet is being decoded, our key extraction unit locates and stores the key fields to build the search key that the FAST Search Engine uses for the subsequent flow lookup. Figure 2 shows the Ethernet Type II TCP/IP packet format, the standard 5-tuple key to be extracted, and the resulting key extracted in this example.

We also perform IP, TCP, UDP, and ICMP checksum calculations on packets at both the ingress and egress processors. Two Virtex-5 FPGA DSP48E slices provide the adders needed for checksum calculation and verification. The first DSP accumulates the data stream on 32-bit boundaries, while the second folds the running total into a 16-bit checksum at the end of the relevant layer. For verification we simply compute the checksum; for recalculation, we clear the checksum byte positions in the incoming data stream and use a holding buffer so that the one's complement of the result can be reinserted. The pseudo-header bytes required by the Layer 4 checksum are multiplexed into the incoming data stream for the final calculation.

The FAST egress packet processor at each output port performs packet modification and Layer 3-4 checksum recalculation and insertion according to a rule table stored in internal block RAM. The FEPP goes beyond the traditional "fixed function" approach to packet modification: it can overwrite, insert into, delete from, or truncate the packet according to the specified modification rule number. Our flow modification rules specify an Opcode for the type of operation, an OpLoc for the starting location, an OpOffset for the offset, an Insert Size for the data to be inserted, a Delete Size for the data to be deleted, whether to perform Layer 3 and Layer 4 checksum calculation and insertion, and whether to chain modification rules.

We can use the packet overwrite capability to simply modify existing fields such as the MAC destination address, the MAC source address, a VLAN tag, or even a single TCP flag. If only the MAC destination address needs to be modified, the action the FEPP receives with the packet might point, for example, to rule 2 in the Flow Modification Table (Figure 3). Rule 2 is preconfigured with the Opcode (overwrite), the OpLoc (the location in the packet, here Layer 2), the OpOffset (the offset from that location), the Mask Type (which bytes to use), and the Modify Data (the data to actually write). The result is that the six bytes starting at the Layer 2 location are overwritten with the preconfigured modification data.
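The accumulate-then-fold scheme described above corresponds to the standard Internet checksum (RFC 1071). Below is a minimal C sketch of that computation, assuming the buffer already contains any required pseudo-header bytes; in the FAST hardware the same two stages are mapped onto the two DSP48E slices rather than a software loop.

#include <stdint.h>
#include <stddef.h>

/* Compute the 16-bit Internet checksum (RFC 1071) over len bytes.
 * Stage 1 accumulates 16-bit words into a 32-bit sum (the first DSP's role);
 * stage 2 folds the carries back in and takes the one's complement
 * (the second DSP's role in the FAST design). */
static uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Accumulate 16-bit big-endian words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len)                       /* odd trailing byte, padded with zero */
        sum += (uint32_t)data[0] << 8;

    /* Fold the 32-bit sum into 16 bits, adding the carries back in. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;         /* one's complement of the folded sum */
}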
Figure 3 – Simple MAC destination address overwrite modification

Another overwrite example is the scenario shown in rule 6, where we want to modify a specific TCP flag such as ACK, SYN, or FIN (see Figure 4). This rule uses the Opcode (overwrite), the OpLoc (Layer 4), the OpOffset (0 bytes from the Layer 4 location), the Mask Type (use byte 14), and a Bitmask (which bits within that byte to change). Because the Mask Type can include or exclude specific bytes, a single rule can overwrite multiple fields.

Figure 4 – Overwrite modification of TCP flags

Our overwrite capability is not limited to data stored in the Flow Modification Table; it can also use data stored as associated data in the Flow Action Table. Rules can specify that the associated data delivered to the FEPP as part of the action be used in the modification, which significantly expands the range of data available. This allows, for example, the entire VLAN tag range to be overwritten.

Our insert/delete capability enables even more complex packet modifications. Take rule 5 (see Figure 5) as an example. The actions associated with rule 5, namely the Opcode (insert/delete), the OpLoc (Layer 2), the OpOffset (starting at byte 12), the ISize (insert data size = 22 bytes), the DSize (deleted byte size = 2 bytes), and the Insert Data (0x8847 plus the MPLS labels), delete the existing EtherType and insert a new EtherType of 0x8847, marking the packet as MPLS unicast, followed by the MPLS label set specified by the Insert Data.

Figure 5 – MPLS label insertion modification

Layout Planning and Timing Closure
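Putting the rule fields together, a flow modification rule can be modeled as a small record, and applying an overwrite such as rule 2 reduces to a bounded copy at an offset resolved from the parsed layer locations. The C sketch below uses illustrative field names and a simplified byte-granular rule (no mask type, bitmask, or chaining); it is not the actual hardware rule format.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative modification opcodes (names assumed, not the hardware encoding). */
enum rule_opcode { OP_OVERWRITE, OP_INSERT, OP_DELETE, OP_TRUNCATE };

/* Simplified flow modification rule. */
struct flow_mod_rule {
    enum rule_opcode opcode;  /* type of operation */
    uint8_t  op_loc;          /* layer anchoring the edit: 2, 3, or 4 */
    uint16_t op_offset;       /* byte offset from that layer's start */
    uint8_t  size;            /* number of bytes to overwrite */
    uint8_t  modify_data[16]; /* preconfigured data to write */
};

/* Apply an overwrite rule to a packet.  layer_off[2..4] hold the Layer 2/3/4
 * start offsets found by the ingress parser. */
static int apply_overwrite(uint8_t *pkt, size_t pkt_len,
                           const size_t layer_off[5],
                           const struct flow_mod_rule *r)
{
    if (r->opcode != OP_OVERWRITE || r->op_loc < 2 || r->op_loc > 4)
        return -1;
    size_t pos = layer_off[r->op_loc] + r->op_offset;
    if (pos + r->size > pkt_len)
        return -1;                /* refuse to write past the end of the packet */
    memcpy(pkt + pos, r->modify_data, r->size);
    return 0;
}

/* Rule 2 from Figure 3 would be encoded here as: opcode = OP_OVERWRITE,
 * op_loc = 2, op_offset = 0, size = 6, modify_data = new MAC destination. */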

The most significant challenges we faced in designing our unique packet processor were the growing complexity of the FPGA designs, rising routing and utilization density, the integration of various IP cores, the use of many hard logic objects (such as BRAM, GTP, and DSP blocks), and the lack of data-flow planning in the earliest stages of the project. The bit files we released for the Phase 1 Virtex-5 FPGAs had low utilization, especially low BRAM utilization, which made timing closure relatively straightforward. As significant new features were added later, BRAM utilization approached 97%, and we became acutely aware of the importance of optimized floorplanning and of how decisions made early in the product life cycle affect its later stages.


The main goal of floorplanning is to improve timing by reducing routing delays. To this end, it is very important to take data flow and pin configuration into account during design analysis. Xilinx PlanAhead™, now integrated with ISE®, served as our single tool for floorplanning and timing analysis, providing interactive analysis and visualization as we navigated the complex path to timing closure in a highly utilized design. PlanAhead gave us insight into our design, that is, the minimum set of constraints we needed to provide to guide the map, place, and route tools to fully meet our timing requirements. We found that to do this, we often needed to optimize the placement of certain critical BRAMs beyond the block-based area constraints. In retrospect, if we had spent more time at the start of the project using PlanAhead to evaluate what-if scenarios, helping us see the optimal data flow and pinout, our tasks later in the design cycle would have been much easier.

Dynamic Adaptive Packet Processing

Our entire flow acceleration subsystem can inspect and modify packets at line rate with a high degree of flexibility, while dynamically interacting with application-layer services to achieve highly adaptive packet processing. Virtex-class FPGAs are an important enabler, providing a system-on-chip platform that could not be achieved with previous-generation FPGAs, accelerating content-based routing and implementing key packet processing functions.

Our next-generation implementation significantly improves performance, further enhances caching capabilities, and adds new features. By consolidating our FAST chipset into a single Xilinx Virtex-6 FPGA, we were able to raise the functionality, interfaces, and performance of the next-generation FAST to an unprecedented level while reducing board space and power requirements, achieving a single-chip deep packet processing coprocessor unit.

Acknowledgements

A complex, state-of-the-art FPGA design like ours would not be possible without a great team. I would like to thank several members of our elite FPGA team: Creg Triplett, FPGA group leader and search engine design leader; Scott Stovall, data queue design leader; Scott Follmer, packet processing design leader; Steve Barrett, verification group leader; and Isaac Mendoza, SystemVerilog expert and verification engineer.
