Longfei Li, Zhanzhuang He, Jianfeng Wang, Yangchun Shi and Haiqiang Feng
(Department of Integrated Circuit Design, Xi’an Microelectronics Technology Institute, Xi’an 710065, China)
Abstract: A hardware acceleration method, small segment coalescing (SSC), is proposed to accelerate TCP/IP processing on the receive path. To reduce the number of data copies, CPU interrupts and TCP/IP processing operations, SSC combines small received TCP segments that belong to the same TCP/IP connection into a large TCP packet in the network interface card (NIC). The whole process is implemented in hardware in the NIC, so SSC remains transparent to the upper drivers. Based on an intensive study of the TCP/IP protocol and the NIC mechanism, the coalescing policy is carefully designed so that SSC can decide, without introducing delay, whether and when to start or finish coalescing. In addition, SSC is implemented and integrated into LCE5718, a fully self-designed dual-port Gigabit Ethernet controller. Finally, a simulation environment is constructed to verify the function of the design, a field programmable gate array (FPGA) prototype is set up, and experiments are conducted to evaluate the performance of SSC under different configurations.
Key words: hardware acceleration; network interface card; TCP/IP processing; Ethernet; small segment coalescing
Cloud computing, big data and the internet of things (IoT) have long been a focus of attention, driven by the dramatic increase in network bandwidth. As the most widely used local area network (LAN) technology, Ethernet has developed rapidly over the past several decades. When Ethernet was first introduced in the last century, its bandwidth was only 10 Mbit/s; today it can reach 100 Gbit/s, making it popular in all kinds of applications. High bandwidth is one of the basic requirements for cloud computing, big data, IoT, etc. Another basic requirement is strong computing capability[1]. According to Moore's law, processor speed, or the overall processing power of computers, doubles every 18 months. However, network bandwidth grows even faster than CPU speed: Gilder's law indicates that bandwidth grows at least three times faster than CPU speed[2], which means that if computing power doubles every 18 months, communication speed doubles every 6 months. The gap between CPU speed and Ethernet bandwidth has been widening every year since the 10 Gbit/s Ethernet standard was proposed in 2001. In this circumstance, the CPU, rather than the network bandwidth, has gradually become the bottleneck of system performance. Although multi-CPU and computing-cluster solutions have been proposed to address this problem, their cost makes them unsuitable for individual users. Hence, there is an urgent need to solve the problem by other means.
Essentially, network data processing is I/O-intensive and involves several hardware devices (e.g. NIC, PCI-E, memory, CPU) and system components (e.g. NIC driver, TCP/IP protocol stack). The CPU cost of network data processing falls under three main headings[3]. The first is network protocol processing. According to Refs.[4-6], with gigabit Ethernet, TCP communication at near wire speed (above 800 Mbit/s) costs around 30% of a 2.4 GHz Pentium IV processor. When the Ethernet bandwidth rises to 10 Gbit/s and there is no acceleration in the NIC, in the worst case almost two cores of a quad-core Intel Xeon processor would be exhausted by network processing.
The second is data copying. Taking data reception as an example, data from the network is usually copied three times: from the cable to the NIC, from the NIC to host memory, and from host memory to the kernel. These copies consume thousands of CPU cycles for each packet[7].
Interrupt frequency also affects CPU performance. Generally speaking, the CPU has to suspend its current tasks to handle each interrupt first, and only returns to the suspended tasks after the interrupt has been processed.
In light of these facts, in addition to improving CPU performance or adopting newer transport protocols such as iWARP and RDMA over converged Ethernet (RoCE)[8-9], hardware acceleration in the NIC may be a suitable way to close the speed gap between CPU and network. Most current research puts its effort into optimizing the data flow between NIC and OS, generally without modifying frames[10-12]. Large segment offload (LSO) is an exception: it splits a large frame into multiple small frames in the NIC to relieve the CPU[13]. Inspired by it, we aim to achieve acceleration by properly modifying or combining received frames.
Based on our previous work, we propose a new hardware acceleration method called small segment coalescing, or SSC for short. To the best of our knowledge, SSC is the first hardware-based method that coalesces TCP frames during reception. SSC can be seen as the counterpart of LSO: it combines small received TCP segments belonging to the same TCP/IP connection into one large TCP segment in the NIC. If several frames are coalesced by SSC, the CPU only needs to process one frame. Thus the number of protocol-processing operations and data copies is reduced to one, and the interrupt frequency also decreases, so that fewer CPU resources are consumed and system performance improves.
There has been much research on accelerating TCP/IP processing, and many of the resulting techniques have been adopted in commercial off-the-shelf (COTS) NICs, such as the TCP/IP offload engine (TOE), receive side scaling (RSS), interrupt moderation (IM), and large segment offload (LSO).
TOE is a typical technique that uses dedicated hardware in the NIC to perform TCP/IP processing, but it is expensive to deploy because it requires modifications to the operating system[14-15]. RSS, proposed and developed by Microsoft, allows multiple CPU cores to access different receive queues concurrently[16]. Instead of generating one interrupt for every incoming packet, a NIC with IM generates a single interrupt after several packets have been received or a timer expires[17]. Here we highlight the key mechanism of LSO, focusing on how it decreases the CPU workload.
LSO reduces host CPU overhead by splitting, modifying, and assembling transmitted frames, while increasing the outbound throughput of high-bandwidth network connections. This is done by queuing up large TCP packets and letting the Ethernet controller split them into smaller TCP packets to be transmitted onto the network. Without LSO, when the host needs to send a large block of data over the network, the CPU must first break it down into smaller segments that can pass through all the network elements, such as routers and switches, between the source and destination computers. The CPU then has to package each segment as a frame according to the TCP/IP protocol and move the frames to the NIC one by one.
With LSO, however, the CPU can hand the large block of data over to the NIC directly, in a single transmit request and without further work. The NIC then breaks the large TCP packet down into smaller segments, adding the TCP, IP and Ethernet headers and the CRC checksum to each segment, and finally sends the resulting frames over the network. By offloading this work, LSO significantly reduces the CPU workload.
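For illustration, the LSO splitting step can be sketched in a few lines of Python. This is a conceptual model only; the MSS value, the function name and the 8 KB example are our assumptions, not the controller's implementation.

```python
MSS = 1460   # assumed TCP payload per standard Ethernet frame

def split_for_lso(payload: bytes, start_seq: int, mss: int = MSS):
    """Yield (sequence_number, chunk) pairs, as an LSO-capable NIC would."""
    offset = 0
    while offset < len(payload):
        chunk = payload[offset:offset + mss]
        yield start_seq + offset, chunk   # each chunk's sequence number = base + byte offset
        offset += len(chunk)

# Example: an 8 KB block handed over in a single transmit request
segments = list(split_for_lso(b"\x00" * 8192, start_seq=1000))
assert len(segments) == 6                 # five full-MSS chunks plus a remainder
```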
The receiving process is more complex than the transmitting process, because the CPU cannot know when the NIC will receive a frame and generate an interrupt[18].
Fig.1 illustrates the detailed interaction between the NIC and the host during data reception. Before forwarding received frames, the NIC has to fetch an available buffer descriptor (BD) from the host over PCI-E to learn where the frame can be stored in host memory (step 1). When an Ethernet frame is received from the network (step 2), the NIC transfers the frame to host memory at the address specified by the BD via DMA over PCI-E (step 3). Once the frame is placed in host memory, the NIC updates the BD with information such as the frame length and virtual LAN (VLAN) tag, and posts the updated BD to the return ring (step 4). When the number of used BDs reaches the interrupt threshold or the interrupt timer expires, the NIC generates an interrupt to the CPU to indicate that new frames have been received (step 5). Note that several interrupts are merged into one by interrupt moderation to reduce overhead. After receiving the interrupt request, the driver accesses the return ring to read the BDs, obtains the addresses and lengths of the received frames, and maps the frames into the protocol stack buffer (step 6). After the driver delivers these buffers to the protocol stack, it reinitializes the used BDs with new memory addresses for the next incoming frames (step 7). Finally, the frames are passed to the application after protocol stack processing (step 8).
Fig.1 Diagram of receive data flow
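A minimal software model of this BD handshake may help clarify steps 1-5; the field and function names are chosen by us and are not taken from the LCE5718 register map.

```python
from dataclasses import dataclass

@dataclass
class BufferDescriptor:
    host_addr: int        # where in host memory the frame may be written (step 1)
    length: int = 0       # filled in by the NIC after DMA (step 4)
    vlan_tag: int = 0
    used: bool = False

def receive_frame(bd_ring, return_ring, frame: bytes, vlan: int) -> None:
    bd = next(d for d in bd_ring if not d.used)   # step 1: fetch a free BD (assumes one exists)
    # step 3: DMA the frame to bd.host_addr (modelled as a no-op here)
    bd.length, bd.vlan_tag, bd.used = len(frame), vlan, True
    return_ring.append(bd)                        # step 4: post the updated BD to the return ring

def interrupt_due(return_ring, threshold: int, timer_expired: bool) -> bool:
    # step 5: one interrupt covers many received frames (interrupt moderation)
    return len(return_ring) >= threshold or timer_expired
```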
Small segment coalescing, as the name implies, is a technique that coalesces small frames belonging to the same TCP/IP connection into a single large frame in the NIC before they are moved to host memory. It works on the receive side, where the NIC stores frames received from the cable and forwards them to the host after a series of processing steps. Because the upper-layer software and driver are not involved, SSC is transparent to the operating system: it can be integrated into the NIC directly without any other system modification.
Besides the Ethernet header and CRC checksum, a TCP/IP frame contains the IP header, the TCP header and the payload. The purpose of SSC is to combine multiple payloads into one payload packaged under a single IP header and TCP header in one frame. To be coalesced, the frames must of course belong to the same connection. SSC uses five elements to identify a connection: source IP (SIP), destination IP (DIP), source port (SP), destination port (DP), and protocol code (PR). These five elements, usually called a 5-tuple, uniquely identify a connection in the network[19]. SIP, DIP and PR are in the IP header; SP and DP are in the TCP header. Note that SSC applies only to the TCP protocol, so frames carrying other protocols, such as UDP, are not coalesced.
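The 5-tuple extraction can be sketched as follows. This is a simplified software view of what SSC does in hardware; it assumes an untagged IPv4 frame and that the caller has already checked the frame length.

```python
import struct

def five_tuple(frame: bytes):
    """Extract (SIP, DIP, SP, DP, PR) from a raw Ethernet II frame."""
    eth_type = struct.unpack_from("!H", frame, 12)[0]
    if eth_type != 0x0800:                        # not IPv4: SSC ignores the frame
        return None
    ip = frame[14:]
    proto = ip[9]                                 # PR: 6 means TCP
    ihl = (ip[0] & 0x0F) * 4                      # IP header length in bytes
    sip, dip = struct.unpack_from("!4s4s", ip, 12)
    sp, dp = struct.unpack_from("!HH", ip, ihl)   # TCP source and destination ports
    return sip, dip, sp, dp, proto
```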
The TCP protocol uses sequence numbers to stamp every byte of payload and acknowledgement numbers to confirm every received byte. Hence, after connection establishment, the payloads in TCP frames are numbered sequentially. In this case, apart from the TCP checksum, the only differences between two TCP headers are the sequence number and acknowledgement number, provided the control flags stay the same. Following the TCP/IP protocol and the Ethernet frame format, SSC combines the payloads of consecutive TCP frames into one payload, while recalculating or modifying the relevant header fields, such as the sequence number, acknowledgement number, total length and checksum. SSC then adds the TCP header, IP header, Ethernet header and tail to the combined payload. In this way, SSC merges multiple small TCP frames into one larger TCP frame.
Fig.2 illustrates the SSC mechanism by comparing it with the normal scenario. As Fig.2 shows, SSC can be realized entirely within the NIC. Although frames are modified in the NIC, they still comply with the TCP/IP protocol, so SSC is transparent to the upper protocol layers. This is because TCP uses the sliding-window mechanism to control transmission: multiple frames can be in flight simultaneously without waiting for acknowledgement. In other words, the sliding-window mechanism allows SSC to modify TCP segments as long as the TCP/IP header is correct after coalescing. From the viewpoint of the receive-side TCP/IP stack, receiving multiple consecutive TCP segments of the same connection is equivalent to receiving one larger TCP segment that carries their payloads in sequence. Therefore, if the sequence numbers of received frames are consecutive and the frames belong to the same TCP/IP connection, they can be coalesced by SSC.
Fig.2 Overview of SSC mechanism
In addition, SSC must take the frame size into consideration. The maximum frame length of standard Ethernet is 1 518 bytes (without VLAN tag), so in theory the frame size cannot exceed this boundary after coalescing. Fortunately, jumbo-frame technology allows the maximum Ethernet frame length to be 9 600 bytes or even larger, depending on the manufacturer, and almost all NICs now support this feature[20-21]. Under this condition, SSC need not be restricted by the standard Ethernet frame length, because normal frames can be coalesced into a jumbo frame. Note that the frame length after coalescing still cannot exceed the jumbo-frame limit, and SSC does not operate on received jumbo frames. In short, SSC can generate jumbo frames, but it only coalesces normal-sized TCP frames.
To achieve coalescing, besides the 5-tuple, SSC has to sample and record some basic information and state for each received TCP/IP frame, such as the sequence number, acknowledgement number and data offset. Moreover, to ensure that no extra delay is added in delivering frames to the host, SSC also monitors the number of used BDs and the relevant timer. Overall, for frames belonging to the same connection, SSC decides whether to start coalescing and when to stop it based on the status and information collected from both the NIC and the received frames. This is discussed in detail in the next section.
Based on the description of SSC in the last subsection, there are four basic conditions for frame coalescing: TCP packets, the same connection, consecutive sequence numbers and a proper frame length. In fact, there are additional constraints on SSC, which will be discussed later. Here, we mainly analyze the feasibility of coalescing in real situations based on these four conditions.
To find out how much coalescing is possible in real networks, we collected and analyzed several sets of network frames. We selected five typical applications: instant messaging (IM), file transfer protocol (FTP) transfers, Web browsing, online gaming and online video playback. We ran them on a Windows-based computer equipped with a quad-core 2.3 GHz processor and a 1 Gbit/s Ethernet NIC, and used Ethernet capture tools to collect 10 000 network frames at random on the receive side. Based on the four conditions above, we developed a script to determine how much coalescing is possible by analyzing the captured traces. The results in Tab.1 show that about 40% to 65% of the frames can be coalesced by SSC, with an average of 49%. FTP shows the greatest potential among the five applications because of its smaller number of TCP connections. The IM results further confirm that nearly 50% coalescing can be achieved even when the received frames span a large number of TCP connections. Because frames may traverse many routers and switches and arrive out of order at the NIC, the coalescing potential of Web browsing is the smallest in our tests, but still reaches 39%. In summary, according to the statistical results, SSC is not only feasible but also meaningful for accelerating TCP processing.
Tab.1 Statistical results on coalescing possibility
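A rough reconstruction of such an analysis script, under the four conditions above and with an assumed 9 000-byte jumbo-frame ceiling, might look like this. It is our sketch, not the authors' original script, and the record field names are assumptions.

```python
MAX_LEN = 9000   # assumed jumbo-frame ceiling for the analysis

def coalescing_ratio(frames):
    """frames: list of dicts with keys 'tuple5', 'proto', 'seq', 'payload_len'."""
    coalescible, last = 0, {}
    for f in frames:
        prev = last.get(f["tuple5"])
        if (f["proto"] == 6 and prev is not None                     # TCP, known connection
                and f["seq"] == prev["seq"] + prev["payload_len"]    # consecutive sequence number
                and prev["run_len"] + f["payload_len"] <= MAX_LEN):  # still fits in one jumbo frame
            coalescible += 1
            f["run_len"] = prev["run_len"] + f["payload_len"]
        else:
            f["run_len"] = f["payload_len"]                          # start a new run
        last[f["tuple5"]] = f
    return coalescible / len(frames) if frames else 0.0
```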
As mentioned in Section 2, SSC coalesces multiple frames belonging to the same TCP connection in the NIC to accelerate TCP/IP processing. However, several technical details remain to be described. Specifically, we focus on the following four issues: ① when to start coalescing; ② how SSC coalesces; ③ when to finish an ongoing coalescing; ④ how to make sure there is no extra delay in delivering frames to the host. In addition, the description of SSC is based on the following preconditions: ① TCP/IP frames only; ② no IP header options are set; ③ valid TCP/IP checksums; ④ TCP payload > 0 bytes; ⑤ no TCP header options are set.
To start coalescing, the 5-tuple of any received frame whose SYN flag in the TCP control flags is set to 1 is first recorded in the NIC. A received frame with the SYN flag set indicates that the peer wants to establish a new TCP connection, so it marks the beginning of the connection. The recorded 5-tuple is saved in a memory block called the coalescing table. Then, after a host interrupt, SSC starts coalescing once a received frame meets the coalescing criteria: its 5-tuple matches an entry in the coalescing table, and its SYN, FIN, RST and URG control flags are all clear. The reason coalescing must start after a host interrupt is delay elimination, which is discussed later. Notably, SSC does not consider the frame length when deciding whether to start coalescing, even if the length exceeds 1 518 bytes, because SSC can combine multiple frames into a jumbo frame.
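The start-of-coalescing decision can be sketched as follows. This is a software model of the hardware logic; the flag constants are the standard TCP bit masks, and the function and table names are our assumptions.

```python
# TCP control-flag bit masks (standard values)
FIN, SYN, RST, URG = 0x01, 0x02, 0x04, 0x20

coalescing_table = set()          # 5-tuples recorded on SYN

def may_coalesce(tuple5, flags, after_host_interrupt: bool) -> bool:
    """Decide whether a received frame may start (or join) a coalescing run."""
    if flags & SYN:
        coalescing_table.add(tuple5)                  # a new connection is being established
        return False                                  # the SYN frame itself is never coalesced
    meets_criteria = (tuple5 in coalescing_table
                      and not flags & (SYN | FIN | RST | URG))
    return after_host_interrupt and meets_criteria    # coalescing only starts after a host interrupt
```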
If an incoming frame satisfies the coalescing criteria, SSC removes its TCP/IP header and stores its payload and relevant information in the NIC. The relevant information includes the TCP sequence number, acknowledgement number, window size and control flags; it is used to recalculate the new TCP/IP header when coalescing is finished, as described later. Similarly, when the next frame arrives and also meets the criteria, its TCP/IP header is stripped, the relevant header information is saved, and its payload is appended to the end of the previous payload. SSC repeats this cycle to achieve coalescing.
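A per-connection context that mirrors this strip-and-append step might be modelled as follows (our naming, not the RTL data structures):

```python
class CoalescingContext:
    """Per-connection coalescing state (a sketch of what SSC records)."""

    def __init__(self):
        self.payload = bytearray()    # merged payload, grown frame by frame
        self.first = None             # header info of the first coalesced segment
        self.last = None              # header info of the most recent segment

    def append(self, seq, ack, window, flags, payload: bytes) -> None:
        info = {"seq": seq, "ack": ack, "window": window, "flags": flags}
        if self.first is None:
            self.first = info
        self.last = info
        self.payload += payload       # attach to the end of the previous payload
```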
Note that SSC only supports consecutive TCP frames: it can only combine TCP payloads that are in sequence into one payload. Hence, an out-of-order TCP frame forces SSC to stop coalescing.
Based on the working principles of the TCP sequence number and acknowledgement number, SSC calculates the sequence number of the next consecutive frame from the current sequence number and the payload length. If the sequence number of the next frame equals the calculated value, the frame is consecutive; otherwise, an out-of-order event has occurred. The sequence number of the next frame is calculated as
NSN = CSN + L_payload = CSN + (L_IP − L_header) = CSN + (L_IP − 40)  (1)
where NSN is the sequence number of the next frame, CSN is the sequence number of the currently received frame, L_payload is the length of the payload, L_IP is the length of the IP packet, and L_header is the total length of the TCP and IP headers.
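Eq.(1) translates directly into a continuity check, for example:

```python
def next_sequence_number(csn: int, ip_total_length: int) -> int:
    """Eq.(1): NSN = CSN + (L_IP - 40), with 20-byte IP and 20-byte TCP headers."""
    return (csn + ip_total_length - 40) & 0xFFFFFFFF   # TCP sequence space wraps at 2^32

def is_consecutive(prev_seq: int, prev_ip_len: int, new_seq: int) -> bool:
    return new_seq == next_sequence_number(prev_seq, prev_ip_len)
```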
In this subsection, we describe the conditions under which SSC stops coalescing and how it recalculates the new header.
First of all, SSC has to limit the combined frame to a configurable maximum frame length (MFL). Because SSC relies on jumbo frames, the maximum jumbo-frame length is the key factor that determines how large the combined frame can be. The maximum jumbo-frame length supported by different vendors varies but is usually greater than 8 000 bytes; MFL is configurable and cannot exceed it. Throughout the coalescing process, SSC therefore checks, when the next frame arrives, whether the current segment size would exceed MFL; if so, SSC does not merge the incoming frame and stops coalescing.
The second constraint is the coalescing threshold. As mentioned before, a host interrupt occurs when the number of used BDs in the receive ring equals the interrupt threshold set in the NIC register. Hence, SSC must make sure that coalescing finishes before the interrupt occurs. For this purpose, SSC defines a coalescing threshold smaller than the interrupt threshold: when the number of used BDs reaches the coalescing threshold, SSC stops coalescing. The specific value of the coalescing threshold should be determined by the actual environment into which SSC is integrated.
The third constraint is similar to the second. To guarantee that coalescing finishes before the interrupt timer expires, SSC defines a shorter timer called the coalescing timer, whose value also needs to be determined for the actual application. Like the interrupt timer, the coalescing timer is reset to its configured value and starts counting down after every host interrupt. When the coalescing timer expires, SSC stops coalescing.
In summary, SSC stops coalescing whenever one of the above criteria is met.
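Taken together, the three stop conditions can be summarized in a single predicate (a sketch; the parameter names are ours):

```python
def must_stop(current_len, next_frame_len, mfl,
              used_bds, coalescing_threshold,
              coalescing_timer_expired) -> bool:
    """SSC stops the current coalescing run when any one constraint is hit."""
    return (current_len + next_frame_len > mfl      # merged frame would exceed MFL
            or used_bds >= coalescing_threshold     # must finish before the interrupt threshold
            or coalescing_timer_expired)            # must finish before the interrupt timer
```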
After coalescing, SSC needs to recalculate a new TCP/IP header to package the result as a complete frame. Under the stated preconditions, the TCP/IP header is 40 bytes in total. Because the coalesced frames come from the same connection, the basic fields of the TCP/IP header, such as the 5-tuple, version, type of service and protocol, remain unchanged, while the remaining fields must be recalculated or modified by SSC, as shown in Fig.3.
Fig.3 Modification of IP, TCP headers
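As a sketch of the header rewrite, assuming the sequence number is taken from the first segment and the acknowledgement number and window from the last (our reading of Fig.3, in line with common receive-offload practice), the rewritten fields and the standard IP checksum recomputation look like this:

```python
def ip_header_checksum(header: bytes) -> int:
    """Standard one's-complement sum over 16-bit words
    (the checksum field must be zeroed before calling)."""
    total = sum(int.from_bytes(header[i:i + 2], "big")
                for i in range(0, len(header), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rebuilt_header_fields(first, last, merged_payload_len):
    """Fields SSC rewrites after coalescing (cf. Fig.3)."""
    return {
        "ip_total_length": 40 + merged_payload_len,  # 20 B IP + 20 B TCP + merged payload
        "seq": first["seq"],                         # assumed: stream offset of the first segment
        "ack": last["ack"],                          # assumed: latest acknowledgement number
        "window": last["window"],                    # assumed: latest advertised window
        # The IP and TCP checksums are then recomputed over the rebuilt frame,
        # e.g. the IP header checksum with ip_header_checksum() above.
    }
```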
In fact, the mechanism for eliminating delay is already implicit in the description above. According to the start and stop criteria, SSC can only start coalescing after a host interrupt and must stop coalescing before the next host interrupt. In other words, SSC only works between two consecutive host interrupts, and this interval is called the coalescing period. Furthermore, to guarantee that the coalesced segments can be packaged as a complete frame in time, SSC performs at most one coalescing per TCP/IP connection within a coalescing period. In this way, SSC guarantees that no extra delay is added in delivering frames to the host, as illustrated in Fig.4, which shows that SSC performs all operations, including coalescing and frame movement, within coalescing periods. Thus SSC is completely transparent to the host, and no delay is introduced into the receiving process.
Fig.4 Illustration of delay elimination
Frames belonging to different connections can be coalesced simultaneously, i.e. SSC supports parallel coalescing. Because TCP connections are independent of each other, one coalescing process does not affect the others running in SSC. In theory, the larger the parallel number, the greater the performance gain from SSC; ideally, all TCP connections would be processed by SSC in the NIC. However, considering hardware complexity and resource consumption, the parallel number is limited. Based on practical requirements, it is recommended to predefine the connections that the user cares about most. This not only bounds the parallel number but also speeds up coalescing by eliminating unnecessary processing and information recording.
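Parallel coalescing can be pictured as a bounded set of independent per-connection contexts (a conceptual sketch; the class and method names are ours):

```python
class ParallelSSC:
    """Independent per-connection coalescing contexts, bounded by the parallel number."""

    def __init__(self, parallel_number: int):
        self.parallel_number = parallel_number
        self.contexts = {}                        # 5-tuple -> list of payloads (or a context object)

    def offer(self, tuple5, payload: bytes) -> bool:
        """Accept a frame into an existing or new context; refuse when no slot is free."""
        if tuple5 not in self.contexts and len(self.contexts) >= self.parallel_number:
            return False                          # no free context: frame is forwarded uncoalesced
        self.contexts.setdefault(tuple5, []).append(payload)
        return True
```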
Based on our previous work, SSC was implemented and integrated into LCE5718, a fully self-designed dual-port gigabit Ethernet controller. LCE5718 integrates dual IEEE 802.3 MACs, dual triple-speed Ethernet transceivers (PHYs), dual SerDes transceivers, NC-SI, PCI-E and on-chip memory in a single chip. It has been implemented in 65 nm CMOS technology and taped out[18].
The defaults of the basic parameters are determined by the design of LCE5718. Specifically, LCE5718 supports jumbo frames as large as 9 KB; there are 256 BDs in a ring, the interrupt threshold is 200, and the interrupt timer is 128 μs. Therefore, after theoretical analysis, the defaults of MFL, coalescing threshold, coalescing timer and parallel number are set to 1 518 bytes, 100, 64 μs and 1, respectively. Note that the default coalescing threshold is half of the interrupt threshold, and the default coalescing timer is half of the interrupt timer. All SSC parameters are configurable via registers. As the design under test (DUT), the LCE5718 with SSC operates at 1 Gbit/s with a 125 MHz system clock. All simulations are conducted in the Cadence NC-SIM environment, and an Ethernet verification IP (VIP) is used to feed the DUT continuously with TCP frames carrying 500-byte payloads.
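For reference, these defaults and their relationships can be collected in one place (the values come from the text above; the key names are ours, not LCE5718 register names):

```python
SSC_DEFAULTS = {
    "jumbo_frame_max_bytes": 9 * 1024,   # LCE5718 jumbo-frame upper bound
    "bd_ring_size": 256,
    "interrupt_threshold": 200,          # used BDs per host interrupt
    "interrupt_timer_us": 128,
    "mfl_bytes": 1518,
    "coalescing_threshold": 100,         # = interrupt_threshold // 2
    "coalescing_timer_us": 64,           # = interrupt_timer_us // 2
    "parallel_number": 1,
}
```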
For the configuration of SSC, the most challenging issue is how large the coalescing threshold and coalescing timer can be while still guaranteeing no delay and coalescing as many frames as possible. In other words, these two parameters are critical for both performance and delay. For better performance, their values should be as large as possible; for the delay constraint, they have upper bounds, and these maximum values are what we care about in the simulations. Hence, two sets of simulations were designed to find their maximum values in our design.
Tab.2 and Tab.3 show the maximum values of the coalescing threshold and coalescing timer under different simulation configurations. Both values decrease as MFL increases, especially when the parallel number is 4, because the larger the frame SSC generates, the more time is needed to transmit it to the host. In this case, SSC has to finish coalescing earlier by decreasing the coalescing threshold and coalescing timer. Moreover, the coalescing threshold is inversely proportional to the parallel number, and so is the coalescing timer; this also results from the longer transfer time.
Tab.2 Simulation results of the maximum coalescing threshold in different situations
Tab.3 Simulation results of the maximum coalescing timer in different situations
The simulation results confirm that choosing half of the interrupt threshold and half of the interrupt timer as the default coalescing threshold and coalescing timer is reasonable. Although performance may not be optimal with the defaults, no delay is guaranteed. Because interrupt moderation is commonly available in current NICs, this default configuration is recommended when the parallel number and MFL are set to their minimum values; otherwise, the defaults should be determined by simulation or calculation to ensure no delay. In general, MFL and the parallel number affect the maximum values of the coalescing threshold and coalescing timer, and they are inversely related.
To evaluate SSC performance, an FPGA prototype platform, which has also been used to verify and evaluate LCE5718, was set up. The platform consists of an FPGA test board and a PC. On the FPGA test board, the LCE5718 NIC with SSC is implemented on a Stratix V FPGA (device: 5SGSMD4E2H29I3). The motherboard is equipped with an Intel i7 processor and 4 Gb DDR3 RAM and runs the Wind River VxWorks operating system. The FPGA is connected to the motherboard via PCI-E, so the FPGA test board can effectively be regarded as an independent computer. The PC acts as the data-transmission terminal. The PC and the FPGA test board are linked by a category-5 UTP cable and work at 1 Gbit/s.
To evaluate the optimal performance of SSC in an ideal situation, where all TCP frames on the receive side are consecutive, Packet Sender, an open-source utility for sending and receiving TCP frames, was used to generate the test frame set[22]. All trials were conducted on the platform specified above.
The first set of experiments investigates the effect of MFL on SSC performance. The test frame set consists of 10 TCP connections, and each connection transmits 2 000 frames with 500-byte TCP payloads. The coalescing threshold and coalescing timer keep their default values. The bars in Fig.5 show the total number of NIC interrupts for the traditional method without SSC and for SSC with parallel numbers of 1, 2 and 4; the x-axis shows MFL ranging from 1 518 bytes to 8 KB. Fig.6 shows the corresponding CPU utilization. Across all conditions, both the interrupt count and CPU utilization are reduced to various degrees when SSC is adopted. When MFL is configured as 2 KB, SSC performance stays at roughly the same level as the normal case, regardless of the parallel number. CPU utilization, however, declines gradually as MFL increases and stabilizes at around 16% beyond 6 KB; the trend is the same for all parallel numbers. This is because, once MFL exceeds 6 KB, the coalescing timer or coalescing threshold becomes the critical factor that stops coalescing in these experiments. Therefore, blindly increasing the frame length does not bring continuous improvements in SSC performance.
Fig.5 Comparison on interrupt number with different MFL
Fig.6 Comparison on CPU utilization with different MFL
In addition, SSC with 4 parallel coalescing processes clearly achieves a greater performance advantage than the other cases under the same conditions, which agrees well with the theoretical analysis. Notably, when the parallel number is 1, the number of interrupts stays roughly constant at around 153, but CPU utilization drops significantly, from 27% to 19% when MFL is configured as 6 KB. This is because SSC reduces the overhead of frame copying and protocol processing even though the interrupt count is unchanged due to interrupt moderation.
Fig.7 CPU utilization with different coalescing threshold in different MFL configuration
The coalescing threshold is a critical SSC parameter, so in the next set of experiments it is varied from 10 to 200 (equal to the interrupt threshold) to evaluate its impact on performance. The same test frame set is used, and the coalescing timer keeps its default value (64 μs). Experiments were carried out with MFL of 2 KB, 4 KB, 6 KB and 8 KB and parallel numbers of 1, 2 and 4. The results in Fig.7 show that CPU utilization decreases as the coalescing threshold increases in all cases, and the trend becomes more pronounced as MFL and the parallel number increase. Specifically, when MFL is 2 KB and the parallel number is 1, CPU utilization drops by only 2% regardless of the coalescing threshold. According to the simulation results, the maximum coalescing threshold in this situation is almost 190; however, SSC performance does not increase linearly with the coalescing threshold but remains stable, as shown in Fig.7. This is because MFL is only 2 KB, so SSC has to stop coalescing even before the coalescing threshold is reached within a coalescing period. In this case, SSC still works normally without delay even if the coalescing threshold exceeds the maximum value obtained from the simulations; the experimental results confirm this, since a comparison of transmitted and received data shows no data loss. The situation with an MFL of 4 KB is similar.
When MFL becomes larger, CPU utilization generally drops significantly as the coalescing threshold increases and then levels off. Notably, CPU utilization keeps decreasing as the coalescing threshold increases in the following three cases: MFL and parallel number of 6 KB and 4, 8 KB and 2, and 8 KB and 4. However, according to the simulation results, the maximum coalescing thresholds for these three cases are 124, 134 and 100, respectively, which means SSC may be working in an invalid configuration when the coalescing threshold exceeds the maximum. A comparison of transmitted and received data shows that some data were indeed lost: when SSC cannot transmit the coalesced frame to the host in time, the frame is discarded. This is why CPU utilization keeps going down when SSC operates with an invalid configuration. It also indicates that the frame size has not reached MFL in these three cases; in other words, the coalescing threshold is the only factor stopping coalescing.
The above analysis shows that when the coalescing threshold is small, it is the key factor determining SSC performance; as the coalescing threshold increases but does not exceed the maximum, MFL may become another factor that terminates coalescing early, thereby affecting SSC performance.
The last set of experiments evaluates the impact of the coalescing timer. The same test frame set is used, and the coalescing threshold keeps its default value (100). The experimental results are shown in Fig.8. In general, whether MFL is 2 KB or 8 KB, CPU utilization decreases slowly as the coalescing timer increases and then remains stable. The declining trend with an 8 KB frame length is more obvious than with 2 KB, because a 2 KB MFL is so small that SSC has to stop coalescing early. Fig.8 shows that once the coalescing timer exceeds 60 μs, it no longer affects SSC in any of the experimental cases, indicating that MFL and the coalescing threshold have become the dominant factors. This is reasonable, because the coalescing timer is designed for special situations such as network faults. Therefore, compared with MFL and the coalescing threshold, the coalescing timer has less influence on SSC performance under normal circumstances.
Fig.8 Comparison on CPU utilization with different coalescing timer when MFL=2 KB and 8 KB
Aiming at accelerating TCP/IP processing on the receive side, a hardware acceleration method called small segment coalescing is proposed in this paper. With SSC, frames of the same TCP connection are coalesced into a larger frame in the NIC to reduce the cost of data copying, interrupt handling and protocol processing. The analysis of collected network frames shows that about half of the frames can be coalesced. SSC was then implemented on the basis of our previous work and simulated with EDA tools to verify its correctness and to find the maximum parameter values in corner cases. Finally, an FPGA prototype platform was set up to conduct experiments in a practical environment. The experimental results show that MFL and the coalescing threshold have direct and obvious effects on CPU utilization and constrain each other; the corresponding simulations and experiments show that their values should be determined according to the specific situation. In the best experimental case, CPU utilization is reduced by 12% when MFL, the coalescing threshold and the parallel number are 6 KB, 120 and 4, respectively. The coalescing timer has less impact on performance in normal network environments.
Future work will focus on handling out-of-order frames during coalescing and on implementing an ASIC based on LCE5718. It is worth extending SSC to handle out-of-order cases, which currently must be processed by the stack, because out-of-order delivery is very common in real networks; SSC with a reordering feature would further improve its performance and adaptability. In addition, to implement the ASIC and make it available to more users, we need to optimize the hardware architecture and design, such as the frame filter method, BD scheduling algorithm and arbiter policy, and perform more detailed and comprehensive verification.