This is an English translation of the research paper "一种高并发网络环境下快速流表查找方法", published in Acta Electronica Sinica 45(4), .
Posted .
Date received: 2015-03-27; date revised: 2016-05-05; managing editor:
孙瑶 (Sun Yao)
Fund project: Strategic Priority Research Program of the Chinese Academy
of Sciences (No. XDA06030200); National Natural Science Foundation of
China (No. 61402474); National 242 Information Security Program
(No. 2015A087); Guangdong Industry-University-Research Cooperation
Project “Guangdong Health Cloud Security Academician Workstation”
(No. 2016B090921001)
In order to improve the lookup speed of flow tables in a high-speed network environment, this paper first analyzes the characteristics of traffic on OC-192 backbone links. Research shows that backbone links not only have high concurrency and high arrival rates, but also have good network locality in an appropriate cache window. Based on these characteristics and the principle of locality, this paper implements a fast flow table lookup method by adding constant auxiliary space overhead on top of a naïve hash table structure. Theoretical analysis and experiments on real network datasets show that this method can reduce flow table search length by 20.2% and flow table access time by 17.1% compared to existing methods.
In high-speed network environments, efficient connection management has become a key module of existing network traffic processing systems (e.g., intrusion detection systems, traffic billing systems, etc.)[1, 2]. As shown in Figure 1, typically, the architecture of a traffic processing system is mainly divided into three modules: traffic acquisition, connection management, and service processing. Connection management provides flow traceability for service processing[3], including the three operations: search, update, and delete. In order to accurately record each connection, the connection management module must maintain a flow table (or session table), in which each flow table entry tracks a connection in the network, and is responsible for recording the connection’s ID, status, and other related information, of which the connections ID is globally unique and is generally made up of a 5-tuple in the TCP/IP header: FID {sip, dip, sport, dport, ip protocol}.
Most of the data structures used in the connection management module are hash tables, and hash collisions are resolved by chaining[4, 6]. This is because hash tables support lookup, insertion, and deletion operations with complexity close to constant time when the total number of flows in the network approximates the length of the table[7]. Existing traffic processing systems use a single-packet scheduling policy. The packets are first cached in the NIC buffer and then sent to the connection management module in order of arrival in the buffer to perform connection status update and maintenance operations. In a high-speed network environment, single-packet processing not only incurs a large amount of function callback overhead, but also leads to performance bottlenecks in flow table access. As the number of concurrent connections increases, the size of the flow table grows. Due to constraints on hardware resources and the limitations of the hash table structure itself, the number of hash table slots needs to be preset, and dynamically adjusting them is very difficult[8]. The growth of collision chains causes the lookup efficiency of single-packet flow tables to drop sharply. In existing high-speed networks with 10 Gbps traffic, the packet rate reaches 10 Mpps or even higher, and the vast majority of packets in the network need to perform a flow table lookup operation. The frequency of flow table lookups is equivalent to the packet arrival rate. The efficiency of flow table lookup has become one of the important performance metrics for traffic processing systems.
According to the 2014 traffic report[9] by the authoritative organization Sandvine, multimedia traffic such as files, audio and video account for more than 70% of traffic in backbone links in the Asia-Pacific region. Meanwhile, in backbone links, TCP traffic accounts for more than 90% of total traffic[10]. The traffic is characterized by high concurrency and slow update, and network traffic is localized to a certain extent[2]. This means that over a period of time, multiple packets from the same connection will arrive one after another.
Based on the traffic characteristics of backbone links, for the problems faced by connection management modules in single-packet processing, this paper firstly proposes a quantitative indicator of traffic locality to quantitatively analyze the locality features of traffic in backbone links. Second, on top of the traditional hash structure, we have added a constant amount of auxiliary overhead space, and designed and implemented a fast flow table lookup (FFTL) method to reduce flow table access overhead. FFTL is executed in three stages: firstly, through an efficient traffic locality polymerization algorithm, the packets in the buffer are grouped by flow ID, so that packets of the same connection are classified into a group; secondly, a threshold scheduling policy is used to schedule the packets that have been grouped; finally, according to the scheduling sequence, each group of packets is sent to the connection management module in bulk to perform operations such as flow table lookup and connection state maintenance. Experimental results show that this method effectively reduces the average search length of flow tables in high-concurrency networks and improves the efficiency of network traffic processing systems.
Current flow table lookup operations are divided into three types of implementations: hash table, Bloom filter[11], and content-addressable memory[12]. For a flow table with a hash table structure, a hash table lookup consists of two steps: calculation of a hash value and comparison of a collision chain. The worst-case performance of handling hash collisions by chaining is poor. When all N keywords are inserted into the same slot, it results in a linked list of length N, which has a worst-case search length of O(N)[7].
Based on this, a lot of work has been focused on making the length of collision chains in each slot as balanced as possible, so as to ensure that the average search length is as close as possible to the best-case O(1 + α), with α being the load factor. To achieve this, a good hash algorithm is required. Reference [4] states that although it is possible to equalize the distribution of the lengths of all collision chains in a hash table with the help of sophisticated cryptographic hashing methods (such as MD5, SHA-1), good hash functions usually consume a large amount of CPU, and even a hardware implementation of the hash operation unit requires 64 clock cycles[13].
Compared with the previously mentioned single hash, the effect of multiple hashing is better[14]. A Bloom filter is a kind of multiple hash, which is usually combined with high-speed hardware (such as TCAM, FPGA[15], etc.) to provide efficient content lookup[16]. References [17] and [18] state that a Bloom filter is commonly used to represent a set to provide member query operations, with probability of error in its query results. Moreover, Bloom filter query results only have two values: yes and no, which does not satisfy the need for connection state maintenance in traffic processing systems. Reference [4] implements a variant of the Bloom filter, FCF (Fingerprint Compressed Filter). Each hash slot of FCF contains d sub-tables of fixed size. Through d mutually independent hash functions, the hash value of a connection is computed d times and it is finally inserted into the shortest of the d sub-tables. However, this makes each lookup search d collision chains, which imposes a large lookup overhead in packet-dense backbone networks, and it uses a single bit for timeout processing, which is a slot-level processing method. Although FCF gains the advantage of fast access to SRAM by compressing the FID, its scalability is limited by the capacity of SRAM.
In terms of optimizing lookup operations with the help of network locality, reference [2] points out through measurement that the backbone network is characterized by high concurrency, with a concurrency of two million, and only about 20,000 new connections appearing per second, which accounts for 1% of total concurrency, and that the backbone network is updated slowly, with a certain degree of locality. Based on this, reference [2] uses FPGA and SRAM to implement a high-speed cache on the flow table to speed up access. Limited by the circuit complexity of FPGA and the capacity of SRAM, it mainly processes the first few bytes of the packet payload and is specifically applied to protocol identification systems. It cannot solve the problem of full packet payload processing in DPI systems. In addition, TCAM (Ternary Content Addressable Memory)[12, 19] is also used to speed up the lookup operation. However, TCAM is too expensive and consumes a lot of electricity, and the size of the flow table is also limited by storage capacity. In a high-concurrency network environment, a large number of active connections will have to be replaced due to traffic fluctuations and bursts, resulting in system detection misses.
In this section, we first briefly introduce classical connection management approachs, giving their basic lookup complexity and representation; then we analyze the locality of real traffic.
A naïve flow table (NFT) is organized using a hash table, as shown in Figure 2. The hash slots of the hash table are implemented using a fixed number of arrays with consecutive addresses, and each slot uses a linked list to handle hash collisions. Each connection is identified with a globally unique flow label, FID, with the packet for flow i labeled as FIDi.
Whenever a packet x arrives, it is processed according to Algorithm 1.
The search method traverses the entire collision chain, comparing the FIDs one by one, which determines the search overhead of the whole process. To illustrate the naïve average search length (NASL) of the NFT, if the number of slots in the NFT is defined as M, and the number of elements that have been inserted into the NFT as N, then the load factor of the NFT at this time is
i.e., the average number of elements in each collision chain. From the proof in reference [7], it can be seen that the expected number of comparisons for a single lookup in the case of uniform hashing is
From this we can determine that the average number of lookups in a flow table is proportional to the number of slots in the hash table and inversely proportional to the total number of elements in the flow table, provided that the hashing is uniform.
Locality means that after one packet arrives for a particular connection, packets for that connection will arrive again in the near future. In naïve connection management, almost every packet requires a traversal of the collision chain. With locality, we can aggregate multiple packets in the same connection and look up the flow table only once, avoiding redundant collision chain traversals. To this end, we studied the locality features of real traffic on a OC-192 link in 2014 (with a 14% link utilization) provided by CAIDA[20], a dataset that has been anonymized.
Figure 3 depicts packet arrivals for the same connection as a percentage of all packets, for a total of about 21 minutes’ worth of traffic, averaged over each minute. It can be seen that due to the high concurrency, packets for different connections are separated from each other, and the locality that can be observed is 8.4% on average.
Based on the traffic characteristics, we designed a locality polymerization (LP) method, which quickly groups packets for the same connection through a cache window, so as to reduce the number of flow table traversals. We first give a quantitative indicator of the degree of locality, defined as follows: Arrange the packets in order of arrival into a sequence S. For any subsequence s of length K:
where ai is the flow label, 1 ≤ i, and j ≤ K.
We divide the sequence s by flow label, and record the total number of divisions as C, indicating that there are C different connections in s. Let
ρ represents the proportion of connections contained in K data packets. Let
μ is the degree of locality of a packet sequence of length K. In order to visualize the degree of locality in the network more intuitively, we measured the degree of locality for different values of K in the range [32, 352], as shown in Figure 4.
As can be seen in Figure 4, the degree of locality of the OC192 backbone link reaches 35% when K is 32. Due to the characteristics of high concurrency and slow update of the backbone network, combined with an appropriate cache window, good locality can indeed be obtained. We designed a locality polymerization (LP) method on top of the naïve flow table based on these characteristics and implemented the fast flow table lookup method on this basis.
In this section, we first describe the specific implementation of the locality polymerization method and the fast lookup method, and discuss and improve the post-polymerization scheduling policy, and then analyze the theoretical time complexity of the fast lookup method.
In order to implement locality polymerization so as to exploit traffic locality, some auxiliary data structures are needed. Based on a naïve hash table, a buffer window and a partition table (PT) to assist in the flow partitioning operation are added, and a queue Q is used to index the groups that have completed the partitioning operation.
As shown in Figure 5, the PT divides the packets for different flows and the size of the index queue Q is set to an amount K equal to the buffer size. Whenever a packet x arrives, it performs an operation as in Algorithm 2.
fid←getFID(x.sip,x.dip,x.sport,x.dport,x.ipproto)
hash_value←hash(fid)
j←hash_value%mod_pt
cached_num←cached_num + 1
if PT[j] = NULL then
insert x into PT[j]
Q.enque(j)
else if PT[j].first.fid = x.fid then
insert x into PT[j]
else
call commit
insert x into PT[j]
Q.enque(j)
end if
if cached_num = K then
call commit
end if
function commit
for all j in Q,do
x←PT[j].first
ptsearch_items(NFT[j],x.fid)
if p = NULL then
p←new flow_items(fid)
Insert(NTF[j],p)
end if
for all x in PT[j],do
update(p,x.state)
end for
PT[j]←NULL
end for
Q.clear()
return TRUE
end functionFrom lines 11 and 16, we see that the commit method will be executed only when the number of cache entries reaches K or when a collision occurs in the PT. Because the value chosen for K is relatively small, and the PT can generally be chosen within the range of 1% to 10% of the real table, the probability of a collision is extremely low, and the utilization of locality will not be affected.
From lines 19–21, we know that for each bucket of PT[j], only the first packet of the PT[j] chain performs the collision chain traversal of the real NFT flow table each time, and after that, the other packets in the PT[j] bucket only need to perform the flow state update in turn. In this case, the time overhead of a single lookup is shared equally among multiple packets, thus reducing the average search length. The hash value calculation and FID extraction have already been calculated in lines 1–2, so there is no need to repeat the calculation.
Further improvements in the scheduling policy: From line 19 of Algorithm 2, we know that every time a commit occurs, all cached connections in the index queue are sent to the connection management module. Due to the unevenness of connection arrivals, some connections with only one cached packet are also committed. It is clear from Equations (3) and (4) in Section 3.2 that reducing the occurrence of this situation can improve the effectiveness of polymerization. As a result, we introduce a threshold clearing policy for the commit method in lines 18–32, prioritizing connections with the number of cached packets reaching the historical average for each commit. The following adjustments are made to the locality polymerization module in Figure 5: ① Add a secondary index queue Q2. When the number of cached packets of a connection in the primary queue Q1 reaches a threshold, the connection will be transferred to the secondary index queue for indexing. ② Modify the unidirectional indexing relationship between Q and PT to a bidirectional indexing relationship. ③ Committing connections in Q2 on a priority basis will result in connections in the lower-priority Q1 not being scheduled for a long period of time. For this reason, a starvation avoidance mechanism is introduced by adding a collision counter C1 and a Q2 scheduling accumulator C2. The adjusted module is shown in Figure 6.
When the polymerization operation collides, C1 increases by 1; when the buffer window is full and C1 is greater than 0, C1 decreases by 1. C2 is used to count the cumulative number of packets committed by Q2: when the buffer is full or when a collision occurs, connections in Q2 are committed first. When the number of packets in Q1 reaches K, or C1 reaches a threshold, or C2 reaches a threshold, connections in Q1 are committed, and at the same time, the counter is cleared. Through this improvement, the polymerization effect can be improved and the scheduling starvation of Q1 can be effectively avoided. The details are shown in Algorithm 3.
fid←getFID(x.sip,x.dip,x.sport,x.dport,x.ipproto)
hash_value←hash(fid)
j←hash_value%mod_pt
cached_num←cached_num + 1
if PT[j] = NULL then
insert x into PT[j]
Q1.enque(j)
else if PT[j].first.fid = x.fid then
insert x into PT[j]
if PT[j].size > delt then
move PT[j] from Q1 to Q2
end if
else
move PT[j] from Q1 to Q2
call commit(Q2)
C1←C1 + 1
insert x into PT[j]
Q1.enque(j)
end if
if cached_num = K then
C1←max(C1 - 1,0)
call commit(Q2)
end if
if Q1.pkt_size = K or C1 > delt or C2 > 10 * Q1.pkt_size then
call commit(Q1)
C1←1
C2←20
end if
function commit
if Q is Q2 then
C2←C2 + Q2.pkt_size
end_if
for all j in Q,do
x←PT[j].first
ptsearch_items(NFT[j],x.fid)
if p = NULL then
p←new flow_items(fid)
Insert(NTF[j],p)
end if
for all x in PT[j],do
update(p,x.state)
end for
PT[j]←NULL
end for
Q.clear()
return TRUE
end functionFollowing the steps in Section 4.1, we divide the computation of the average search length of the fast lookup method (FASL) into two parts; i.e., the average search length of the partition table, PASL, and the average search length of the naïve flow table after locality polymerization, NASL′.
First, we analyze the search length in PT. Since the size of the cache window is K, and the number of different connections in it is C = ρK, after the first packet of each connection is determined, all the other packets of that connection require one FID comparison (corresponding to Steps 2 and 3 in Section 4.1); i.e., the number of times a FID comparison operation occurs in K packets is:
Next, let’s take a look at the average search length in NFT. After polymerization, each connection only needs the packet at the head of the chain to perform the lookup operation (corresponding to step 3 in Section 4.1), so:
Thus, the average search length of K packets using the fast lookup method is:
Finally, we get the relational expression between the average search length and the degree of locality as follows:
From Equation (9), we can see that FASL is always smaller than NASL when NASL is greater than 1; i.e., the fast lookup method does reduce the average search length. Compared with the naïve hash table, the auxiliary space of constant overhead introduced by the fast lookup method is O(K + LT).
In order to verify the effectiveness of FFTL, we use two evaluation metrics—the average search length and the average access time of the flow table—to evaluate the experimental results of FFTL in comparison with the connection management module of the open-source intrusion detection system Snort. We categorize the experimental scenarios according to the traffic source and duration of the dataset as follows.
| Scenario | Source | Average bps | Maximum bps | Duration |
|---|---|---|---|---|
| C1 | OC-192 | 6.55G | 7.2G | 20min |
| C2 | school gateway | 9.2G | 9.4G | 20min |
| C3 | school gateway | 9.1G | 9.6G | 4h |
The system environment used for the experiment is Redhat 6.3, kernel 2.6.32-279, Intel Xeon 8-core 1.8 GHz CPU, 32 GB of RAM, and Intel 82599 10 GbE NIC.
Since the C1 dataset contains only IP and TCP header information and no other payload data, the time dimension cannot be evaluated, so only experiments on the search length were performed.
Figure 7 compares the average search length of the Snort connection management module, FFTL, and FFTL (MFFTL) of the introduced threshold scheduling policy in C1, where the value of K for both FFTL and MFFTL is set to 256. The results show that the fast lookup method can effectively reduce the average search length of the flow table. Figure 8 compares the average search length of MFFTL when K is set to 64, 128, and 256, which decreases to a certain extent as the window increases. Considering that an increase in the value of K will lead to an increase in packet processing latency, the choice of the cache window size should take into account a tradeoff with the system’s tolerance for latency.
In C2 and C3, we conducted comparative experiments in two dimensions, namely the flow table execution time and the flow table access length. In order to use the same experimental parameters across multiple datasets, the window K was uniformly set to 256.
As can be seen from the experimental results in Figures 9 to 12, both FFTL and MFFTL have significant decreases in the two dimensions of flow table access time and search length over both short and long time scales, with FFTL decreasing by about 14.3% in the time dimension and 16.7% in the search length dimension, and MFFTL decreasing by about 17.1% in the time dimension and 20.2% in the search length dimension.
By measuring backbone network traffic and analyzing the locality features of traffic in high-speed backbone links, this paper presents a quantitative index of traffic locality and an efficient traffic locality polymerization method. On this basis, a fast flow table lookup method is designed and implemented, the post-polymerization scheduling policy is improved, and a threshold scheduling policy and a starvation avoidance mechanism are introduced. In addition to the theoretical complexity analysis, an experimental evaluation is conducted on the proposed fast flow table lookup method in terms of flow table access time and flow table search length in different traffic scenarios. The experimental results show that the proposed fast flow table lookup method can improve the efficiency of flow table access in a variety of traffic environments.