This is an English translation of the research paper "一种高并发网络环境下快速流表查找方法", published in Acta Electronica Sinica 45(4), .

Wang2017a.en.html
This HTML file for offline reading
Wang2017a.pdf
Original Chinese PDF
Wang2017a.en.zip
Source code for this HTML version of the English translation

Posted .

Fast Flow Table Lookup for High Concurrency Network

Acta Electronica Sinica
Vol. 45 No. 4
April 2017

  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
  2. National Engineering Laboratory for Information Security Technologies, Beijing 100093, China
  3. University of Chinese Academy of Sciences, Beijing 100049, China
  4. National Computer Network Emergency Response and Coordination Center, Beijing 100029, China
  5. Institute of Electronic and Information Engineering, Dongguan UESTC, Dongguan, Guangdong 523808, China
Keywords
hash table;high-concurrency network;flow management;network locality
CLC category number
TP393
Document code
A
Article ID
0372-2112(2017)04-0974-08
Acta Electronica Sinica URL
http://www.ejournal.org.cn
DOI
10.3969/j.issn.0372-2112.2017.04.029

Date received: 2015-03-27; date revised: 2016-05-05; managing editor: 孙瑶 (Sun Yao)
Fund project: Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06030200); National Natural Science Foundation of China (No. 61402474); National 242 Information Security Program (No. 2015A087); Guangdong Industry-University-Research Cooperation Project “Guangdong Health Cloud Security Academician Workstation” (No. 2016B090921001)

Abstract

In order to improve the lookup speed of flow tables in a high-speed network environment, this paper first analyzes the characteristics of traffic on OC-192 backbone links. Research shows that backbone links not only have high concurrency and high arrival rates, but also have good network locality in an appropriate cache window. Based on these characteristics and the principle of locality, this paper implements a fast flow table lookup method by adding constant auxiliary space overhead on top of a naïve hash table structure. Theoretical analysis and experiments on real network datasets show that this method can reduce flow table search length by 20.2% and flow table access time by 17.1% compared to existing methods.

1 Introduction

In high-speed network environments, efficient connection management has become a key module of existing network traffic processing systems (e.g., intrusion detection systems, traffic billing systems, etc.)[12]. As shown in Figure 1, typically, the architecture of a traffic processing system is mainly divided into three modules: traffic acquisition, connection management, and service processing. Connection management provides flow traceability for service processing[3], including the three operations: search, update, and delete. In order to accurately record each connection, the connection management module must maintain a flow table (or session table), in which each flow table entry tracks a connection in the network, and is responsible for recording the connection’s ID, status, and other related information, of which the connections ID is globally unique and is generally made up of a 5-tuple in the TCP/IP header: FID {sip, dip, sport, dport, ip protocol}.

There are three dashed boxes aligned horizontally, labeled “Traffic acquisition module”, “Connection management module”, and “Service processing module”. The traffic acquisition module contains three boxes: “NIC”, “Buffer”, and “Load balancing”. NIC and Buffer are shaded. The connection management module contains a “Flow table” box with three subsidiary boxes attached: “Find updates”, “LRU replacement”, and “Timeout deletion”. “Flow table” is shaded but the other three boxes are not. The service processing module contains two boxes, “Content specification” and “Intrusion detection”.
Figure 1: Network traffic processing system architecture

Most of the data structures used in the connection management module are hash tables, and hash collisions are resolved by chaining[46]. This is because hash tables support lookup, insertion, and deletion operations with complexity close to constant time when the total number of flows in the network approximates the length of the table[7]. Existing traffic processing systems use a single-packet scheduling policy. The packets are first cached in the NIC buffer and then sent to the connection management module in order of arrival in the buffer to perform connection status update and maintenance operations. In a high-speed network environment, single-packet processing not only incurs a large amount of function callback overhead, but also leads to performance bottlenecks in flow table access. As the number of concurrent connections increases, the size of the flow table grows. Due to constraints on hardware resources and the limitations of the hash table structure itself, the number of hash table slots needs to be preset, and dynamically adjusting them is very difficult[8]. The growth of collision chains causes the lookup efficiency of single-packet flow tables to drop sharply. In existing high-speed networks with 10 Gbps traffic, the packet rate reaches 10 Mpps or even higher, and the vast majority of packets in the network need to perform a flow table lookup operation. The frequency of flow table lookups is equivalent to the packet arrival rate. The efficiency of flow table lookup has become one of the important performance metrics for traffic processing systems.

According to the 2014 traffic report[9] by the authoritative organization Sandvine, multimedia traffic such as files, audio and video account for more than 70% of traffic in backbone links in the Asia-Pacific region. Meanwhile, in backbone links, TCP traffic accounts for more than 90% of total traffic[10]. The traffic is characterized by high concurrency and slow update, and network traffic is localized to a certain extent[2]. This means that over a period of time, multiple packets from the same connection will arrive one after another.

Based on the traffic characteristics of backbone links, for the problems faced by connection management modules in single-packet processing, this paper firstly proposes a quantitative indicator of traffic locality to quantitatively analyze the locality features of traffic in backbone links. Second, on top of the traditional hash structure, we have added a constant amount of auxiliary overhead space, and designed and implemented a fast flow table lookup (FFTL) method to reduce flow table access overhead. FFTL is executed in three stages: firstly, through an efficient traffic locality polymerization algorithm, the packets in the buffer are grouped by flow ID, so that packets of the same connection are classified into a group; secondly, a threshold scheduling policy is used to schedule the packets that have been grouped; finally, according to the scheduling sequence, each group of packets is sent to the connection management module in bulk to perform operations such as flow table lookup and connection state maintenance. Experimental results show that this method effectively reduces the average search length of flow tables in high-concurrency networks and improves the efficiency of network traffic processing systems.

2 Related work

Current flow table lookup operations are divided into three types of implementations: hash table, Bloom filter[11], and content-addressable memory[12]. For a flow table with a hash table structure, a hash table lookup consists of two steps: calculation of a hash value and comparison of a collision chain. The worst-case performance of handling hash collisions by chaining is poor. When all N keywords are inserted into the same slot, it results in a linked list of length N, which has a worst-case search length of O(N)[7].

Based on this, a lot of work has been focused on making the length of collision chains in each slot as balanced as possible, so as to ensure that the average search length is as close as possible to the best-case O(1 + α), with α being the load factor. To achieve this, a good hash algorithm is required. Reference [4] states that although it is possible to equalize the distribution of the lengths of all collision chains in a hash table with the help of sophisticated cryptographic hashing methods (such as MD5, SHA-1), good hash functions usually consume a large amount of CPU, and even a hardware implementation of the hash operation unit requires 64 clock cycles[13].

Compared with the previously mentioned single hash, the effect of multiple hashing is better[14]. A Bloom filter is a kind of multiple hash, which is usually combined with high-speed hardware (such as TCAM, FPGA[15], etc.) to provide efficient content lookup[16]. References [17] and [18] state that a Bloom filter is commonly used to represent a set to provide member query operations, with probability of error in its query results. Moreover, Bloom filter query results only have two values: yes and no, which does not satisfy the need for connection state maintenance in traffic processing systems. Reference [4] implements a variant of the Bloom filter, FCF (Fingerprint Compressed Filter). Each hash slot of FCF contains d sub-tables of fixed size. Through d mutually independent hash functions, the hash value of a connection is computed d times and it is finally inserted into the shortest of the d sub-tables. However, this makes each lookup search d collision chains, which imposes a large lookup overhead in packet-dense backbone networks, and it uses a single bit for timeout processing, which is a slot-level processing method. Although FCF gains the advantage of fast access to SRAM by compressing the FID, its scalability is limited by the capacity of SRAM.

In terms of optimizing lookup operations with the help of network locality, reference [2] points out through measurement that the backbone network is characterized by high concurrency, with a concurrency of two million, and only about 20,000 new connections appearing per second, which accounts for 1% of total concurrency, and that the backbone network is updated slowly, with a certain degree of locality. Based on this, reference [2] uses FPGA and SRAM to implement a high-speed cache on the flow table to speed up access. Limited by the circuit complexity of FPGA and the capacity of SRAM, it mainly processes the first few bytes of the packet payload and is specifically applied to protocol identification systems. It cannot solve the problem of full packet payload processing in DPI systems. In addition, TCAM (Ternary Content Addressable Memory)[1219] is also used to speed up the lookup operation. However, TCAM is too expensive and consumes a lot of electricity, and the size of the flow table is also limited by storage capacity. In a high-concurrency network environment, a large number of active connections will have to be replaced due to traffic fluctuations and bursts, resulting in system detection misses.

3 Traditional connection management approach and analysis of traffic characteristics

In this section, we first briefly introduce classical connection management approachs, giving their basic lookup complexity and representation; then we analyze the locality of real traffic.

3.1 Naïve connection management approach

A naïve flow table (NFT) is organized using a hash table, as shown in Figure 2. The hash slots of the hash table are implemented using a fixed number of arrays with consecutive addresses, and each slot uses a linked list to handle hash collisions. Each connection is identified with a globally unique flow label, FID, with the packet for flow i labeled as FIDi.

A large dashed blue box encloses other boxes. At the top, a horizontal row of square boxes headed by a dashed orange box “Pkt Buffer:”; the contents of the boxes are “fid4”, “fid76”, “fid32”, “fid1”, blank, blank. Below that, a similar row of horizontal boxes headed by a dashed orange “Bucket”: the contents of the boxes are “1”, “2”, “3”, “…”, “M−2”, “M−1”, “M”. Hanging down from each of the Bucket boxes are vertical columns of boxes representing linked lists: {32, 4, NIL}, {NIL}, {NIL}, …, {NIL}, {76, NIL}, {1, NIL}. The bottom of the first and last two columns is marked by a dashed orange box: CL_{1}, CL_{M−1}, CL_{M}. In the lower left corner there is a dashed gray box with the label “NFT”.
Figure 2: Structure of traditional connection management

Whenever a packet x arrives, it is processed according to Algorithm 1.

Algorithm 1: Naïve flow table lookup
fid←getFID(x.sip,x.dip,x.sport,x.dport.x.ipproto)
hash_value←hash(fid)
j←hash_value%mod
if NFT[j] = NULL then
    p←new flow_items(fid)
    insert(NFT[j],p)
else
    p←search_items(NFT[j],fid)
end if
if p = NULL then
    p←new flow_items(fid)
    insert(NTF[j],p)
end if
update(p,x.state)
return TRUE

The search method traverses the entire collision chain, comparing the FIDs one by one, which determines the search overhead of the whole process. To illustrate the naïve average search length (NASL) of the NFT, if the number of slots in the NFT is defined as M, and the number of elements that have been inserted into the NFT as N, then the load factor of the NFT at this time is

(1)
α=NM\alpha = \frac{N}{M}

i.e., the average number of elements in each collision chain. From the proof in reference [7], it can be seen that the expected number of comparisons for a single lookup in the case of uniform hashing is

(2)
𝑁𝐴𝑆𝐿=1+α2α2N\textit{NASL} = 1 + \frac{\alpha}{2} - \frac{\alpha}{2N}

From this we can determine that the average number of lookups in a flow table is proportional to the number of slots in the hash table and inversely proportional to the total number of elements in the flow table, provided that the hashing is uniform.

3.2 Locality features of backbone network traffic

Locality means that after one packet arrives for a particular connection, packets for that connection will arrive again in the near future. In naïve connection management, almost every packet requires a traversal of the collision chain. With locality, we can aggregate multiple packets in the same connection and look up the flow table only once, avoiding redundant collision chain traversals. To this end, we studied the locality features of real traffic on a OC-192 link in 2014 (with a 14% link utilization) provided by CAIDA[20], a dataset that has been anonymized.

Figure 3 depicts packet arrivals for the same connection as a percentage of all packets, for a total of about 21 minutes’ worth of traffic, averaged over each minute. It can be seen that due to the high concurrency, packets for different connections are separated from each other, and the locality that can be observed is 8.4% on average.

Based on the traffic characteristics, we designed a locality polymerization (LP) method, which quickly groups packets for the same connection through a cache window, so as to reduce the number of flow table traversals. We first give a quantitative indicator of the degree of locality, defined as follows: Arrange the packets in order of arrival into a sequence S. For any subsequence s of length K:

s:{xa1,xa2,,xai,xaj,,xak},s:\{x_{a1},x_{a2},\cdots,x_{ai},x_{aj},\cdots,x_{ak}\},

where ai is the flow label, 1 ≤ i, and j ≤ K.

A line graph. The x-axis is “Time (min)” and goes from 0 to 21. The y-axis is “Propertion of consecutive arrivals of data packet in the same flow” and ranges from 0.05 to 0.10. There is a single time series, “One-minute average” with 22 data points, at 1-minute intervals. The values fluctuate but are always between 0.08 and 0.09. There is an additional thin horizontal line at 0.084, the average.
Figure 3: Locality of the OC-192 link

We divide the sequence s by flow label, and record the total number of divisions as C, indicating that there are C different connections in s. Let

(3)
ρ=CK; it is clear that ρ[CK,1]\rho = \frac{C}{K};\textrm{ it is clear that } \rho \in \left[\frac{C}{K},1\right]

ρ represents the proportion of connections contained in K data packets. Let

(4)
μ=1ρ\mu = 1 - \rho

μ is the degree of locality of a packet sequence of length K. In order to visualize the degree of locality in the network more intuitively, we measured the degree of locality for different values of K in the range [32, 352], as shown in Figure 4.

A line graph. The x-axis is “Cache window K” and goes from 32 to 352. The y-axis is “Degree of locality μ” and goes from 0.38 to 0.54. There is a single time series with points at 32-packet intervals. The series starts at the value 0.38 and smoothly plateaus to about 0.48.
Figure 4: Degree of locality for different values of K

As can be seen in Figure 4, the degree of locality of the OC192 backbone link reaches 35% when K is 32. Due to the characteristics of high concurrency and slow update of the backbone network, combined with an appropriate cache window, good locality can indeed be obtained. We designed a locality polymerization (LP) method on top of the naïve flow table based on these characteristics and implemented the fast flow table lookup method on this basis.

4 Locality polymerization and fast lookup

In this section, we first describe the specific implementation of the locality polymerization method and the fast lookup method, and discuss and improve the post-polymerization scheduling policy, and then analyze the theoretical time complexity of the fast lookup method.

4.1 Locality polymerization and scheduling policy

In order to implement locality polymerization so as to exploit traffic locality, some auxiliary data structures are needed. Based on a naïve hash table, a buffer window and a partition table (PT) to assist in the flow partitioning operation are added, and a queue Q is used to index the groups that have completed the partitioning operation.

As shown in Figure 5, the PT divides the packets for different flows and the size of the index queue Q is set to an amount K equal to the buffer size. Whenever a packet x arrives, it performs an operation as in Algorithm 2.

Algorithm 2: Locality polymerization and fast lookup
fid←getFID(x.sip,x.dip,x.sport,x.dport,x.ipproto)
hash_value←hash(fid)
j←hash_value%mod_pt
cached_num←cached_num + 1
if PT[j] = NULL then
    insert x into PT[j]
    Q.enque(j)
else if PT[j].first.fid = x.fid then
    insert x into PT[j]
else
    call commit
    insert x into PT[j]
    Q.enque(j)
end if
if cached_num = K then
    call commit
end if
function commit
    for all j in Q,do
        x←PT[j].first
        ptsearch_items(NFT[j],x.fid)
        if p = NULL then
            p←new flow_items(fid)
            Insert(NTF[j],p)
        end if
        for all x in PT[j],do
            update(p,x.state)
        end for
        PT[j]←NULL
    end for
    Q.clear()
    return TRUE
end function
A block diagram in the style of Figure 2. A large dashed box encloses everything. Inside is divided into three sections. The top section is “Pkt Buffer”, an array of squares “fid1”, “fid4”, “fid2”, “fid4”, “fid1”. The middle section is “PT”. There is a horizontal row of 6 squares, with vertical columns branching upward from some of the squares: {4, 4, NIL}, {}, {1, 1, 1, NIL}, {}, {2, NIL}, {}. Off to the left, there is a linear structure of 4 squares labeled “Q1”. The top square is labeled “Tail” and the bottom “Head”. The contents of the Q1 list, from head to tail, are: “1”, “2”, “4”, blank. The “1”, “2”, “4” entries of Q1 have one-way arrows coming out that point to the corresponding columns of PT (“4” points to the {4, 4, NIL} column, etc.). The bottom section of the figure is “Bucket”. There’s a horizontal array of squares “1”, “2”, …, “M−2”, “M−1”, “M”. Hanging down from each of the Bucket boxes are vertical columns of boxes representing linked lists: {4, NIL}, {NIL}, … {2, NIL}, {NIL}, {1, NIL}. The bottom of the first and last two columns is marked by a dashed box: CL_{1}, CL_{M−1}, CL_{M}. In the lower left corner there is a dashed black box with the label “FFT”.
Figure 5: Hash table structure for implementation of locality polymerization

From lines 11 and 16, we see that the commit method will be executed only when the number of cache entries reaches K or when a collision occurs in the PT. Because the value chosen for K is relatively small, and the PT can generally be chosen within the range of 1% to 10% of the real table, the probability of a collision is extremely low, and the utilization of locality will not be affected.

From lines 19–21, we know that for each bucket of PT[j], only the first packet of the PT[j] chain performs the collision chain traversal of the real NFT flow table each time, and after that, the other packets in the PT[j] bucket only need to perform the flow state update in turn. In this case, the time overhead of a single lookup is shared equally among multiple packets, thus reducing the average search length. The hash value calculation and FID extraction have already been calculated in lines 1–2, so there is no need to repeat the calculation.

Further improvements in the scheduling policy: From line 19 of Algorithm 2, we know that every time a commit occurs, all cached connections in the index queue are sent to the connection management module. Due to the unevenness of connection arrivals, some connections with only one cached packet are also committed. It is clear from Equations (3) and (4) in Section 3.2 that reducing the occurrence of this situation can improve the effectiveness of polymerization. As a result, we introduce a threshold clearing policy for the commit method in lines 18–32, prioritizing connections with the number of cached packets reaching the historical average for each commit. The following adjustments are made to the locality polymerization module in Figure 5: ① Add a secondary index queue Q2. When the number of cached packets of a connection in the primary queue Q1 reaches a threshold, the connection will be transferred to the secondary index queue for indexing. ② Modify the unidirectional indexing relationship between Q and PT to a bidirectional indexing relationship. ③ Committing connections in Q2 on a priority basis will result in connections in the lower-priority Q1 not being scheduled for a long period of time. For this reason, a starvation avoidance mechanism is introduced by adding a collision counter C1 and a Q2 scheduling accumulator C2. The adjusted module is shown in Figure 6.

A block diagram in the style of Figure 2 and Figure 5. A large dashed box encloses everything. Inside is divided into two sections. The top section is “Pkt Buffer”, an array of squares “fid1”, “fid4”, “fid2”, “fid4”, “fid1”. The bottom sectino is complicated. In the middle is a hash table, a horizontal row of squares with vertical columns of squares stemming off upward of them. The contents of the horizontal row are “Q1:2”, “Q2:2”, “Q2:1”, blank, “Q1:1”. The corresponding vertical columns are {4, 4, NIL}, {3, 3, 3, NIL}, {1, 1, 1, NIL}, {}, {2, NIL}. To the left of the hash table is a vertical column of squares labeled “Q1”. The top square is labeled “Tail” and the bottom one “Head”. The contents of the Q1 list, from head to talk, are “PT:5”, “PT:1”, blank, blank. The “PT:5” square has a bidirectional arrow that connects it to the “Q1:1” bucket of the hash table; similarly “PT:1” is connected to “Q1:2”. To the right of the hash table is “Q2”, a symmetrical mirror of Q1. Its contents, from head to tail, are “PT:3”, “PT:2”, blank, blank. The “PT:3” square is connected with a bidirectional arrow to “Q2:1” in the hash table; similarly “PT:2” is connected to “Q2:2”.
Figure 6: Locality polymerization structure with a secondary index queue

When the polymerization operation collides, C1 increases by 1; when the buffer window is full and C1 is greater than 0, C1 decreases by 1. C2 is used to count the cumulative number of packets committed by Q2: when the buffer is full or when a collision occurs, connections in Q2 are committed first. When the number of packets in Q1 reaches K, or C1 reaches a threshold, or C2 reaches a threshold, connections in Q1 are committed, and at the same time, the counter is cleared. Through this improvement, the polymerization effect can be improved and the scheduling starvation of Q1 can be effectively avoided. The details are shown in Algorithm 3.

Algorithm 3: Improved locality polymerization and fast lookup
fid←getFID(x.sip,x.dip,x.sport,x.dport,x.ipproto)
hash_value←hash(fid)
j←hash_value%mod_pt
cached_num←cached_num + 1
if PT[j] = NULL then
    insert x into PT[j]
    Q1.enque(j)
else if PT[j].first.fid = x.fid then
    insert x into PT[j]
    if PT[j].size > delt then
        move PT[j] from Q1 to Q2
    end if
else
    move PT[j] from Q1 to Q2
    call commit(Q2)
    C1←C1 + 1
    insert x into PT[j]
    Q1.enque(j)
end if
if cached_num = K then
    C1←max(C1 - 1,0)
    call commit(Q2)
end if
if Q1.pkt_size = K or C1 > delt or C2 > 10 * Q1.pkt_size then
    call commit(Q1)
    C1←1
    C2←20
end if
function commit
    if Q is Q2 then
        C2←C2 + Q2.pkt_size
    end_if
    for all j in Q,do
        x←PT[j].first
        ptsearch_items(NFT[j],x.fid)
        if p = NULL then
            p←new flow_items(fid)
            Insert(NTF[j],p)
        end if
        for all x in PT[j],do
            update(p,x.state)
        end for
        PT[j]←NULL
    end for
    Q.clear()
    return TRUE
end function

4.2 Complexity analysis of the fast lookup method

Following the steps in Section 4.1, we divide the computation of the average search length of the fast lookup method (FASL) into two parts; i.e., the average search length of the partition table, PASL, and the average search length of the naïve flow table after locality polymerization, NASL′.

(5)
FASL=PASL+NASL\textrm{FASL} = \textrm{PASL} + \textrm{NASL}'

First, we analyze the search length in PT. Since the size of the cache window is K, and the number of different connections in it is C = ρK, after the first packet of each connection is determined, all the other packets of that connection require one FID comparison (corresponding to Steps 2 and 3 in Section 4.1); i.e., the number of times a FID comparison operation occurs in K packets is:

(6)
PASL=KCK\textrm{PASL} = \frac{K - C}{K}

Next, let’s take a look at the average search length in NFT. After polymerization, each connection only needs the packet at the head of the chain to perform the lookup operation (corresponding to step 3 in Section 4.1), so:

(7)
NASL=C*NASLK\textrm{NASL}'' = \frac{C * \textrm{NASL}}{K}

Thus, the average search length of K packets using the fast lookup method is:

(8)
FASL=KCK+C*NASLK\textrm{FASL} = \frac{K - C}{K} + \frac{C * \textrm{NASL}}{K}

Finally, we get the relational expression between the average search length and the degree of locality as follows:

(9)
FASL=NASLμ*(NASL1)\textrm{FASL} = \textrm{NASL} - \mu * (\textrm{NASL} - 1)

From Equation (9), we can see that FASL is always smaller than NASL when NASL is greater than 1; i.e., the fast lookup method does reduce the average search length. Compared with the naïve hash table, the auxiliary space of constant overhead introduced by the fast lookup method is O(K + LT).

5 Experimental results and analysis

In order to verify the effectiveness of FFTL, we use two evaluation metrics—the average search length and the average access time of the flow table—to evaluate the experimental results of FFTL in comparison with the connection management module of the open-source intrusion detection system Snort. We categorize the experimental scenarios according to the traffic source and duration of the dataset as follows.

Table 1: Summary characteristics of the dataset
Scenario Source Average bps Maximum bps Duration
C1 OC-192 6.55G 7.2G 20min
C2 school gateway 9.2G 9.4G 20min
C3 school gateway 9.1G 9.6G 4h

The system environment used for the experiment is Redhat 6.3, kernel 2.6.32-279, Intel Xeon 8-core 1.8 GHz CPU, 32 GB of RAM, and Intel 82599 10 GbE NIC.

Scenario 1

Since the C1 dataset contains only IP and TCP header information and no other payload data, the time dimension cannot be evaluated, so only experiments on the search length were performed.

Figure 7 compares the average search length of the Snort connection management module, FFTL, and FFTL (MFFTL) of the introduced threshold scheduling policy in C1, where the value of K for both FFTL and MFFTL is set to 256. The results show that the fast lookup method can effectively reduce the average search length of the flow table. Figure 8 compares the average search length of MFFTL when K is set to 64, 128, and 256, which decreases to a certain extent as the window increases. Considering that an increase in the value of K will lead to an increase in packet processing latency, the choice of the cache window size should take into account a tradeoff with the system’s tolerance for latency.

The graph is difficult to interpret, but it looks like a line plot with overlaid points indicating moving averages. The x-axis is “Time (s)” and the y-axis is “Average flow table search length”. There are three time series indicated by colored lines: “Snort”, “FFTL”, and “MFFTL”. Each line oscillates at a high rate, looking like a comb and causing moire interference with the pixel grid. Besides the lines, there are also three corresponding series of points: “MV6(Snort)”, “MV6(FFTL)”, “MV6(MFFTL)”. The “MV6” notation is not explained, but it is evidently a moving average, as the points remain near the midpoint of the lines’ osciallations. The Snort series smoothly increase and reach a maximum maximum search length of about 1.95 once the time exceeds 200 s. The shape of the FFTL and MFFTL is similar, but they reach a maximum of only about 1.4.
Figure 7: Flow table search length in the C1 scenario
The graph is difficult to interpret, but it looks like a line plot with overlaid points indicating moving averages. The x-axis is “Time (s)” and the y-axis is “Average flow table search length”. There are three time series indicated by colored lines: “K=64”, “K=128”, and “K=256”. Each line oscillates at a high rate, looking like a comb and causing moire interference with the pixel grid. Besides the lines, there are also three corresponding series of points: “MV6(K=64)”, “MV6(K=128)”, “MV6(K=256)”. The “MV6” notation is not explained, but it is evidently a moving average, as the points remain near the midpoint of the lines’ osciallations. There is substantial overplotting of the lines, but from the moving averages it is clear that there is an ordering between MV6(K=64), MV6(K=128), and MV6(K=256). MV6(K=64) tops out at a search length of about 1.45; MV6(K=128) at 1.41; and MV6(K=256) at 1.39.
Figure 8: Performance of MFFTL with different K values in the C1 scenario

Scenarios 2 and 3

In C2 and C3, we conducted comparative experiments in two dimensions, namely the flow table execution time and the flow table access length. In order to use the same experimental parameters across multiple datasets, the window K was uniformly set to 256.

As can be seen from the experimental results in Figures 9 to 12, both FFTL and MFFTL have significant decreases in the two dimensions of flow table access time and search length over both short and long time scales, with FFTL decreasing by about 14.3% in the time dimension and 16.7% in the search length dimension, and MFFTL decreasing by about 17.1% in the time dimension and 20.2% in the search length dimension.

A line graph with overlaid points, as in Figure 7, with the same set of sources: “Snort”, “FFTL”, “MFFTL” and their corresponding (presumably) moving averages “MV6(Snort)”, “MV6(FFTL)”, “MV6(MFFTL)”. The x-axis is “Time (s)” and goes from 0 to 1200; the y-axis is “Average flow table search length” and goes from 0.9 to 1.6. The lines oscillate even more wildly than in Figure 7, but the moving averages form nearly smooth curves. Snort increases from 1.0 at time 0 to about 1.4 at time 200, then slowly increase to a maximum of about 1.5. The other two have a similar shape. FFTL gets to a maximum of about 1.25, and MFFTL gets to about 1.2.
Figure 9: Comparison of search lengths in the C2 scenario
A line graph with overlaid points, as in Figures 7 and 9, with the same set of sources: “Snort”, “FFTL”, “MFFTL” and their corresponding (presumably) moving averages “MV6(Snort)”, “MV6(FFTL)”, “MV6(MFFTL)”. The x-axis is “Time (s)” and goes from 0 to 1200; the y-axis is “Average flow table access time (CPU cycles)” and goes from 2600 to 4000. The lines oscillate even more wildly than in Figure 7, but the moving averages form nearly horizontal lines, after wobbling a bit below time 200. Snort is at about 3500 CPU cycles; FFTL at 3000; and MFFTL at 2900. The numeric tick labels on the y-axis are not perfectly aligned with the ticks themselves, so the values are somewhat questionable.
Figure 10: Comparison of flow table access times in the C2 scenario
A line graph with three time series: “Snort”, “FFTL”, “MFFTL”. Unlike Figures 7–10, there are no overlaid “MV6” moving averages; the lines themselves oscillate only in very narrow ranges and are easy to distinguish from one another. The x-axis is “Time (10s)” and goes from 0 to 14000 (the ticks are labeled 0, 200, 400, … 1400, but they are presumably meant to be multiplied by 10); the y-axis is “Average flow table search length” and goes from 1.0 to 1.6. Snort increases to about 1.5 above time 500, then slowly increases further until time 6000 up to about 1.55. The other two form nearly horizontal lines, FFTL at 1.26, and MFFTL just below it at 1.22 (decreasing slowly to 1.2 at the far right end of the graph).
Figure 11: Comparison of search lengths in the C3 scenario
A line graph with three time series: “Snort”, “FFTL”, “MFFTL”. Unlike Figures 7–10, there are no overlaid “MV6” moving averages; the lines themselves oscillate only in very narrow ranges and are easy to distinguish from one another. The x-axis is “Time (10s)” and goes from 0 to 14000 (the ticks are labeled 0, 200, 400, … 1400, but they are presumably meant to be multiplied by 10); the y-axis is “Average flow table access time (CPU cycles)” and goes from 2600 to 3600. The shapes of the series are very similar to Figure 10. Snort increases to about 400 above time 500, then slowly increases further until time 6000 up to about 1.55. The other two form nearly horizontal lines, FFTL at 3200, and MFFTL just below it at 3150 (decreasing slowly to 3100 at the far right end of the graph). The numeric tick labels on the y-axis are not perfectly aligned with the ticks themselves, so the values are somewhat questionable.
Figure 12: Comparison of flow table access times in the C3 scenario

6 Conclusion

By measuring backbone network traffic and analyzing the locality features of traffic in high-speed backbone links, this paper presents a quantitative index of traffic locality and an efficient traffic locality polymerization method. On this basis, a fast flow table lookup method is designed and implemented, the post-polymerization scheduling policy is improved, and a threshold scheduling policy and a starvation avoidance mechanism are introduced. In addition to the theoretical complexity analysis, an experimental evaluation is conducted on the proposed fast flow table lookup method in terms of flow table access time and flow table search length in different traffic scenarios. The experimental results show that the proposed fast flow table lookup method can improve the efficiency of flow table access in a variety of traffic environments.

References

[1]
NAM G, PATANKAR P, KESIDIS G, et al. Mass purging of stale tcp flows in per-flow monitoring systems [A]. Proceedings of 18th Internatonal Conference on Computer Communications and Networks [C]. San Francisco: IEEE, 2009. 1–6.
[2]
PAN T, GUO X, ZHANG C, et al. Tracking millions of flows in high speed networks for application identification [A]. Proceedings of 31st Annual IEEE International Conference on Computer Communications [C]. Orlando: IEEE, 2012. 1647–1655.
[3]
NOURELDIEN N A, OSMAN I M. A stateful inspection module architecture [A]. TENCON 2000: Proceedings of Intelligent Systems and Technologies for the New Millennium [C]. Kuala Lumpur: IEEE, 2000. 259–265.
[4]
DHARMAPURIKAR S, SONG H, TURNER J, et al. Fast packet classification using bloom filters [A]. Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems [C]. California: IEEE, 2006:61–70.
[5]
Snort-an open source network intrusion prevention system [OL]. http://www.snort.org, 2013-12-30/2014-09-13.
[6]
An Implementation of an E-component of Network Intrusion Detection System [OL]. https://sourceforge.net/projects/libnids/files/, 2010-03-14/2015-03-02.
[7]
CORMEN T H, LEISERSON C E, RIVEST R L, et al. Introduction to Algorithms [M]. Cambridge: MIT Press, 2009. 253–285.
[8]
NAM G, PATANKAR P, LIM S H, et al. Clock-like flow replacement schemes for resilient flow monitoring [A]. The 29th Int’l Conference on Distributed Computing Systems [C]. Quebec: IEEE, 2009. 129–136.
[9]
Sandvine-Intelligent Broadband Networks [OL]. https://www.sandvine.com/downloads/general/global-internet-phenomena/2014/2h-2014-global-internet-phenomena-report.pdf, 2014-11-21/2015-02-26.
[10]
张广兴,等.Internet 城域出口链路流量测量与特征分析[J].电子学报,2007,35(11):2092-2097.
ZHANG G X, et al. Internet traffic measurement and characteristic analysis on output link of metro area network [J]. Acta Electronica Sinica, 2007, 35(11):2092–2097. (in Chinese)
[11]
BONOMI F, MITZENMACHER M, PANIGRAH R, et al. Beyond Bloom filters: om approximate membership checks to approximate state machines [J]. SIGCOMM06, 2006, 36(4):315–326.
[12]
MEINERS C R, LIU A X, TORNG E. TCAM razor: a systematic approach towards minimizing packet classifiers in TCAMs [J]. IEEE/ACM Transactions on Networking, 2010, 18(2):490–500.
[13]
SONG H, DHARMAPURIKAR S, TURNER J, et al. Fast hash table lookup using extended bloom filter: an aid to network processing [J]. ACM SIGCOMM Computer Communication Review, 2005, 35(4):181–192.
[14]
AYUSO PN. Netfilter’s connection tracking system [J]. Login: the Magazine of USENIX & SAGE, 2006, 31(3):34–39.
[15]
YOON S, KIM B, OH J, et al. High Performance Session State Management Scheme for Stateful Packet Inspection [M]. Berlin: Springer Berlin Heidelberg, 2007. 591–594.
[16]
周舟,等.一种基于并行 Bloom Filter 的高速 URL 查找算法[J].电子学报,2015,43(9):1822-1840.
ZHOU Z, et al. Fast URL lookup using parallel bloom filters [J]. Acta Electronica Sinica, 2015, 43(9):1822–1840. (in Chinese)
[17]
谢鲲,等.布鲁姆过滤器代数运算探讨[J].电子学报,2008,36(5):869–874.
IE K, et al. Algebraic operations on bloom filters [J]. Acta Electronica Sinica, 2008, 36(5):869–874. (in Chinese)
[18]
王洪波,等.高速网络超连接主机检测中的流抽样算法研究[J].电子学报,2008,36(4):809-818.
WANG H B, et al. On flow sampling for identifying super-connection hosts in high speed networks [J]. Acta Electronica Sinica, 2008, 36(4):809–818. (in Chinese)
[19]
LAKSHMINARAYANAN K, RANGARAJAN A, VENKATACHARY S. Algorithms for advanced packet classification with ternary CAMs [J]. Acm Sigcomm Computer Communication Review, 2005, 35(4):193–204.
[20]
Center for Applied Internet Data Analysis [OL]. http://www.caida.org/data/passive/passive 2014 dataset.xml, 2015-01-20/2015-02-26.

Author bios

王鹏 (Wang Peng), male, born 1990, from Heze, Shandong, master’s degree candidate. Main research areas include cybersecurity and high-performance networks.

张良 (Zhang Liang), male, born 1980, from Dandong, Liaoning, PhD. Main research areas include computer system architecture and microprocessor design and verification.

周舟 (Zhou Zhou) (corresponding author), male, born 1983, from Changsha, Hunan, PhD. Main research areas include cybersecurity and high-performance networks.
E-mail: zhouzhou@iie.ac.cn

刘庆云 (Liu Qingyun), male, born 1980, from Hengshui, Hebei, PhD. Main research areas includes cybersecurity.

方滨兴 (Fang Binxing), male, born 1960, from Harbin, Heilongjiang, PhD. Main research areas includes cybersecurity.