2020__NSDI__AccelTCP Accelerating Network Applications with Stateful TCP Offloading

## Abstract

The performance of modern key-value servers and layer-7 load balancers often depends heavily on the efficiency of the underlying TCP stack. Despite numerous optimizations such as kernel bypass and zero copying, the performance improvement achievable with a TCP stack is fundamentally limited by the protocol conformance overhead of compatible TCP operations. Unfortunately, this overhead accounts for as much as 60% of all CPU cycles for short-lived connections, and it degrades the performance of L7 proxying by 3.2x to 6.3x.

This work presents AccelTCP, a hardware-assisted TCP stack architecture that harnesses programmable network interface cards (NICs) as a TCP protocol accelerator. AccelTCP can offload complex TCP operations such as connection setup and teardown completely to the NIC, which simplifies the host stack operations and frees a significant amount of CPU cycles for application processing. In addition, it supports running connection splicing on the NIC so that the NIC relays all packets of the spliced connections with zero DMA overhead. Our evaluation shows that AccelTCP enables short-lived connections to perform comparably to persistent connections. It also improves the performance of Redis, a popular in-memory key-value store, and HAProxy, a widely used layer-7 load balancer, by 2.3x and 11.9x, respectively.

## Introduction

Transmission Control Protocol (TCP) [24] is undeniably the most popular protocol in modern data networking. It guarantees reliable data transfer between two endpoints without overwhelming either endpoint or the network itself. It has become ubiquitous because it only requires the Internet Protocol (IP) [23], which operates on almost every physical network.

Ensuring the desirable properties of TCP, however, often entails a severe performance penalty. This is especially pronounced given the recent trend of a widening gap between CPU capacity and network bandwidth. Two notable scenarios where modern TCP servers suffer from poor performance are handling short-lived connections and layer-7 (L7) proxying. Short-lived connections incur a serious overhead in processing small control packets, while an L7 proxy requires substantial compute cycles and memory bandwidth for relaying packets between two connections.

While recent kernel-bypass TCP stacks [5, 30, 41, 55, 61] have substantially improved the performance of short RPC transactions, they still need to track flow states, whose computation cost is as large as 60% of all CPU cycles (§2). An alternative might be to adopt RDMA [37, 43] or a custom RPC protocol [44], but the former requires extra in-network support [7, 8, 70] while the latter is limited to closed environments. On the other hand, an application-level proxy such as an L7 load balancer (LB) may benefit from zero copying (e.g., via the splice() system call), but it must still perform expensive DMA operations that waste memory bandwidth.

The root cause of the problem is actually clear: the TCP stack must maintain mechanical protocol conformance regardless of what the application does. For instance, a key-value server has to synchronize the state at connection setup and closure even when it handles only two data packets for a query. An L7 LB must relay the content between two separate connections even if its core functionality is determining the back-end server.

AccelTCP addresses this problem by exploiting modern network interface cards (NICs) as a TCP protocol accelerator. It presents a dual-stack TCP design that splits the functionality between a host stack and a NIC stack. The host stack holds the main control of all TCP operations; it reliably sends data to and receives data from applications, and it performs control-plane operations such as congestion and flow control. In contrast to existing TCP stacks, however, it accelerates TCP processing by selectively offloading stateful operations to the NIC stack. Once offloaded, the NIC stack processes connection setup and teardown as well as connection splicing, which relays packets of two connections entirely on the NIC. The goal of AccelTCP is to extend the performance benefit of traditional NIC offload to short-lived connections and application-level proxying while being complementary to existing offloading schemes.

Our design brings two practical benefits. First, it significantly saves the compute cycles and memory bandwidth of the host stack because it simplifies the code path. Connection management on the NIC simplifies the host stack, since the host needs to keep only the established connections, and it avoids frequent DMA operations for small control packets.
Also, forwarding packets of spliced connections directly on the NIC eliminates DMA operations and application-level processing. This allows the application to spend precious CPU cycles on its main functionality. Second, the host stack makes offloading decisions flexibly on a per-flow basis. When an L7 LB needs to check the content of a response for select flows, it opts them out of offloading while other flows still benefit from connection splicing on the NIC. When the host stack detects that the NIC is overloaded, it can opportunistically reduce the offloading rate and use the CPU instead.

However, performing stateful TCP operations on the NIC is non-trivial due to the following challenges. First, maintaining consistency of transmission control blocks (TCBs) across the host and NIC stacks is challenging because any operation on one stack inherently makes its state deviate from that of the other. To address the problem, AccelTCP always transfers the ownership of a TCB along with an offloaded task. This ensures that a single entity solely holds the ownership and updates the TCB state at any given time. Second, stateful TCP operations increase the implementation complexity on the NIC. AccelTCP manages the complexity in two respects. It exploits modern smart NICs equipped with tens of processing cores and a large memory, which allows flexible packet processing with C and/or P4 [33]. It also limits the complexity by resorting to a stateless protocol or by cooperating with the host stack.
As a result, the entire code for the NIC stack is only 1,501 lines of C code and 195 lines of P4 code, which is small enough to manage on the NIC.

Our evaluation shows that AccelTCP brings an enormous performance gain. It outperforms mTCP [41] by 2.2x to 3.8x, and it enables non-persistent connections to perform comparably to persistent connections on IX [30] or mTCP. AccelTCP's connection splicing offload achieves a full line rate of 80 Gbps for L7 proxying of 512-byte messages with only a single CPU core. In terms of real-world applications, AccelTCP improves the performance of Redis [17] and HAProxy [6] by a factor of 2.3x and 11.9x, respectively.

The contributions of our work are summarized as follows. (1) We quantify and present the overhead of TCP protocol conformance in short-lived connections and L7 proxying. (2) We present the design of AccelTCP, a dual-stack TCP processing system that offloads select features of stateful TCP operations to the NIC. We explain the rationale for our target tasks of NIC offload, and present a number of techniques that reduce the implementation complexity on a smart NIC.
(3) We demonstrate a significant performance benefit of AccelTCP over existing kernel-bypass TCP stacks such as mTCP and IX, as well as the benefit to real-world applications such as a key-value server and an L7 LB.

## 2 Background and Motivation

In this section, we briefly explain the need for a NIC-accelerated TCP stack and discuss our approach.

### 2.1 TCP Overhead in Short Connections & L7 Proxying

Short-lived TCP connections are prevalent in data centers [31, 65] as well as in wide-area networks [54, 64, 66]. L7 proxying is also widely used in middlebox applications such as L7 LBs [6, 36] and application-level gateways [2, 19]. Unfortunately, application-level performance of these workloads is often suboptimal because the majority of CPU cycles are spent on TCP stack operations. To better understand the cost, we analyze the overhead of the TCP stack operations in these workloads. To avoid the inefficiency of the kernel stack [38, 39, 60], we use mTCP [41], a scalable user-level TCP stack on DPDK [10], as our baseline stack for evaluation. We use one machine as a server (or proxy), four clients, and four back-end servers, all equipped with a 40GbE NIC. The detailed experimental setup is in §6.

**Small message transactions**: To measure the overhead of a short-lived TCP connection, we compare the performance of non-persistent vs. persistent connections with a large number of concurrent RPC transactions. We spawn 16k connections where each transaction exchanges one small request and one small response (64B) between a client and a server. A non-persistent connection performs only a single transaction, while a persistent connection repeats the transactions without a closure. To minimize the number of small packets, we patch mTCP to piggyback every ACK on the data packet.
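
As a rough illustration of what such a transaction costs in packets, the sketch below counts the packets exchanged per request/response pair with ACKs piggybacked wherever possible. The exact non-persistent sequence (three-way handshake, FIN piggybacked on the response, then FIN-ACK and a final ACK) is an assumption made for illustration, and all names in the code are hypothetical.

```python
# Packets per request/response transaction, with every ACK piggybacked
# on a data or control packet where possible.
# ASSUMPTION: the non-persistent sequence below is illustrative only.
NON_PERSISTENT = [
    ("client", "SYN"),
    ("server", "SYN-ACK"),
    ("client", "ACK + request"),
    ("server", "response + FIN"),
    ("client", "FIN-ACK"),
    ("server", "ACK"),
]

# A persistent connection reuses an established connection,
# so only the two data packets remain.
PERSISTENT = [
    ("client", "request"),
    ("server", "response"),
]


def packets_per_transaction(sequence):
    """Total number of packets exchanged for one transaction."""
    return len(sequence)


def control_only_packets(sequence):
    """Packets that carry no application data (pure connection management)."""
    return [kind for _, kind in sequence
            if "request" not in kind and "response" not in kind]


ratio = packets_per_transaction(NON_PERSISTENT) / packets_per_transaction(PERSISTENT)
print(packets_per_transaction(PERSISTENT))      # 2
print(packets_per_transaction(NON_PERSISTENT))  # 6
print(ratio)                                    # 3.0
```

Under this assumed sequence, four of the six packets in the non-persistent case carry no application data at all, which is the kind of work a connection-management offload would take off the host.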

![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/70024299-12a4-4313-8846-ce7a9ffc67ef/ScreenShot_2021-03-24_at_11.02.13.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/70024299-12a4-4313-8846-ce7a9ffc67ef/ScreenShot_2021-03-24_at_11.02.13.png)

![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/c2053ae2-e5d6-46e7-809d-455024ad001c/ScreenShot_2021-03-24_at_11.04.21.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/c2053ae2-e5d6-46e7-809d-455024ad001c/ScreenShot_2021-03-24_at_11.04.21.png)

Figure 1 shows that persistent connections outperform non-persistent connections by 2.6x to 3.2x. The connection management overhead is roughly proportional to the number of extra packets that it handles: two packets per transaction with a persistent connection vs. six packets for the same task with a non-persistent connection. Table 1 shows the breakdown of the CPU cycles, where almost 60% of them are attributed to connection setup and teardown. The overhead mainly comes from TCP protocol handling with connection table management, TCB construction and destruction, packet I/O, and L2/L3-level processing of control packets.

Our experiments may explain the strong preference for persistent connections in data centers. However, not all applications benefit from persistency.
When application data is inherently small or transferred sporadically [32, 69], a persistent connection results in periods of inactivity that tax server resources. Similarly, persistent connections are often deprecated in PHP applications to avoid the risk of resource misuse [28]. In general, supporting persistent connections is cumbersome and error-prone because the application not only needs to keep track of connection states, but also has to periodically check connection timeouts and terminate idle connections. By eliminating the connection management cost with NIC offload, our work intends to free developers from this burden so that they can choose the best approach without performance concerns.

**Application-level proxying**: An L7 proxy typically operates by (1) terminating a client connection, (2) accepting a request from the client, determining the back-end server from it, and creating a server-side connection, and (3) relaying the content between the client and the back-end server. While the key functionality of an L7 proxy is to map a client-side connection to a back-end server, it spends most of its CPU cycles on relaying the packets between the two connections. Packet relaying incurs a severe memory copying overhead as well as frequent context switches between the TCP stack and the application. While zero-copying APIs like splice() can mitigate the overhead, DMA operations between the host memory and the NIC are unavoidable even with a kernel-bypass TCP stack.
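
The three proxy steps above can be sketched with ordinary sockets. This is a minimal illustration, not the proxy used in the measurements: the function names, the back-end selection policy (a hash of the request path), and the single-`recv` request handling are all assumptions. The point is that the routing decision in step (2) is a few lines, while step (3) touches every byte in host memory.

```python
import socket
import zlib


def pick_backend(request, backends):
    """Step (2): choose a back-end from the request.

    ASSUMPTION: hypothetical policy that hashes the URL path of an
    HTTP-like request line; a real L7 LB would apply its own rules.
    """
    first_line = request.split(b"\r\n", 1)[0]
    parts = first_line.split()
    path = parts[1] if len(parts) >= 2 else b"/"
    return backends[zlib.crc32(path) % len(backends)]


def relay(src, dst, bufsize=4096):
    """Step (3): copy bytes from src to dst until src reaches EOF.

    Every chunk is pulled into host memory and pushed out again, which is
    the copy/DMA cost that NIC-side splicing aims to avoid.
    Returns the number of bytes relayed.
    """
    total = 0
    while True:
        chunk = src.recv(bufsize)
        if not chunk:
            break
        dst.sendall(chunk)
        total += len(chunk)
    return total


def proxy_once(client, backends):
    """Steps (1)-(3) for one already-accepted client connection.

    Simplifications: the request is assumed to fit in one recv(), and the
    back-end is assumed to close its connection after the response.
    """
    request = client.recv(4096)
    host, port = pick_backend(request, backends)
    with socket.create_connection((host, port)) as server:
        server.sendall(request)
        relay(server, client)
```

A quick local check of `relay` can use `socket.socketpair()` in place of real client and server connections: write a few bytes into one pair, shut down the writing side, relay into the second pair, and read the same bytes back.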

Table 2 shows the 1-core performance of a simple L7 proxy on mTCP with 16k persistent connections (8k connections each for client-to-proxy and proxy-to-back-end server). The proxy exchanges n-byte (n=64 or 1500) packets between two connections, and we measure the wire-level throughput at the clients, including control packets. We observe that TCP operations in the proxy degrade the performance by 3.2x to 6.3x compared to simple packet forwarding with DPDK [10], despite using zero-copy splice(). Moreover, DMA operations further degrade the performance by 3.8x for small packets.

**Summary**: We confirm that connection management and packet relaying consume a large amount of CPU cycles, severely limiting the application-level performance. Offloading these operations to the NIC promises a large potential for performance improvement.

### 2.2 NIC Offload of TCP Features

There has been a large body of work and debate on NIC offloading of TCP features [35, 47, 50, 57]. While AccelTCP pursues the same benefit of saving CPU cycles and memory bandwidth, it targets a different class of applications neglected by existing schemes.

**Partial TCP offload**: Modern NICs typically support partial, fixed TCP function offloads such as TCP/IP checksum calculation, TCP segmentation offload (TSO), and large receive offload (LRO). These significantly save CPU cycles for processing large messages because they avoid scanning the packet payload and reduce the number of interrupts to handle. TSO and LRO also improve the DMA throughput because they cut down the DMA setup cost required to deliver many small packets. However, their performance benefit is mostly limited to large data transfers, since short-lived transactions deal with only a few small packets.

**Full stack offload**: TCP Offload Engine (TOE) takes a more ambitious approach that offloads the entire TCP processing to the NIC [34, 67]. Similar to our work, TOE eliminates the CPU cycles and DMA overhead of connection management. It also avoids the DMA transfer of small ACK packets because it manages socket buffers on the NIC. Unfortunately, full stack TOE is unpopular in practice because it requires invasive modification of the kernel stack and the compute resources on the NIC are limited [12]. Operational flexibility is also constrained, as it requires a firmware update to fix bugs, to replace algorithms like congestion control, or to add new TCP options. Microsoft's TCP Chimney [15] deviates from the full stack TOE in that the kernel stack controls all connections while it offloads only data transfer to the NIC.
However, it suffers from similar limitations, which arise because the NIC implements TCP data transfer (e.g., flow reassembly, congestion and flow control, buffer management). As a result, it is rarely enabled these days [27].

In comparison, existing schemes mainly focus on efficient large data transfer, whereas AccelTCP targets performance improvement for short-lived connections and L7 proxying. AccelTCP is complementary to existing partial TCP offloads, as it still exploits them for large data transfer. Similar to TCP Chimney, AccelTCP's host stack assumes full control of the connections. However, the main offloading task is exactly the opposite: AccelTCP offloads connection management while the host stack implements the entire TCP data transfer. This design substantially reduces the complexity on the NIC while extending the benefit to an important class of modern applications.

![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/fb623469-03cb-4193-b64f-b847fbf4cbf9/ScreenShot_2021-03-24_at_11.06.34.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/fb623469-03cb-4193-b64f-b847fbf4cbf9/ScreenShot_2021-03-24_at_11.06.34.png)

### 2.3 Smart NIC for Stateful Offload

Smart NICs [1, 3, 14, 25] are gaining popularity as they support flexible packet processing at high speed with programming languages like C or P4 [33]. Recent smart NICs are flexible enough to run Open vSwitch [62], Berkeley packet filter [49], or even key-value lookup [53], often achieving a 2x to 3x performance improvement over CPU-based solutions [16]. In this work, we use Netronome Agilio LX as a smart NIC platform to offload stateful TCP operations.

As shown in Figure 2, Agilio LX employs 120 flow processing cores (FPCs) running at 1.2GHz. Of these, 36 FPCs are dedicated to special operations (e.g., PCI or Interlaken), while the remaining 84 FPCs can be used for arbitrary packet processing programmed in C and P4. One can implement the basic forwarding path with a match-action table in P4 and add custom actions that require fine-grained logic written in C. The platform also provides fast hashing, checksum calculation, and cryptographic operations implemented in hardware.

One drastic difference from a general-purpose CPU is that FPCs have multiple layers of non-uniform memory access subsystems: registers and memory local to each FPC, shared memory for a cluster of FPCs called an "island", and memory globally accessible by all FPCs. Memory access latency ranges from 1 to 500 cycles depending on the location, where access to smaller memory tends to be faster than access to larger memory. We mainly use internal memory (IMEM, 8MB of SRAM) for flow metadata and external memory (EMEM, 8GB of DRAM) for packet contents. Depending on the flow metadata size, IMEM can support up to 128K to 256K concurrent flows. While EMEM would support more flows, it is 2.5x slower. Each FPC can run up to 8 cooperative threads; access to slow memory by one thread triggers a hardware-based context switch to another thread, which takes only 2 cycles. This hides memory access latency in much the same way as a GPU does.
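
The IMEM flow capacity quoted above can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not a description of the actual memory layout: it assumes binary units (8MB = 8 × 1024 × 1024 bytes, 128K = 128 × 1024) and that all of IMEM is available for flow metadata, under which the 128K to 256K range corresponds to roughly 64 to 32 bytes of metadata per flow.

```python
# ASSUMPTIONS: binary units, and the whole IMEM dedicated to flow metadata.
IMEM_BYTES = 8 * 1024 * 1024  # 8MB of SRAM


def max_flows(mem_bytes, metadata_bytes_per_flow):
    """Number of flows whose metadata fits in the given memory."""
    return mem_bytes // metadata_bytes_per_flow


def metadata_budget(mem_bytes, flows):
    """Bytes of metadata available per flow for a target flow count."""
    return mem_bytes // flows


print(max_flows(IMEM_BYTES, 64))              # 131072 (128K flows)
print(max_flows(IMEM_BYTES, 32))              # 262144 (256K flows)
print(metadata_budget(IMEM_BYTES, 128 * 1024))  # 64
```

Seen this way, every field added to the per-flow state on the NIC directly reduces the number of flows that can stay in the faster memory.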

![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f2977ec3-0a4c-4704-8460-f7dfc395ed77/ScreenShot_2021-03-24_at_11.08.19.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f2977ec3-0a4c-4704-8460-f7dfc395ed77/ScreenShot_2021-03-24_at_11.08.19.png)

Figure 3 shows the packet forwarding performance of Agilio LX as a function of cycles spent by custom C code, where L3 forwarding is implemented in P4. We see that it achieves the line rate (40 Gbps) for any packets larger than 128B. However, the 64B packet forwarding throughput is only 42.9 Mpps (or 28.8 Gbps) even without any custom code. We suspect the bottleneck lies in scattering and gathering packets across the FPCs. The performance starts to drop once the custom code spends more than 200 cycles, so minimizing cycle consumption on the NIC is critical for high performance.
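
The relation between the packet rate and the bit rate quoted above can be checked with a short calculation. The sketch assumes standard Ethernet framing, where each frame occupies an extra 20 bytes on the wire (7B preamble, 1B start-of-frame delimiter, 12B inter-frame gap); the function names are hypothetical. Under that assumption, 42.9 Mpps of 64B packets comes out at about 28.8 Gbps, which matches the figure in the text.

```python
# ASSUMPTION: standard Ethernet per-frame wire overhead of 20 bytes
# (7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap).
ETHERNET_OVERHEAD_BYTES = 20


def wire_gbps(mpps, frame_bytes):
    """Wire-level throughput in Gbps for a given packet rate and frame size."""
    bits_per_frame = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return mpps * 1e6 * bits_per_frame / 1e9


def line_rate_mpps(link_gbps, frame_bytes):
    """Packet rate in Mpps needed to saturate a link with a given frame size."""
    bits_per_frame = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_per_frame / 1e6


print(round(wire_gbps(42.9, 64), 1))      # 28.8
print(round(line_rate_mpps(40, 64), 1))   # 59.5
print(round(line_rate_mpps(40, 128), 1))  # 33.8
```

The second line of output shows what 40 Gbps line rate would require for 64B packets (about 59.5 Mpps), which puts the measured 42.9 Mpps at roughly 72% of line rate.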