AI Networks - RoCEv2 and the role of "netdev"

## Chairs
- David Ahern
- Leon Romanovsky

## Label
Nuts and Bolts

## Session Type
Workshop

## Contents
- [slides](https://netdevconf.info/0x19/docs/netdev-0x19-paper18-talk-slides/netdev-0x19-AI-networking-RoCE-and-netdev.pdf)

## Description
AI training requires high-bandwidth, low-latency networks and networking stacks, because training times are highly dependent on tail latency. Collective communication libraries such as NCCL can use socket-based designs (e.g., TCP/IP) or RDMA operations to move training data between servers. The amount of data to be moved between nodes has exploded with larger LLM sizes. That volume of data, along with the increasing speeds of Ethernet (800G as state of the art), emphasizes the inefficiencies of socket-based networking. A data path in which packets traverse a full networking stack has too much overhead, affecting both throughput and latency, to compete with RDMA and implementations like RoCEv2.

So what does that mean for “netdev”? While RDMA operations are used to efficiently move data, the RoCEv2 protocol itself is based on standard networking protocols (UDP/IP/Ethernet), and the core InfiniBand S/W stack leverages the Linux networking stack (aka “netdev”) where possible.

In this workshop, we will discuss the RoCEv2 protocol for AI training networks and the role of “netdev”. This supporting role includes the netdev device model with operations for H/W offloads, port state (MTU and carrier), and network addresses, as well as routing and neighbor resolution. The socket-based stack is also used for out-of-band communications (e.g., exchanging metadata). We will also revisit the solution presented at netdev 0x16 that shows how to connect Linux TCP with QPs to avoid the traditional overhead of sockets and full-stack traversal, improving performance while re-using the advantages of TCP and its congestion control protocols.
Finally, we will review the recent contributions to use device memory with Linux TCP, the related io_uring work, and what that means for performance relative to RoCEv2.
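
The description notes that RoCEv2 rides on UDP/IP/Ethernet and that netdev supplies the port MTU and network addresses. The Python sketch below illustrates that layering under stated assumptions and is not taken from the workshop material. The helper names (`pack_bth`, `rdma_payload_mtu`, `wire_efficiency`, `ipv4_to_gid`) are hypothetical. The header sizes assume untagged Ethernet and IPv4 without options, and the MTU selection is a simplified model that only subtracts the IPv4, UDP, BTH, and ICRC bytes; a real driver may reserve more.

```python
import ipaddress
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2

# Per-packet header sizes in bytes (assumed: untagged Ethernet, IPv4, no options)
ETH_HDR = 14
IPV4_HDR = 20
UDP_HDR = 8
BTH_HDR = 12   # InfiniBand Base Transport Header
ICRC = 4       # invariant CRC that trails the RDMA payload
ETH_FCS = 4

# Payload sizes RDMA allows as a path MTU
IB_MTUS = (256, 512, 1024, 2048, 4096)


def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False):
    """Pack a 12-byte BTH; SE, M, PadCnt and TVer are left at zero."""
    if not (0 <= dest_qp < 1 << 24 and 0 <= psn < 1 << 24):
        raise ValueError("dest_qp and psn are 24-bit fields")
    flags = 0                                  # SE | M | PadCnt | TVer
    qp_word = dest_qp                          # top 8 bits are reserved
    psn_word = (int(ack_req) << 31) | psn      # AckReq bit, 7 reserved bits, PSN
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def rdma_payload_mtu(netdev_mtu):
    """Largest RDMA payload size that fits in the netdev (IP) MTU.

    Simplified model: only IPv4 + UDP + BTH + ICRC are subtracted.
    """
    room = netdev_mtu - (IPV4_HDR + UDP_HDR + BTH_HDR + ICRC)
    fitting = [m for m in IB_MTUS if m <= room]
    if not fitting:
        raise ValueError("netdev MTU too small for any RDMA payload size")
    return max(fitting)


def wire_efficiency(payload):
    """Fraction of on-wire bytes that carry payload (preamble and IFG ignored)."""
    wire = ETH_HDR + IPV4_HDR + UDP_HDR + BTH_HDR + payload + ICRC + ETH_FCS
    return payload / wire


def ipv4_to_gid(addr):
    """Map an IPv4 address on the netdev to an IPv4-mapped IPv6 GID."""
    v4 = ipaddress.IPv4Address(addr)
    return ipaddress.IPv6Address(bytes(10) + b"\xff\xff" + v4.packed)


if __name__ == "__main__":
    for mtu in (1500, 4200, 9000):
        payload = rdma_payload_mtu(mtu)
        print(mtu, payload, round(wire_efficiency(payload), 4))
    # 0x04 is used here as the RC SEND Only opcode
    print(pack_bth(0x04, dest_qp=0x12, psn=7, ack_req=True).hex())
    print(ipv4_to_gid("192.0.2.1"))
```

Under this model a default 1500-byte netdev MTU yields a 1024-byte RDMA payload per packet, which is one reason the port MTU configured through netdev matters for RoCEv2 throughput.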