[https://conferences.sigcomm.org/sigcomm/2025/accepted-papers/](https://conferences.sigcomm.org/sigcomm/2025/accepted-papers/)

- Low-Overhead Distributed Application Observation with DeepTrace: Achieving Accurate Tracing in Production Systems
- Software-based Live Migration for RDMA
- Hawkeye: Diagnosing RDMA Network Performance Anomalies with PFC Provenance
- Towards LLM-Based Failure Localization in Production-Scale Networks
- SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training
- ByteTracker: An Agentless and Real-time Path-aware Network Probing System
- Astral: A Datacenter Infrastructure for Large Language Model Training at Scale
- SkyNet: Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures
- Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework
- Alibaba Stellar: A New Generation RDMA Network for Cloud AI
- ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs

SIGCOMM Workshop Keynote

- 09:00-09:55 | Keynote 1: Improving the Performance and Resiliency for Large-Scale Distributed Training | Minlan Yu (Harvard University)

[https://conferences.sigcomm.org/sigcomm/2025/workshop/naic/](https://conferences.sigcomm.org/sigcomm/2025/workshop/naic/)

- 14:05-14:55 | Keynote 2: Cross-Layer Innovations in Network Design for AI at Meta | Ying Zhang (Meta)

Papers

- Toward eBPF-Accelerated Pub-Sub Systems (Full)
- Empowering machine-learning assisted kernel decisions with eBPF^ML (Short)
- eInfer: Unlocking Fine-Grained Tracing for Distributed LLM Inference with eBPF (Full)
- NSX: Large-Scale Network Simulation on an AI Server
- MLSynth: Towards Synthetic ML Traces
- Simulating LLM training workloads for heterogeneous compute and network infrastructure (Short)
- RTT- or Bandwidth-Bound? Demystifying the KV Cache Transfer in Large Language Model Serving (Short)
- Intent Fuel Station: A RAG-Enhanced Agent Hub for Realizing Networking Intents (Short)
- LIFT: Automating Symbolic Execution Optimization with Large Language Models for AI Networks
- Quantifying the Impact of Job Placement and Routing on Network Efficiency in AI Clusters
- Towards a Playground to Democratize Experimentation and Benchmarking of AI Agents for Network Troubleshooting
- 15:20-15:40 | DeepFlow Agent Live Demo: Vibe-Style Microservice Troubleshooting with eBPF & MCP | Yang Xiang