[https://conferences.sigcomm.org/sigcomm/2025/accepted-papers/](https://conferences.sigcomm.org/sigcomm/2025/accepted-papers/)
- Low-Overhead Distributed Application Observation with DeepTrace: Achieving Accurate Tracing in Production Systems
- Software-based Live Migration for RDMA
- Hawkeye: Diagnosing RDMA Network Performance Anomalies with PFC Provenance
- Towards LLM-Based Failure Localization in Production-Scale Networks
- SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training
- ByteTracker: An Agentless and Real-time Path-aware Network Probing System
- Astral: A Datacenter Infrastructure for Large Language Model Training at Scale
- SkyNet: Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures
- Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework
- Alibaba Stellar: A New Generation RDMA Network for Cloud AI
- ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs
SIGCOMM Workshop
Keynote
- 09:00 — 09:55 | Keynote 1: Improving the Performance and Resiliency for Large-Scale Distributed Training | Minlan Yu (Harvard University) [https://conferences.sigcomm.org/sigcomm/2025/workshop/naic/](https://conferences.sigcomm.org/sigcomm/2025/workshop/naic/)
- 14:05 — 14:55 | Keynote 2: Cross-Layer Innovations in Network Design for AI at Meta | Ying Zhang (Meta)
Papers
- Toward eBPF-Accelerated Pub-Sub Systems (Full)
- Empowering machine-learning assisted kernel decisions with eBPF^ML (Short)
- eInfer: Unlocking Fine-Grained Tracing for Distributed LLM Inference with eBPF (Full)
- NSX: Large-Scale Network Simulation on an AI Server
- MLSynth: Towards Synthetic ML Traces
- Simulating LLM training workloads for heterogeneous compute and network infrastructure (Short)
- RTT- or Bandwidth-Bound? Demystifying the KV Cache Transfer in Large Language Model Serving (Short)
- Intent Fuel Station: A RAG-Enhanced Agent Hub for Realizing Networking Intents (Short)
- LIFT: Automating Symbolic Execution Optimization with Large Language Models for AI Networks
- Quantifying the Impact of Job Placement and Routing on Network Efficiency in AI Clusters
- Towards a Playground to Democratize Experimentation and Benchmarking of AI Agents for Network Troubleshooting
- 15:20 — 15:40 | DeepFlow Agent Live Demo: Vibe-Style Microservice Troubleshooting with eBPF & MCP | Yang Xiang