[AI Systems Performance Engineering \[Book\]](https://www.oreilly.com/library/view/ai-systems-performance/9798341627772/)

[GitHub - cfregly/ai-performance-engineering](https://github.com/cfregly/ai-performance-engineering)

## Overview

The definitive guide to dramatically improving the performance of AI systems, showing how to unlock peak efficiency at every layer of the AI infrastructure stack. As generative models evolve at a rapid pace, AI Systems Performance Engineering gives practitioners hands-on strategies for optimizing hardware, software, and algorithms to build high-performance, cost-effective AI systems. Written by Chris Fregly, a performance-focused engineering and product leader, this comprehensive resource shows how to turn complex systems into efficient, high-impact AI solutions.

The book walks through step-by-step techniques for fine-tuning GPU CUDA kernels, PyTorch-based algorithms, and multi-node training and inference systems. You will also learn to scale up GPU clusters, run distributed model training jobs, and optimize inference servers. Key topics include how to:

- Co-design and optimize hardware, software, and algorithms for maximum throughput and cost savings
- Implement state-of-the-art inference strategies that cut latency and raise throughput in real-world environments
- Leverage industry-leading scalability tools and frameworks
- Profile, diagnose, and eliminate performance bottlenecks across complex AI pipelines
- Integrate full-stack optimization techniques for robust, reliable AI system performance

Whether you are an engineer, researcher, or developer, AI Systems Performance Engineering offers a comprehensive roadmap for building robust, scalable, and cost-effective AI systems that excel at both training and inference.

# 📖 Book Chapters Overview

### **Chapter 1: Introduction and AI System Overview**

- The AI Systems Performance Engineer
- Benchmarking and Profiling
- Scaling Distributed Training and Inference
- Managing Resources Efficiently
- Cross-Team Collaboration
- Transparency and Reproducibility

### **Chapter 2: AI System Hardware Overview**

- The CPU and GPU "Superchip"
- NVIDIA Grace CPU & Blackwell GPU
- NVIDIA GPU Tensor Cores and Transformer Engine
- Streaming Multiprocessors, Threads, and Warps
- Ultra-Scale Networking
- [[NVLink]] and [[NVSwitch]]
- Multi-GPU Programming

### **Chapter 3: OS, Docker, and Kubernetes Tuning**

- Operating System Configuration
- GPU Driver and Software Stack
- NUMA Awareness and CPU Pinning
- Container Runtime Optimizations
- [[Kubernetes]] for Topology-Aware Orchestration
- Memory Isolation and Resource Management

### **Chapter 4: Tuning Distributed Networking Communication**

- Overlapping Communication and Computation
- [[NCCL]] for Distributed Multi-GPU Communication
- Topology Awareness in NCCL
- Distributed Data Parallel Strategies
- NVIDIA Inference Transfer Library (NIXL)
- In-Network [[SHARP]] Aggregation

### **Chapter 5: GPU-based Storage I/O Optimizations**

- Fast Storage and Data Locality
- NVIDIA GPUDirect Storage
- Distributed, Parallel File Systems
- Multi-Modal Data Processing with NVIDIA DALI
- Creating High-Quality LLM Datasets

### **Chapter 6: GPU Architecture, CUDA Programming, and Maximizing Occupancy**

- Understanding GPU Architecture
- Threads, Warps, Blocks, and Grids
- CUDA Programming Refresher
- Understanding GPU Memory Hierarchy
- Maintaining High Occupancy and GPU Utilization
- Roofline Model Analysis (see the sketch below)
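To make the Chapter 6 "Roofline Model Analysis" bullet a bit more concrete, here is a minimal sketch (mine, not the book's) that estimates whether a GEMM is compute-bound or memory-bound by comparing its arithmetic intensity to the hardware ridge point. The peak FLOP/s and bandwidth figures below are placeholder assumptions; substitute the numbers for your own GPU.

```python
# Minimal roofline sketch: compare a GEMM's arithmetic intensity to the ridge point.
# Peak numbers are illustrative placeholders, not specs for any particular GPU.

PEAK_TFLOPS = 60.0      # assumed peak compute, TFLOP/s (placeholder)
PEAK_BW_GBS = 2000.0    # assumed HBM bandwidth, GB/s (placeholder)

def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for C = A @ B, assuming the minimum possible memory traffic."""
    flops = 2.0 * m * n * k                                   # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)    # read A and B once, write C once
    return flops / bytes_moved

# Ridge point: the intensity where the roofline bends from memory-bound to compute-bound.
ridge_point = (PEAK_TFLOPS * 1e12) / (PEAK_BW_GBS * 1e9)

ai = gemm_arithmetic_intensity(4096, 4096, 4096)
bound = "compute-bound" if ai > ridge_point else "memory-bound"
print(f"arithmetic intensity = {ai:.1f} FLOPs/byte, ridge point = {ridge_point:.1f} -> likely {bound}")
```

Because the traffic estimate assumes ideal data reuse, the printed intensity is an upper bound; a kernel with poor tiling will sit further left on the roofline.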
### **Chapter 7: Profiling and Tuning GPU Memory Access Patterns**

- Coalesced vs. Uncoalesced Global Memory Access
- Vectorized Memory Access
- Tiling and Data Reuse Using Shared Memory
- Warp Shuffle Intrinsics
- Asynchronous Memory Prefetching

### **Chapter 8: Occupancy Tuning, Warp Efficiency, and Instruction-Level Parallelism**

- Profiling and Diagnosing GPU Bottlenecks
- Nsight Systems and Compute Analysis
- Tuning Occupancy
- Improving Warp Execution Efficiency
- Exposing Instruction-Level Parallelism

### **Chapter 9: Increasing CUDA Kernel Efficiency and Arithmetic Intensity**

- Multi-Level Micro-Tiling
- Kernel Fusion
- Mixed Precision and Tensor Cores
- Using CUTLASS for Optimal Performance
- Inline PTX and SASS Tuning

### **Chapter 10: Intra-Kernel Pipelining and Cooperative Thread Block Clusters**

- Intra-Kernel Pipelining Techniques
- Warp-Specialized Producer-Consumer Model
- Persistent Kernels and Megakernels
- Thread Block Clusters and Distributed Shared Memory
- Cooperative Groups

### **Chapter 11: Inter-Kernel Pipelining and CUDA Streams**

- Using Streams to Overlap Compute with Data Transfers
- Stream-Ordered Memory Allocator
- Fine-Grained Synchronization with Events
- Zero-Overhead Launch with CUDA Graphs

### **Chapter 12: Dynamic and Device-Side Kernel Orchestration**

- Dynamic Scheduling with Atomic Work Queues
- Batch Repeated Kernel Launches with CUDA Graphs
- Dynamic Parallelism
- Orchestrate Across Multiple GPUs with NVSHMEM

### **Chapter 13: Profiling, Tuning, and Scaling PyTorch**

- NVTX Markers and Profiling Tools
- PyTorch Compiler (torch.compile)
- Profiling and Tuning Memory in PyTorch
- Scaling with PyTorch Distributed
- Multi-GPU Profiling with HTA

### **Chapter 14: PyTorch Compiler, XLA, and OpenAI Triton Backends**

- PyTorch Compiler Deep Dive
- Writing Custom Kernels with OpenAI Triton
- PyTorch XLA Backend
- Advanced Triton Kernel Implementations

### **Chapter 15: Multi-Node Inference Parallelism and Routing**

- Disaggregated Prefill and Decode Architecture
- Parallelism Strategies for MoE Models
- Speculative and Parallel Decoding Techniques
- Dynamic Routing Strategies

### **Chapter 16: Profiling, Debugging, and Tuning Inference at Scale**

- Workflow for Profiling and Tuning Performance
- Dynamic Request Batching and Scheduling
- Systems-Level Optimizations
- Quantization Approaches for Real-Time Inference
- Application-Level Optimizations

### **Chapter 17: Scaling Disaggregated Prefill and Decode**

- Prefill-Decode Disaggregation Benefits
- Prefill Workers Design
- Decode Workers Design
- Disaggregated Routing and Scheduling Policies
- Scalability Considerations

### **Chapter 18: Advanced Prefill-Decode and KV Cache Tuning**

- Optimized Decode Kernels (FlashMLA, ThunderMLA, FlexDecoding)
- Tuning KV Cache Utilization and Management
- Heterogeneous Hardware and Parallelism Strategies
- SLO-Aware Request Management

### **Chapter 19: Dynamic and Adaptive Inference Engine Optimizations**

- Adaptive Parallelism Strategies
- Dynamic Precision Changes
- Kernel Auto-Tuning
- Reinforcement Learning Agents for Runtime Tuning
- Adaptive Batching and Scheduling

### **Chapter 20: AI-Assisted Performance Optimizations**

- AlphaTensor AI-Discovered Algorithms
- Automated GPU Kernel Optimizations
- Self-Improving AI Agents
- Scaling Toward Multi-Million GPU Clusters
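As a taste of the inter-kernel pipelining material in Chapter 11, the following is a minimal PyTorch sketch (my own, not the book's reference code) that overlaps host-to-device copies with compute by using two CUDA streams. The tensor sizes and the two-stream layout are illustrative assumptions.

```python
import torch

# Sketch: overlap H2D copies (copy_stream) with matmuls (compute_stream).
# Requires a CUDA GPU; sizes and batch count are arbitrary.
assert torch.cuda.is_available(), "this sketch requires a CUDA GPU"

copy_stream = torch.cuda.Stream()       # dedicated stream for host-to-device transfers
compute_stream = torch.cuda.Stream()    # dedicated stream for compute

# Pinned (page-locked) host memory is required for truly asynchronous copies.
host_batches = [torch.randn(4096, 4096).pin_memory() for _ in range(4)]
weight = torch.randn(4096, 4096, device="cuda")
results = []

for batch in host_batches:
    with torch.cuda.stream(copy_stream):
        # non_blocking=True lets the copy run asynchronously on copy_stream
        dev_batch = batch.to("cuda", non_blocking=True)

    # Compute for batch i waits only on the copies enqueued so far,
    # so the copy of batch i+1 can overlap with the matmul of batch i.
    compute_stream.wait_stream(copy_stream)

    with torch.cuda.stream(compute_stream):
        # Tell the caching allocator this memory is still in use on compute_stream.
        dev_batch.record_stream(compute_stream)
        results.append(dev_batch @ weight)

torch.cuda.synchronize()                # drain both streams before reading results
print(results[-1].shape)
```

Whether this actually hides the copy latency depends on the GPU's copy engines and on the compute time per batch; Nsight Systems (Chapter 8) is the tool to confirm the overlap.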
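And as a small taste of the PyTorch-level tooling covered in Chapters 13 and 14, here is a hedged sketch combining `torch.compile` with the built-in profiler. The toy model, shapes, and the choice to profile only CPU activity are arbitrary assumptions for illustration.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy model; sizes chosen only so the run is quick.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Default Inductor backend generates fused kernels (Triton kernels when on GPU).
compiled = torch.compile(model)
x = torch.randn(64, 1024)

compiled(x)  # warm-up call triggers graph capture and code generation

# Profile a steady-state call; add ProfilerActivity.CUDA when running on a GPU.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    compiled(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

Comparing this table against an uncompiled run of the same model is a quick way to see where the compiler actually saved time before reaching for lower-level CUDA tuning.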