[AI Systems Performance Engineering \[Book\]](https://www.oreilly.com/library/view/ai-systems-performance/9798341627772/)

[GitHub - cfregly/ai-performance-engineering](https://github.com/cfregly/ai-performance-engineering)

## Overview

The definitive guide to dramatically improving the performance of AI systems, showing how to unlock peak efficiency at every layer of the AI infrastructure stack. As generative models evolve at a rapid pace, AI Systems Performance Engineering gives practitioners hands-on strategies for optimizing hardware, software, and algorithms to build high-performance, cost-effective AI systems. Written by Chris Fregly, a performance-focused engineering and product leader, this comprehensive resource shows how to turn complex systems into efficient, high-impact AI solutions.

The book walks through step-by-step techniques for fine-tuning GPU CUDA kernels, PyTorch-based algorithms, and multi-node training and inference systems. You will also learn to scale up GPU clusters, run distributed model training jobs, and optimize inference servers. Key topics include how to:

- Co-design and optimize hardware, software, and algorithms for maximum throughput and cost savings
- Implement state-of-the-art inference strategies that cut latency and raise throughput in real-world environments
- Leverage industry-leading scalability tools and frameworks
- Profile, diagnose, and eliminate performance bottlenecks across complex AI pipelines
- Integrate full-stack optimization techniques for robust, reliable AI system performance

Whether you are an engineer, researcher, or developer, AI Systems Performance Engineering offers a comprehensive roadmap for building robust, scalable, and cost-effective AI systems that excel at both training and inference.

# 📖 Book Chapters Overview

### **Chapter 1: Introduction and AI System Overview**

- The AI Systems Performance Engineer
- Benchmarking and Profiling
- Scaling Distributed Training and Inference
- Managing Resources Efficiently
- Cross-Team Collaboration
- Transparency and Reproducibility

### **Chapter 2: AI System Hardware Overview**

- The CPU and GPU "Superchip"
- NVIDIA Grace CPU & Blackwell GPU
- NVIDIA GPU Tensor Cores and Transformer Engine
- Streaming Multiprocessors, Threads, and Warps
- Ultra-Scale Networking
- [[NVLink]] and [[NVSwitch]]
- Multi-GPU Programming

### **Chapter 3: OS, Docker, and Kubernetes Tuning**

- Operating System Configuration
- GPU Driver and Software Stack
- NUMA Awareness and CPU Pinning
- Container Runtime Optimizations
- [[Kubernetes]] for Topology-Aware Orchestration
- Memory Isolation and Resource Management

### **Chapter 4: Tuning Distributed Networking Communication**

- Overlapping Communication and Computation
- [[NCCL]] for Distributed Multi-GPU Communication
- Topology Awareness in NCCL
- Distributed Data Parallel Strategies
- NVIDIA Inference Transfer Library (NIXL)
- In-Network [[SHARP]] Aggregation

### **Chapter 5: GPU-based Storage I/O Optimizations**

- Fast Storage and Data Locality
- NVIDIA GPUDirect Storage
- Distributed, Parallel File Systems
- Multi-Modal Data Processing with NVIDIA DALI
- Creating High-Quality LLM Datasets

### **Chapter 6: GPU Architecture, CUDA Programming, and Maximizing Occupancy**

- Understanding GPU Architecture
- Threads, Warps, Blocks, and Grids
- CUDA Programming Refresher
- Understanding GPU Memory Hierarchy
- Maintaining High Occupancy and GPU Utilization
- Roofline Model Analysis (see the sketch below)
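To make the Chapter 6 "Roofline Model Analysis" bullet a bit more concrete, here is a minimal sketch (mine, not the book's) that estimates whether a GEMM is compute-bound or memory-bound by comparing its arithmetic intensity to the hardware ridge point. The peak FLOP/s and bandwidth figures below are placeholder assumptions; substitute the numbers for your own GPU.

```python
# Minimal roofline sketch: compare a GEMM's arithmetic intensity to the ridge point.
# Peak numbers are illustrative placeholders, not specs for any particular GPU.

PEAK_TFLOPS = 60.0      # assumed peak compute, TFLOP/s (placeholder)
PEAK_BW_GBS = 2000.0    # assumed HBM bandwidth, GB/s (placeholder)

def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for C = A @ B, assuming the minimum possible memory traffic."""
    flops = 2.0 * m * n * k                                   # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)    # read A and B once, write C once
    return flops / bytes_moved

# Ridge point: the intensity where the roofline bends from memory-bound to compute-bound.
ridge_point = (PEAK_TFLOPS * 1e12) / (PEAK_BW_GBS * 1e9)

ai = gemm_arithmetic_intensity(4096, 4096, 4096)
bound = "compute-bound" if ai > ridge_point else "memory-bound"
print(f"arithmetic intensity = {ai:.1f} FLOPs/byte, ridge point = {ridge_point:.1f} -> likely {bound}")
```

Because the traffic estimate assumes ideal data reuse, the printed intensity is an upper bound; a kernel with poor tiling will sit further left on the roofline.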
### **Chapter 7: Profiling and Tuning GPU Memory Access Patterns**

- Coalesced vs. Uncoalesced Global Memory Access
- Vectorized Memory Access
- Tiling and Data Reuse Using Shared Memory
- Warp Shuffle Intrinsics
- Asynchronous Memory Prefetching

### **Chapter 8: Occupancy Tuning, Warp Efficiency, and Instruction-Level Parallelism**

- Profiling and Diagnosing GPU Bottlenecks
- Nsight Systems and Compute Analysis
- Tuning Occupancy
- Improving Warp Execution Efficiency
- Exposing Instruction-Level Parallelism

### **Chapter 9: Increasing CUDA Kernel Efficiency and Arithmetic Intensity**

- Multi-Level Micro-Tiling
- Kernel Fusion
- Mixed Precision and Tensor Cores
- Using CUTLASS for Optimal Performance
- Inline PTX and SASS Tuning

### **Chapter 10: Intra-Kernel Pipelining and Cooperative Thread Block Clusters**

- Intra-Kernel Pipelining Techniques
- Warp-Specialized Producer-Consumer Model
- Persistent Kernels and Megakernels
- Thread Block Clusters and Distributed Shared Memory
- Cooperative Groups

### **Chapter 11: Inter-Kernel Pipelining and CUDA Streams**

- Using Streams to Overlap Compute with Data Transfers
- Stream-Ordered Memory Allocator
- Fine-Grained Synchronization with Events
- Zero-Overhead Launch with CUDA Graphs

### **Chapter 12: Dynamic and Device-Side Kernel Orchestration**

- Dynamic Scheduling with Atomic Work Queues
- Batch Repeated Kernel Launches with CUDA Graphs
- Dynamic Parallelism
- Orchestrate Across Multiple GPUs with NVSHMEM

### **Chapter 13: Profiling, Tuning, and Scaling PyTorch**

- NVTX Markers and Profiling Tools
- PyTorch Compiler (torch.compile)
- Profiling and Tuning Memory in PyTorch
- Scaling with PyTorch Distributed
- Multi-GPU Profiling with HTA

### **Chapter 14: PyTorch Compiler, XLA, and OpenAI Triton Backends**

- PyTorch Compiler Deep Dive
- Writing Custom Kernels with OpenAI Triton
- PyTorch XLA Backend
- Advanced Triton Kernel Implementations

### **Chapter 15: Multi-Node Inference Parallelism and Routing**

- Disaggregated Prefill and Decode Architecture
- Parallelism Strategies for MoE Models
- Speculative and Parallel Decoding Techniques
- Dynamic Routing Strategies

### **Chapter 16: Profiling, Debugging, and Tuning Inference at Scale**

- Workflow for Profiling and Tuning Performance
- Dynamic Request Batching and Scheduling
- Systems-Level Optimizations
- Quantization Approaches for Real-Time Inference
- Application-Level Optimizations

### **Chapter 17: Scaling Disaggregated Prefill and Decode**

- Prefill-Decode Disaggregation Benefits
- Prefill Workers Design
- Decode Workers Design
- Disaggregated Routing and Scheduling Policies
- Scalability Considerations

### **Chapter 18: Advanced Prefill-Decode and KV Cache Tuning**

- Optimized Decode Kernels (FlashMLA, ThunderMLA, FlexDecoding)
- Tuning KV Cache Utilization and Management
- Heterogeneous Hardware and Parallelism Strategies
- SLO-Aware Request Management

### **Chapter 19: Dynamic and Adaptive Inference Engine Optimizations**

- Adaptive Parallelism Strategies
- Dynamic Precision Changes
- Kernel Auto-Tuning
- Reinforcement Learning Agents for Runtime Tuning
- Adaptive Batching and Scheduling

### **Chapter 20: AI-Assisted Performance Optimizations**

- AlphaTensor AI-Discovered Algorithms
- Automated GPU Kernel Optimizations
- Self-Improving AI Agents
- Scaling Toward Multi-Million GPU Clusters
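As a taste of the inter-kernel pipelining material in Chapter 11, the following is a minimal PyTorch sketch (my own, not the book's reference code) that overlaps host-to-device copies with compute by using two CUDA streams. The tensor sizes and the two-stream layout are illustrative assumptions.

```python
import torch

# Sketch: overlap H2D copies (copy_stream) with matmuls (compute_stream).
# Requires a CUDA GPU; sizes and batch count are arbitrary.
assert torch.cuda.is_available(), "this sketch requires a CUDA GPU"

copy_stream = torch.cuda.Stream()       # dedicated stream for host-to-device transfers
compute_stream = torch.cuda.Stream()    # dedicated stream for compute

# Pinned (page-locked) host memory is required for truly asynchronous copies.
host_batches = [torch.randn(4096, 4096).pin_memory() for _ in range(4)]
weight = torch.randn(4096, 4096, device="cuda")
results = []

for batch in host_batches:
    with torch.cuda.stream(copy_stream):
        # non_blocking=True lets the copy run asynchronously on copy_stream
        dev_batch = batch.to("cuda", non_blocking=True)

    # Compute for batch i waits only on the copies enqueued so far,
    # so the copy of batch i+1 can overlap with the matmul of batch i.
    compute_stream.wait_stream(copy_stream)

    with torch.cuda.stream(compute_stream):
        # Tell the caching allocator this memory is still in use on compute_stream.
        dev_batch.record_stream(compute_stream)
        results.append(dev_batch @ weight)

torch.cuda.synchronize()                # drain both streams before reading results
print(results[-1].shape)
```

Whether this actually hides the copy latency depends on the GPU's copy engines and on the compute time per batch; Nsight Systems (Chapter 8) is the tool to confirm the overlap.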
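And as a small taste of the PyTorch-level tooling covered in Chapters 13 and 14, here is a hedged sketch combining `torch.compile` with the built-in profiler. The toy model, shapes, and the choice to profile only CPU activity are arbitrary assumptions for illustration.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy model; sizes chosen only so the run is quick.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Default Inductor backend generates fused kernels (Triton kernels when on GPU).
compiled = torch.compile(model)
x = torch.randn(64, 1024)

compiled(x)  # warm-up call triggers graph capture and code generation

# Profile a steady-state call; add ProfilerActivity.CUDA when running on a GPU.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    compiled(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

Comparing this table against an uncompiled run of the same model is a quick way to see where the compiler actually saved time before reaching for lower-level CUDA tuning.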