[AI Systems Performance Engineering \[Book\]](https://www.oreilly.com/library/view/ai-systems-performance/9798341627772/)
[GitHub - cfregly/ai-performance-engineering](https://github.com/cfregly/ai-performance-engineering)
## Overview
The definitive guide to dramatically boosting the performance of your AI systems, unlocking peak efficiency at every layer of the AI infrastructure stack. In an era of rapidly evolving generative models, AI Systems Performance Engineering gives professionals practical strategies for optimizing hardware, software, and algorithms to build high-performance, cost-effective AI systems. Written by Chris Fregly, a performance-focused engineering and product leader, this comprehensive resource shows how to transform complex systems into efficient, high-impact AI solutions.
The book walks step by step through fine-tuning GPU CUDA kernels, PyTorch-based algorithms, and multi-node training and inference systems. You will also learn how to scale up GPU clusters, distribute model training jobs, and optimize inference servers.
- Co-design and optimize hardware, software, and algorithms for maximum throughput and cost savings
- Implement cutting-edge inference strategies that reduce latency and increase throughput in real-world environments
- Leverage industry-leading scalability tools and frameworks
- Profile, diagnose, and eliminate performance bottlenecks across complex AI pipelines
- Integrate full-stack optimization techniques for robust, reliable AI system performance
Whether you are an engineer, researcher, or developer, AI Systems Performance Engineering provides a comprehensive roadmap for building robust, scalable, and cost-effective AI systems that excel at both training and inference.
## 📖 Book Chapters Overview
### **Chapter 1: Introduction and AI System Overview**
- The AI Systems Performance Engineer
- Benchmarking and Profiling
- Scaling Distributed Training and Inference
- Managing Resources Efficiently
- Cross-Team Collaboration
- Transparency and Reproducibility
### **Chapter 2: AI System Hardware Overview**
- The CPU and GPU "Superchip"
- NVIDIA Grace CPU & Blackwell GPU
- NVIDIA GPU Tensor Cores and Transformer Engine
- Streaming Multiprocessors, Threads, and Warps
- Ultra-Scale Networking
- [[NVLink]] and [[NVSwitch]]
- Multi-GPU Programming
### **Chapter 3: OS, Docker, and Kubernetes Tuning**
- Operating System Configuration
- GPU Driver and Software Stack
- NUMA Awareness and CPU Pinning
- Container Runtime Optimizations
- [[Kubernetes]] for Topology-Aware Orchestration
- Memory Isolation and Resource Management
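A small illustration of the CPU-pinning idea in this chapter: a Linux-only Python sketch that pins the calling process to a fixed set of cores. The core IDs are placeholders; in practice you would pick the cores that share a NUMA node with the GPU the process drives (e.g. as reported by `nvidia-smi topo -m`).

```python
# Hypothetical sketch of CPU pinning for NUMA locality (Linux only).
# The core IDs below are placeholders, not real topology information.
import os

LOCAL_NODE_CORES = set(range(0, 16))  # assumption: cores 0-15 sit on the GPU's NUMA node

os.sched_setaffinity(0, LOCAL_NODE_CORES)              # 0 = the calling process
print("pinned to cores:", sorted(os.sched_getaffinity(0)))
```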
### **Chapter 4: Tuning Distributed Networking Communication**
- Overlapping Communication and Computation
- [[NCCL]] for Distributed Multi-GPU Communication
- Topology Awareness in NCCL
- Distributed Data Parallel Strategies
- NVIDIA Inference Transfer Library (NIXL)
- In-Network [[SHARP]] Aggregation
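A minimal sketch of the distributed data parallel pattern covered here, assuming the script is launched with `torchrun` (so `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are set) and an NCCL-enabled PyTorch build; the model and shapes are placeholders.

```python
# Minimal DistributedDataParallel sketch; launch with e.g.
#   torchrun --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # NCCL handles GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])    # gradients all-reduced via NCCL

    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    x = torch.randn(32, 1024, device=local_rank)
    loss = ddp_model(x).square().mean()
    loss.backward()                                     # all-reduce overlaps with backward
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

DDP buckets gradients and launches the NCCL all-reduce while the backward pass is still running, which is exactly the communication/computation overlap theme of this chapter.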
### **Chapter 5: GPU-based Storage I/O Optimizations**
- Fast Storage and Data Locality
- NVIDIA GPUDirect Storage
- Distributed, Parallel File Systems
- Multi-Modal Data Processing with NVIDIA DALI
- Creating High-Quality LLM Datasets
### **Chapter 6: GPU Architecture, CUDA Programming, and Maximizing Occupancy**
- Understanding GPU Architecture
- Threads, Warps, Blocks, and Grids
- CUDA Programming Refresher
- Understanding GPU Memory Hierarchy
- Maintaining High Occupancy and GPU Utilization
- Roofline Model Analysis
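A back-of-the-envelope roofline calculation in the spirit of this chapter; the peak FLOP/s and bandwidth figures below are illustrative assumptions, not hardware specifications quoted from the book.

```python
# Roofline sanity check: a kernel is memory-bound when its arithmetic intensity
# (FLOPs per byte moved) falls below the machine balance point.
peak_flops = 1.0e15                  # assumed peak compute, FLOP/s
peak_bw    = 3.0e12                  # assumed HBM bandwidth, bytes/s
balance    = peak_flops / peak_bw    # FLOPs/byte needed to saturate compute

flops_per_elem = 2                   # e.g. one multiply-add per element
bytes_per_elem = 12                  # read two fp32 operands, write one fp32 result
intensity = flops_per_elem / bytes_per_elem

attainable = min(peak_flops, intensity * peak_bw)
print(f"machine balance:  {balance:.0f} FLOPs/byte")
print(f"kernel intensity: {intensity:.2f} FLOPs/byte -> "
      f"{attainable / 1e12:.1f} TFLOP/s attainable (memory-bound)")
```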
### **Chapter 7: Profiling and Tuning GPU Memory Access Patterns**
- Coalesced vs. Uncoalesced Global Memory Access
- Vectorized Memory Access
- Tiling and Data Reuse Using Shared Memory
- Warp Shuffle Intrinsics
- Asynchronous Memory Prefetching
### **Chapter 8: Occupancy Tuning, Warp Efficiency, and Instruction-Level Parallelism**
- Profiling and Diagnosing GPU Bottlenecks
- Nsight Systems and Compute Analysis
- Tuning Occupancy
- Improving Warp Execution Efficiency
- Exposing Instruction-Level Parallelism
### **Chapter 9: Increasing CUDA Kernel Efficiency and Arithmetic Intensity**
- Multi-Level Micro-Tiling
- Kernel Fusion
- Mixed Precision and Tensor Cores
- Using CUTLASS for Optimal Performance
- Inline PTX and SASS Tuning
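A minimal mixed-precision sketch using `torch.autocast`, so the matmuls can run on Tensor Cores in bf16 while the master weights stay in fp32; the model and shapes are placeholders.

```python
# Mixed-precision forward/backward sketch: compute-heavy ops run in bf16,
# weights and optimizer state remain fp32.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(64, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().square().mean()   # matmuls dispatched to Tensor Cores
loss.backward()
opt.step()
```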
### **Chapter 10: Intra-Kernel Pipelining and Cooperative Thread Block Clusters**
- Intra-Kernel Pipelining Techniques
- Warp-Specialized Producer-Consumer Model
- Persistent Kernels and Megakernels
- Thread Block Clusters and Distributed Shared Memory
- Cooperative Groups
### **Chapter 11: Inter-Kernel Pipelining and CUDA Streams**
- Using Streams to Overlap Compute with Data Transfers
- Stream-Ordered Memory Allocator
- Fine-Grained Synchronization with Events
- Zero-Overhead Launch with CUDA Graphs
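A sketch of the stream-overlap pattern: pinned host memory lets host-to-device copies run asynchronously on a side stream while the default stream computes on previously transferred chunks. Buffer sizes and the per-chunk matmul are stand-ins for real work.

```python
# Overlap H2D copies (copy_stream) with compute (default stream).
import torch

copy_stream = torch.cuda.Stream()
host = torch.randn(8, 1024, 1024).pin_memory()      # pinned memory enables async copies
weight = torch.randn(1024, 1024, device="cuda")
device_chunks = [torch.empty(1024, 1024, device="cuda") for _ in range(8)]

for i in range(8):
    with torch.cuda.stream(copy_stream):
        device_chunks[i].copy_(host[i], non_blocking=True)   # async copy of chunk i
    torch.cuda.current_stream().wait_stream(copy_stream)     # compute waits on that copy
    out = device_chunks[i] @ weight                           # chunk i+1 can copy meanwhile
torch.cuda.synchronize()
```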
### **Chapter 12: Dynamic and Device-Side Kernel Orchestration**
- Dynamic Scheduling with Atomic Work Queues
- Batch Repeated Kernel Launches with CUDA Graphs
- Dynamic Parallelism
- Orchestrate Across Multiple GPUs with NVSHMEM
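A sketch of batching repeated launches with `torch.cuda.CUDAGraph`: the kernel sequence is captured once (after a warm-up on a side stream, as PyTorch recommends) and then replayed with a single launch per step. Shapes must stay static; the buffers here are placeholders.

```python
# Capture a short kernel sequence into a CUDA graph, then replay it cheaply.
import torch

static_in  = torch.randn(256, 1024, device="cuda")
weight     = torch.randn(1024, 1024, device="cuda")
static_out = torch.empty(256, 1024, device="cuda")

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out.copy_(torch.relu(static_in @ weight))
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out.copy_(torch.relu(static_in @ weight))   # recorded, not executed

for _ in range(100):
    static_in.copy_(torch.randn(256, 1024, device="cuda"))  # refresh static input
    graph.replay()                                           # one launch per step
```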
### **Chapter 13: Profiling, Tuning, and Scaling PyTorch**
- NVTX Markers and Profiling Tools
- PyTorch Compiler (torch.compile)
- Profiling and Tuning Memory in PyTorch
- Scaling with PyTorch Distributed
- Multi-GPU Profiling with HTA
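A short profiling sketch combining `torch.profiler` with NVTX ranges so the forward and backward phases show up as named regions in Nsight Systems timelines; the model is a placeholder.

```python
# Profile one training step and mark its phases with NVTX ranges.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(2048, 2048).cuda()
x = torch.randn(128, 2048, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.cuda.nvtx.range_push("forward")
    y = model(x)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    y.square().mean().backward()
    torch.cuda.nvtx.range_pop()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```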
### **Chapter 14: PyTorch Compiler, XLA, and OpenAI Triton Backends**
- PyTorch Compiler Deep Dive
- Writing Custom Kernels with OpenAI Triton
- PyTorch XLA Backend
- Advanced Triton Kernel Implementations
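A minimal custom Triton kernel in the style of this chapter: a masked, block-wise element-wise add. The block size and names are illustrative.

```python
# Element-wise add written as an OpenAI Triton kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                  # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

a = torch.randn(1 << 20, device="cuda")
b = torch.randn(1 << 20, device="cuda")
assert torch.allclose(add(a, b), a + b)
```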
### **Chapter 15: Multi-Node Inference Parallelism and Routing**
- Disaggregated Prefill and Decode Architecture
- Parallelism Strategies for MoE Models
- Speculative and Parallel Decoding Techniques
- Dynamic Routing Strategies
### **Chapter 16: Profiling, Debugging, and Tuning Inference at Scale**
- Workflow for Profiling and Tuning Performance
- Dynamic Request Batching and Scheduling
- Systems-Level Optimizations
- Quantization Approaches for Real-Time Inference
- Application-Level Optimizations
### **Chapter 17: Scaling Disaggregated Prefill and Decode**
- Prefill-Decode Disaggregation Benefits
- Prefill Workers Design
- Decode Workers Design
- Disaggregated Routing and Scheduling Policies
- Scalability Considerations
### **Chapter 18: Advanced Prefill-Decode and KV Cache Tuning**
- Optimized Decode Kernels (FlashMLA, ThunderMLA, FlexDecoding)
- Tuning KV Cache Utilization and Management
- Heterogeneous Hardware and Parallelism Strategies
- SLO-Aware Request Management
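A toy sketch of preallocated KV cache management: one contiguous key buffer and one value buffer are written in place on each decode step, so the hot path never reallocates. The class name, shapes, and single-batch layout are assumptions for illustration, not the book's implementation.

```python
# Preallocated KV cache for a single sequence (illustrative shapes).
import torch

class KVCache:
    def __init__(self, max_seq_len, n_heads, head_dim, device="cuda", dtype=torch.float16):
        shape = (1, n_heads, max_seq_len, head_dim)   # batch of 1 for simplicity
        self.k = torch.zeros(shape, device=device, dtype=dtype)
        self.v = torch.zeros(shape, device=device, dtype=dtype)
        self.len = 0

    def append(self, k_step, v_step):
        # k_step / v_step: (1, n_heads, 1, head_dim) for one decode token
        self.k[:, :, self.len : self.len + 1] = k_step
        self.v[:, :, self.len : self.len + 1] = v_step
        self.len += 1
        return self.k[:, :, : self.len], self.v[:, :, : self.len]

cache = KVCache(max_seq_len=4096, n_heads=32, head_dim=128)
k, v = cache.append(
    torch.randn(1, 32, 1, 128, device="cuda", dtype=torch.float16),
    torch.randn(1, 32, 1, 128, device="cuda", dtype=torch.float16),
)
```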
### **Chapter 19: Dynamic and Adaptive Inference Engine Optimizations**
- Adaptive Parallelism Strategies
- Dynamic Precision Changes
- Kernel Auto-Tuning
- Reinforcement Learning Agents for Runtime Tuning
- Adaptive Batching and Scheduling
### **Chapter 20: AI-Assisted Performance Optimizations**
- AlphaTensor AI-Discovered Algorithms
- Automated GPU Kernel Optimizations
- Self-Improving AI Agents
- Scaling Toward Multi-Million GPU Clusters