Hardware Counters - yuuk1's Digital Garden

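# Hardware Counters

## Definition

A hardware counter (PMC: Performance Monitoring Counter) is a counter built into a CPU or GPU that measures microarchitectural events such as cache misses, memory accesses, CPU cycles, and stall cycles at low cost. Page faults are usually collected alongside them, although in Linux `perf` they are a software event counted by the kernel rather than by the hardware. PMCs are used to determine whether a performance bottleneck is memory-bound or compute-bound. ([[@2026__arXiv__ProfInfer - An eBPF-based Fine-Grained LLM Inference Profiler]]) On GPUs the set of internal counters is limited, so they are often used together with host-side software instrumentation.

## Cross-cutting findings

- **%CPU diverges from PMCs: IPC reflects the real work done**: [[Brendan Gregg]] (2017) showed with measurements that IPC (Instructions Per Cycle), computed from the `cycles` and `instructions` counts reported by `perf stat`, separates memory-bound from instruction-bound workloads in a way that %CPU cannot. As a rule of thumb, IPC < 1.0 suggests the workload is likely stalled on memory, and IPC ≥ 1.0 suggests it is likely instruction-bound. (Source: [[@2017__brendangregg.com__CPU Utilization is Wrong]])
- **Tying PMCs to the operator level identifies the kind of bottleneck**: [[@2026__arXiv__ProfInfer - An eBPF-based Fine-Grained LLM Inference Profiler]] reads PMCs such as `l3d_cache_refill` (volume of memory reads), `mem_access_wr`, `cycles`, and `idle-backend-cycles` (stalls) at each operator's uprobe and uretprobe and takes the difference between the two readings. With this it quantifies a memory-bandwidth bottleneck in which more than 50% of CPU cycles stall during matrix-vector multiplication. PMCs become useful for diagnosis only when they are mapped to operators and tensor dimensions, not when read in isolation. (Source: [[@2026__arXiv__ProfInfer - An eBPF-based Fine-Grained LLM Inference Profiler]])
- **Existing tools center on counters and PC sampling and lean toward the host's view**: [[@2024__TOPC__Low-Overhead Trace Collection and Profiling on GPU Compute Kernels]] observes that ROC-profiler, VTune, HPCToolkit, and similar tools rely mainly on counters and PC sampling and therefore lean toward a host-side view, and it goes further by collecting traces on the device itself. Counter-based statistics and traces from device instrumentation are complementary, and integrating the two emerges as an open problem. (Source: [[@2024__TOPC__Low-Overhead Trace Collection and Profiling on GPU Compute Kernels]])
- **GPU-internal information still depends on counters**: [[@2025__HCDS__eGPU - Extending eBPF Programmability and Observability to GPUs]] and [[@2025__eBPF__eInfer - Unlocking Fine-Grained Tracing for Distributed LLM Inference with eBPF]] acknowledge that, even with host-side eBPF instrumentation, GPU microarchitectural information such as SM utilization still depends on hardware counters. Software instrumentation cannot cover everything, which is why PMCs keep their role ([[GPU観測性]], GPU observability). (Source: [[@2025__HCDS__eGPU - Extending eBPF Programmability and Observability to GPUs]], [[@2025__eBPF__eInfer - Unlocking Fine-Grained Tracing for Distributed LLM Inference with eBPF]])

## Open questions

- How far can software instrumentation (LD/ST hooks, uprobes) substitute for the limited set of GPU hardware counters?
- To what extent can GPU/NPU-side PMCs be integrated with host-side eBPF instrumentation? (ProfInfer covers CPU, OpenCL, and NPU, but its degree of PMC integration is limited.)
- What complementary insights emerge when counter statistics (the host's view) are combined with on-device traces?

## Related

- Sources: [[@2017__brendangregg.com__CPU Utilization is Wrong]] / [[@2026__arXiv__ProfInfer - An eBPF-based Fine-Grained LLM Inference Profiler]] / [[@2024__TOPC__Low-Overhead Trace Collection and Profiling on GPU Compute Kernels]] / [[@2025__HCDS__eGPU - Extending eBPF Programmability and Observability to GPUs]] / [[@2025__eBPF__eInfer - Unlocking Fine-Grained Tracing for Distributed LLM Inference with eBPF]]
- Concepts: [[CPU利用率]] (CPU utilization) / [[Instructions Per Cycle]] / [[GPU観測性]] (GPU observability) / [[動的計装]] (dynamic instrumentation) / [[LLM推論]] (LLM inference) / [[テレメトリ]] (telemetry)
- Entities: [[CUPTI]] / [[NVBit]] / [[hip-analyzer]]
- Related MOC: [[AI Infra Telemetry - MOC]]

## References

- [[@2017__brendangregg.com__CPU Utilization is Wrong]] (diagnosing CPU vs. memory bottlenecks with IPC; why %CPU is misleading)
- [[@2026__arXiv__ProfInfer - An eBPF-based Fine-Grained LLM Inference Profiler]] (per-operator PMCs; identifying memory-bandwidth bottlenecks from stalls)
- [[@2024__TOPC__Low-Overhead Trace Collection and Profiling on GPU Compute Kernels]] (contrast between counter-centric tools and device instrumentation)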
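
## Worked example

The two calculations this note relies on, IPC from `cycles` and `instructions` and the per-operator difference between an entry and an exit counter reading, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by ProfInfer or `perf`: the `CounterSnapshot` type, its field names, the helper functions, and the sample numbers are all hypothetical, and the 1.0 threshold is the rule of thumb quoted above rather than a hard boundary. In practice the raw counts would come from a tool such as `perf stat` or from eBPF probes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSnapshot:
    """Hypothetical PMC reading taken at one point in time."""
    cycles: int
    instructions: int
    stalled_cycles: int


def delta(entry: CounterSnapshot, exit_: CounterSnapshot) -> CounterSnapshot:
    """Counts attributable to the code between the entry and exit readings."""
    return CounterSnapshot(
        cycles=exit_.cycles - entry.cycles,
        instructions=exit_.instructions - entry.instructions,
        stalled_cycles=exit_.stalled_cycles - entry.stalled_cycles,
    )


def ipc(counts: CounterSnapshot) -> float:
    """Instructions per cycle over the measured interval."""
    if counts.cycles <= 0:
        raise ValueError("cycles must be positive to compute IPC")
    return counts.instructions / counts.cycles


def stall_fraction(counts: CounterSnapshot) -> float:
    """Share of cycles in the interval that were stall cycles."""
    if counts.cycles <= 0:
        raise ValueError("cycles must be positive to compute a stall fraction")
    return counts.stalled_cycles / counts.cycles


def classify(ipc_value: float, threshold: float = 1.0) -> str:
    """Apply the IPC rule of thumb; the threshold is a heuristic, not a law."""
    if ipc_value < threshold:
        return "likely memory-stalled"
    return "likely instruction-bound"


# Made-up readings around one operator call (assumed values, for illustration).
at_entry = CounterSnapshot(cycles=1_000_000, instructions=800_000, stalled_cycles=100_000)
at_exit = CounterSnapshot(cycles=3_000_000, instructions=1_600_000, stalled_cycles=1_300_000)

per_operator = delta(at_entry, at_exit)

if __name__ == "__main__":
    print(f"IPC            = {ipc(per_operator):.2f}")
    print(f"stall fraction = {stall_fraction(per_operator):.2f}")
    print(f"verdict        = {classify(ipc(per_operator))}")
```

With these assumed readings the operator retires 800,000 instructions in 2,000,000 cycles, of which 1,200,000 are stalls, so both the low IPC and the stall share above 50% point the same way. Keeping the difference step separate from the ratios mirrors the idea that counters only become diagnostic once they are attributed to a specific operator.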