Ballpark throughput for distributed training

[Performance - NVIDIA NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-summary.html)
Pre-training on [[H100]]: 230 to 854 model TFLOP/s/GPU, 320 to 14,744 tokens/s/GPU

[GitHub - NVIDIA/Megatron-LM: Ongoing research training transformer models at scale](https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#performance-benchmarking)
![[Pasted image 20250831230508.png]]
- Scales out while maintaining [[MFU]]

[[Strong Scaling and Weak Scaling]]
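To relate the two units above (tokens/s/GPU and model TFLOP/s/GPU) to MFU, here is a rough sketch. It assumes the common approximation of about 6 × N FLOPs per token for a dense transformer with N parameters (attention FLOPs ignored), and a dense BF16 peak of roughly 989 TFLOP/s for an H100 SXM. The 8B-parameter model and the 10,000 tokens/s/GPU figure are hypothetical inputs chosen for illustration, not values taken from the linked benchmarks.

```python
# Rough conversion between tokens/s/GPU, model TFLOP/s/GPU, and MFU.
# Assumption: training cost is about 6 * N FLOPs per token (forward + backward),
# which ignores the sequence-length-dependent attention term.

H100_BF16_DENSE_PEAK_TFLOPS = 989.0  # assumption: H100 SXM dense BF16 peak


def model_tflops_per_gpu(tokens_per_sec_per_gpu: float, n_params: float) -> float:
    """Approximate model TFLOP/s/GPU from per-GPU token throughput."""
    flops_per_token = 6.0 * n_params
    return flops_per_token * tokens_per_sec_per_gpu / 1e12


def mfu(model_tflops: float, peak_tflops: float = H100_BF16_DENSE_PEAK_TFLOPS) -> float:
    """Model FLOPs utilization: achieved model FLOP/s divided by hardware peak."""
    return model_tflops / peak_tflops


# Hypothetical example: 8e9 parameters at 10,000 tokens/s/GPU.
achieved = model_tflops_per_gpu(10_000, 8e9)
utilization = mfu(achieved)
print(f"{achieved:.1f} model TFLOP/s/GPU, MFU = {utilization:.1%}")
```

FP8 runs should be compared against the FP8 peak rather than the BF16 one, so a figure near the top of the quoted range does not by itself imply an MFU close to 100%.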