Collective Communication - yuuk1's Digital Garden

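This note gives a systematic, academically oriented account of collective communication in parallel and distributed computing, covering in turn its definition, basic operation, historical evolution, and constituent technologies. The scope is the high-performance computing stack in the broad sense, including both HPC and large-scale AI training; the discussion centers on MPI and extends to GPUs, networks, and middleware.

## **1. Definition and formal framework**

Collective communication is a high-level primitive that applies the same communication operation cooperatively across a set of processes (or threads/PEs). Typical examples are barrier synchronization, in which every member reaches a common synchronization point; broadcast from a single root to all members; reduction from all members to a single root; reduction with the result distributed to all members (Allreduce); data distribution (Scatter) and collection (Gather); and all-to-all exchange (All‑to‑All). In MPI, the scope of a collective is stated explicitly by a communicator, and every member of that communicator is required to call the operation (rooted operations define distinct roles for the root and the non-roots). This "everyone participates" contract abstracts away the message-matching complexity that surfaces in point-to-point communication, but program correctness then depends on the design discipline of consistent call ordering and deadlock avoidance. The MPI‑1.1 specification defines the fundamentals of collective communication and spells out caveats such as the absence of tags and possible synchronization effects. Educational materials and tutorials likewise define a collective as "a cooperative operation in which all processes in a communicator make the same call." ([MPI Forum](https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node64.html?utm_source=chatgpt.com "4.1. Introduction and Overview - Message Passing Interface"), [LLNL HPC Tutorials](https://hpc-tutorials.llnl.gov/mpi/collective_communication_routines/?utm_source=chatgpt.com "Collective Communication Routines - LLNL HPC Tutorials"))

## **2. Semantics, correctness, and numerical aspects**

The semantics of a collective are easiest to organize along four axes: (i) the completion condition of the operation (blocking, nonblocking, or persistent); (ii) the mathematical properties of the operator (associativity and commutativity of the reduction); (iii) the consistency of datatypes and derived datatypes; and (iv) the determinism of the implementation (run-to-run reproducibility in the same environment). MPI reductions, for example, assume associativity for every operator, including user-defined ones created with `MPI_Op_create`, and take a flag that tells the implementation whether the operator is commutative. This opens up a wide choice of algorithms such as trees and pairwise exchange, but because floating-point addition is not associative, the resulting bit pattern can change with the order of evaluation. MPI strongly recommends deterministic implementations, but fixing the value exactly requires a design that constrains the evaluation order or uses reproducible summation (long accumulators or binned accumulators, as in ExBLAS/ReproBLAS). The specification, implementation notes, and research reports all address the associativity assumption, the commutativity flag, and the trade-offs surrounding determinism and numerical reproducibility. ([Netlib](https://netlib.org/utk/papers/mpi-book/node118.html?utm_source=chatgpt.com "User-Defined Operations for Reduce and Scan"), [MPI Forum](https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node107.htm?utm_source=chatgpt.com "User-Defined Reduction Operations - Message Passing Interface"), [MCS](https://www.mcs.anl.gov/papers/P4093-0713_1.pdf?utm_source=chatgpt.com "On the Reproducibility of MPI Reduction Operations"), [NIST](https://www.nist.gov/document/nre-2015-04-iakymchukpdf?utm_source=chatgpt.com "ExBLAS: Reproducible and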
Accurate BLAS Library"), [ACM Digital Library](https://dl.acm.org/doi/fullHtml/10.1145/3389360?utm_source=chatgpt.com "Algorithms for Efficient Reproducible Floating Point Summation"))

## **3. Performance models and the basics of evaluation**

The design and analysis of collectives commonly rely on Hockney's α–β model, which splits communication time into a fixed latency α and a transfer time β·n; on LogP, parameterized by latency L, overhead o, gap g, and process count P; and on LogGP, the extension that handles long messages. These are minimal abstract models that connect an algorithm's complexity estimate to network reality. They underpin estimates such as "α·⌈log₂p⌉ + β·n·⌈log₂p⌉" for the asymptotic time of a binomial-tree broadcast and "2·(p−1)·α + 2·(n/p)·(p−1)·β" for ring Allreduce, where p is the number of processes and n the message size. Recent surveys review the role and limits of α–β and LogP/LogGP and argue for extensions that account for hierarchy, congestion, and heterogeneity. ([SPCL](https://spcl.inf.ethz.ch/Teaching/2019-dphpc/lectures/lecture12-comm-models.pdf?utm_source=chatgpt.com "Communication Models - ETH Z"), [People @ EECS](https://people.eecs.berkeley.edu/~kubitron/cs258/handouts/papers/logp.pdf?utm_source=chatgpt.com "LogP: Towards a Realistic Model of Parallel Computation"), [ACM Digital Library](https://dl.acm.org/doi/epdf/10.1145/215399.215427?utm_source=chatgpt.com "LogGP: Incorporating Long Messages into the LogP Model"))

## **4. Basic operations and algorithm design**

Collective algorithms fall broadly into three groups: "logarithmic-time" algorithms that minimize latency for short messages (binomial tree, recursive doubling/halving, dissemination-style), "ring/pipeline" algorithms that aim for bandwidth optimality on long messages, and "hybrid" algorithms tuned for hierarchy or irregularity. The classic optimization studies for MPICH/Open MPI established the practice of switching by performance regime among Rabenseifner's method, which composes Allreduce from reduce‑scatter + allgather; recursive doubling for Allgather; broadcast as a scatter followed by a ring or recursive-doubling allgather; and pipelining through message splitting and segmentation. Broadcast has many implementations (linear, binary tree, binomial tree, pipeline, and others), and choosing among them according to message length, process count, and switch hierarchy matters. ([CELS](https://web.cels.anl.gov/~thakur/papers/ijhpca-coll.pdf?utm_source=chatgpt.com "Optimization of Collective Communication Operations in MPICH"), [HLRS](https://fs.hlrs.de/projects/rabenseifner/publ/myreduce_iccs2004_2long.pdf?utm_source=chatgpt.com "myreduce_iccs2004_2long.dvi - HLRS"), [hcl.ucd.ie](https://hcl.ucd.ie/system/files/tasus_2014.pdf?utm_source=chatgpt.com "High-Level Topology-Oblivious Optimization of MPI Broadcast Algorithms ..."), [ScienceDirect](https://www.sciencedirect.com/science/article/pii/S1569190X15000465?utm_source=chatgpt.com "Topology-oblivious optimization of MPI broadcast algorithms on extreme ..."))

## **5. Historical evolution: MPI and the path of standardization**

After PVM and similar systems in the early 1990s, MPI‑1 (1994) standardized the template for today's collective API. MPI‑2 (1997) added dynamic processes, MPI‑IO, and extended collectives. MPI‑3 (2012) introduced nonblocking collectives (`MPI_I*`) and neighborhood collectives, greatly increasing the ability to overlap communication with computation and to express topology-oriented patterns. MPI‑4.0 (2021) and 4.1 added persistent collectives (`MPI_*_init`), the Sessions model, and partitioned communication, which reduce the setup cost of repeated communication, let libraries and modules initialize independently, and allow large transfers to be assembled at fine granularity. ([MPI Forum](https://www.mpi-forum.org/docs/mpi-2.0/mpi2-report.pdf?utm_source=chatgpt.com "MPI-2: Extensions to the Message-Passing Interface"), [Bill Gropp's Home Page](https://wgropp.cs.illinois.edu/courses/cs598-s15/lectures/lecture37.pdf?utm_source=chatgpt.com "lecture37.pptx - University of Illinois Urbana-Champaign"), [hpc-forge.cineca.it](https://hpc-forge.cineca.it/files/ScuolaCalcoloParallelo_WebDAV/public/anno-2016/12_Advanced_School/MPI3.pdf?utm_source=chatgpt.com "Introduction to the features of the MPI-3 standard"), [cvw.cac.cornell.edu](https://cvw.cac.cornell.edu/mpiadvtopics/process-topologies/neighborhood-collectives?utm_source=chatgpt.com "Cornell Virtual Workshop > MPI Advanced Topics > Process Topologies ..."))

## **6. Network hardware and “in-network” collectives**

On the architecture side, dedicated collective networks and in-switch offload have a long history. IBM Blue Gene systems (Blue Gene/L and /P) had, separate from the 3D torus, a tree-shaped "collective network" that implemented broadcast and reduction in hardware and achieved low latency at high scale. Cray XC/Aries combines Dragonfly-style high-radix switching with a Collective Engine that handles part of short-message collectives on the NIC side. More recently, Mellanox/NVIDIA SHArP (Scalable Hierarchical Aggregation Protocol) performs reductions such as sum and min/max in the switches, cutting the data movement between hosts. Combined with algorithmic tree and ring structures, these mechanisms have evolved into implementations that select the best path for each hierarchy level and message-size regime. ([Wikipedia](https://en.wikipedia.org/wiki/IBM_Blue_Gene?utm_source=chatgpt.com "IBM Blue Gene - Wikipedia"), [Massachusetts Institute of Technology](https://web.mit.edu/bgl/overview.html?utm_source=chatgpt.com "MIT Blue Gene Home"), [Argonne Leadership Computing Facility](https://www.alcf.anl.gov/files/CrayXCNetwork.pdf?utm_source=chatgpt.com "Cray XC Series Network - Argonne National Laboratory"), [CUG](https://cug.org/proceedings/cug2018_proceedings/includes/files/pap131s2-file1.pdf?utm_source=chatgpt.com "Performance Evaluation of MPI on Cray XC40 Xeon Phi Systems - CUG"), [NVIDIA](https://network.nvidia.com/sites/default/files/related-docs/solutions/hpc/paperieee_copyright.pdf?utm_source=chatgpt.com "Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware ..."))

## **7. Collective communication in the GPU era**

In the multi-GPU, multi-node era, the key is collectives that operate directly on GPU memory (CUDA‑aware / GPUDirect RDMA) and library optimizations that assume NVLink/NVSwitch. NVIDIA NCCL optimizes Allreduce, Allgather, Broadcast, and related operations for GPUs, switches between ring and tree algorithm families, and connects NVLink, PCIe, and the network hierarchically. In AI training, ring Allreduce is bandwidth-optimal, and frameworks such as Horovod brought gradient aggregation for data-parallel SGD to practical performance using MPI/NCCL as back ends. In the cloud, AWS EFA provides OS bypass based on SRD and is integrated with MPI/NCCL through libfabric's EFA provider. ([NVIDIA Developer](https://developer.nvidia.com/nccl?utm_source=chatgpt.com "NVIDIA Collective Communications Library (NCCL)"), [NVIDIA Docs](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html?utm_source=chatgpt.com "Collective Operations — NCCL 2.27.5 documentation"), [arXiv](https://arxiv.org/abs/1802.05799?utm_source=chatgpt.com "Horovod: fast and easy distributed deep learning in TensorFlow"), [AWS Documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html?utm_source=chatgpt.com "Elastic Fabric Adapter for AI/ML and HPC workloads on Amazon EC2"), [OFI Working Group](https://ofiwg.github.io/libfabric/v1.11.1/man/fi_efa.7.html?utm_source=chatgpt.com "Libfabric Programmer's Manual"))

## **8. Middleware, implementation strategy, and tuning**

Implementations decide at run time both which algorithm to use and which transfer path to take. In Open MPI, the coll framework (tuned, han, libnbc, ucc, hcoll, and others) selects the best component from message length, process count, and hardware capability. Nonblocking collectives trace back to libNBC, in which a scheduler manages segmentation and round progression to enable overlap. For hierarchical optimization, the standard recipe speeds up Allreduce/Reduce with three stages: intra-node shared-memory reduction → inter-node communication → intra-node distribution. Techniques that incorporate NUMA-aware shared-memory collectives and hardware multicast/SHArP/HCOLL have become common. As the foundation for this integration, UCX (communication) and UCC (collectives) are being adopted by MPI implementations. ([Open MPI Documentation](https://docs.open-mpi.org/en/v5.0.4/tuning-apps/coll-tuned.html?utm_source=chatgpt.com "11.10.
Tuning Collectives — Open MPI 5.0.4 documentation"), [SPCL](https://spcl.ethz.ch/Publications/.pdf/hoefler-libnbc-design.pdf?utm_source=chatgpt.com "Design, Implementation, and Usage of LibNBC - ETH Z"), [Torsten Hoefler](https://htor.inf.ethz.ch/publications/img/hoefler-sc07.pdf?utm_source=chatgpt.com "Implementation and Performance Analysis of Non-Blocking Collective ..."), [Intel](https://www.intel.com/content/dam/www/public/us/en/ai/documents/Framework-for-Scalable-Intra-Node-Collective-Operations-using-Shared-Memory.pdf?utm_source=chatgpt.com "Framework for Scalable Intra-Node Collective Operations using Shared Memory"), [Snir](https://snir.cs.illinois.edu/listed/C85.pdf?utm_source=chatgpt.com "NUMA-Aware Shared-Memory Collective Communication for MPI"), [NVIDIA Docs](https://docs.nvidia.com/networking/display/hpcxv215/hcoll?utm_source=chatgpt.com "HCOLL - NVIDIA Docs"))

## **9. Basic operation in detail: synchronization and progress, pipelining and hierarchy**

A blocking collective guarantees "local completion" (the send buffer may be reused) on return, while actual progress usually depends on library polling or a progress thread. MPI‑3 nonblocking collectives (`MPI_I*`) let the program state the overlap of computation and communication explicitly, and MPI‑4 persistent collectives pay the argument-binding cost once up front to lighten repeated communication. For large messages, segmentation and pipelining are essential, and the classic contrast appears: binomial-tree algorithms win on the α term, ring algorithms on the β term. On real machines, hybrids pay off further: shared-memory reduction within a node, RDMA or in-network reduction between nodes, and combinations of trees and rings across the router hierarchy. ([Bill Gropp's Home Page](https://wgropp.cs.illinois.edu/courses/cs598-s15/lectures/lecture37.pdf?utm_source=chatgpt.com "lecture37.pptx - University of Illinois Urbana-Champaign"), [MPI Forum](https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report/node159.htm?utm_source=chatgpt.com "Persistent Collective Operations - mpi-forum.org"))

## **10. Neighborhood collectives and topology adaptation**

The neighborhood collectives introduced in MPI‑3 are an API for running Allgather/Alltoall-style operations over the "neighbors" of a Cartesian grid or a general graph. In applications dominated by local communication, such as finite-difference and lattice Boltzmann methods, they avoid redundant global communication. They are used together with the process topology API (Cartesian/Graph), and implementations combine naturally with topology-aware and hierarchy-aware optimization (for example NUMA → rack → pod). ([hpc-forge.cineca.it](https://hpc-forge.cineca.it/files/ScuolaCalcoloParallelo_WebDAV/public/anno-2016/12_Advanced_School/MPI3.pdf?utm_source=chatgpt.com "Introduction to the features of the MPI-3 standard"), [cvw.cac.cornell.edu](https://cvw.cac.cornell.edu/mpiadvtopics/process-topologies/neighborhood-collectives?utm_source=chatgpt.com "Cornell Virtual Workshop > MPI Advanced Topics > Process Topologies ..."))

## **11. Failure handling, recovery, and collective communication**

In peta- to exascale environments, failures readily break the "everyone participates" premise of collectives. Traditionally MPI treated a failure as fatal: under the default error handler `MPI_ERRORS_ARE_FATAL`, the job is aborted much as if `MPI_Abort` had been called. Research has since advanced on ULFM (User‑Level Failure Mitigation), which lets the user drive continued execution, and on Reinit, which reinitializes the MPI layer to make restart efficient. ULFM provides APIs for failure notification, agreement, and communicator reconstruction, while Reinit targets faster global restart. Recent implementations and evaluations report cases that recover faster than checkpoint/restart. The MPI‑4 Sessions model widens the design space for fault isolation by isolating communication. ([Pavan Balaji](https://pavanbalaji.github.io/pubs/2015/ccgrid/ccgrid15.ulfm.pdf?utm_source=chatgpt.com "Lessons Learned Implementing User-Level Failure Mitigation in MPICH"), [Innovative Computing Laboratory](https://icl.utk.edu/files/publications/2020/icl-utk-1475-2020.pdf?utm_source=chatgpt.com "Fault tolerance of MPI applications in exascale systems: The ULFM solution"), [arXiv](https://arxiv.org/abs/2102.06896?utm_source=chatgpt.com "[2102.06896] Reinit++: Evaluating the Performance of Global-Restart ..."))

## **12. The reinvention of collective communication in large-scale AI training**

In data-parallel SGD, ring Allreduce became the de facto standard because it is bandwidth-optimal and easy to implement, and Horovod and similar frameworks adopted MPI/NCCL as back ends to demonstrate near-linear scaling to tens or hundreds of GPUs. Ring Allreduce is a two-phase method that fuses reduce‑scatter and allgather; because each rank's data circulates around the ring in a pipeline, it readily saturates the bottleneck link. Latency-minimizing tree algorithms and improved algorithms that tolerate skewed process arrival patterns (PAP) have also been proposed, adding robustness on mixed clusters and heterogeneous nodes. ([arXiv](https://arxiv.org/abs/1802.05799?utm_source=chatgpt.com "Horovod: fast and easy distributed deep learning in TensorFlow"), [Andrew's Notes](https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/?utm_source=chatgpt.com "Bringing HPC Techniques to Deep Learning - Andrew Gibiansky"))

## **13. Constituent technologies (network, OS, runtime, numerics)**

The elements that support collectives can be broken down by layer. At the physical layer, InfiniBand, RoCE, Slingshot, EFA, and similar fabrics provide RDMA, OS bypass, and hardware reduction; EFA exposes SRD-based user-space datagrams through libfabric's `fi_efa`. At the GPU layer, NCCL exploits NVLink/NVSwitch and, when crossing PCIe, relies on pinned memory, registration caches, and GPUDirect RDMA. At the MPI implementation layer, components such as UCX (communication), UCC (collectives), HCOLL (hierarchical collectives), and libNBC (asynchronous scheduler) are selected dynamically, and intra-node shared-memory reduction and NUMA-aware optimization shorten the critical path of Allreduce/Reduce. At the numerical layer, reproducible reductions (ExBLAS/ReproBLAS) suppress the order dependence of parallel reductions, which makes runs easier to debug and verify. ([OFI Working Group](https://ofiwg.github.io/libfabric/v1.11.1/man/fi_efa.7.html?utm_source=chatgpt.com "Libfabric Programmer's Manual"), [AWS Documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html?utm_source=chatgpt.com "Elastic Fabric Adapter for AI/ML and HPC workloads on Amazon EC2"), [NVIDIA Developer](https://developer.nvidia.com/nccl?utm_source=chatgpt.com "NVIDIA Collective Communications Library (NCCL)"), [Open MPI Documentation](https://docs.open-mpi.org/en/v5.0.4/tuning-apps/coll-tuned.html?utm_source=chatgpt.com "11.10.
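Tuning Collectives — Open MPI 5.0.4 documentation"), [Intel](https://www.intel.com/content/dam/www/public/us/en/ai/documents/Framework-for-Scalable-Intra-Node-Collective-Operations-using-Shared-Memory.pdf?utm_source=chatgpt.com "Framework for Scalable Intra-Node Collective Operations using Shared Memory"), [NIST](https://www.nist.gov/document/nre-2015-04-iakymchukpdf?utm_source=chatgpt.com "ExBLAS: Reproducible and Accurate BLAS Library"))

## **14. Principles of design patterns**

Five operational design principles stand out: first, choose algorithms by separating the "α-dominated regime vs β-dominated regime"; second, apply "hierarchy alignment (intra‑node → inter‑node → intra‑node)"; third, "saturate links through segmentation and pipelining"; fourth, "level out overhead with asynchronous progress and persistence"; fifth, "cut unnecessary communication through topology awareness and neighborhood operations." These are made concrete by predictions from α–β/LogP/LogGP together with auto-tuning on the real machine (rule-based or profile-driven mechanisms inside the implementation). MPI‑3 nonblocking collectives and MPI‑4 persistent collectives are important tools for expressing these principles at the API level. ([SPCL](https://spcl.inf.ethz.ch/Teaching/2019-dphpc/lectures/lecture12-comm-models.pdf?utm_source=chatgpt.com "Communication Models - ETH Z"), [Bill Gropp's Home Page](https://wgropp.cs.illinois.edu/courses/cs598-s15/lectures/lecture37.pdf?utm_source=chatgpt.com "lecture37.pptx - University of Illinois Urbana-Champaign"), [MPI Forum](https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report/node159.htm?utm_source=chatgpt.com "Persistent Collective Operations - mpi-forum.org"))

## **15. Positioning and differences from “group communication” in other fields**

Group communication in the distributed-systems field (Virtual Synchrony, ISIS, Spread, and others) focuses on ordering guarantees and on maintaining consistency under membership changes; safety and consistency under failure are the primary requirements. HPC collectives, in contrast, target static membership and highly efficient movement and reduction of numerical data, with the MPI runtime responsible for ordering and membership. The two share the trait of "multi-party communication addressed to a group," but their design goals and semantics are fundamentally different. ([LASS](https://lass.cs.umass.edu/~shenoy/courses/spring08/readings/birman.pdf?utm_source=chatgpt.com "EXPLOITING VIRTUAL SYNCHRONY IN DISTRIBUTED SYSTEMS - UMass"), [spread.org](https://www.spread.org/?utm_source=chatgpt.com "The Spread toolkit"))

## **16. Summary and outlook**

In the abstract, a collective is "an algebraic operation over a set," but in practice it is the product of multi-layer cooperation among network, GPU, OS, NIC, switch, and runtime. Standardization has moved from asynchrony in MPI‑3 to persistence and sessions in MPI‑4, while implementations have absorbed hierarchy, hardware offload, and cloud OS bypass, maturing in both HPC and AI toward "overlapping communication with computation." The practical reach of numerical reproducibility and fault tolerance is growing through ULFM/Reinit and reproducible reductions, and the keys going forward will likely be generalized in-switch computation, automatic hierarchy detection, and the fusion of topology-specific optimization with learned tuning. ([MPI Forum](https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report/node159.htm?utm_source=chatgpt.com "Persistent Collective Operations - mpi-forum.org"), [Pavan Balaji](https://pavanbalaji.github.io/pubs/2015/ccgrid/ccgrid15.ulfm.pdf?utm_source=chatgpt.com "Lessons Learned Implementing User-Level Failure Mitigation in MPICH"), [arXiv](https://arxiv.org/abs/2102.06896?utm_source=chatgpt.com "[2102.06896] Reinit++: Evaluating the Performance of Global-Restart ..."))

---

**References (principal sources)** MPI specifications and teaching materials: the relevant sections of MPI‑1.1/2.0/3/4 and course materials from LLNL, Cornell, Illinois, and others. Algorithm optimization: the MPICH collective optimizations by Thakur et al. and Rabenseifner's Allreduce (reduce‑scatter + allgather). Performance models: α–β and LogP/LogGP. Hardware: the Blue Gene collective network, the Cray Aries Collective Engine, and NVIDIA SHArP. GPU collectives: NCCL and Horovod. Middleware: the Open MPI coll framework, libNBC, HCOLL, and EFA/libfabric. Numerical reproducibility: ExBLAS/ReproBLAS. Sources are cited point by point in the text (duplicate listings are omitted). ([MPI Forum](https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node64.html?utm_source=chatgpt.com "4.1.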
Introduction and Overview - Message Passing Interface"), [LLNL HPC Tutorials](https://hpc-tutorials.llnl.gov/mpi/collective_communication_routines/?utm_source=chatgpt.com "Collective Communication Routines - LLNL HPC Tutorials"), [Bill Gropp's Home Page](https://wgropp.cs.illinois.edu/courses/cs598-s15/lectures/lecture37.pdf?utm_source=chatgpt.com "lecture37.pptx - University of Illinois Urbana-Champaign"), [hpc-forge.cineca.it](https://hpc-forge.cineca.it/files/ScuolaCalcoloParallelo_WebDAV/public/anno-2016/12_Advanced_School/MPI3.pdf?utm_source=chatgpt.com "Introduction to the features of the MPI-3 standard"), [cvw.cac.cornell.edu](https://cvw.cac.cornell.edu/mpiadvtopics/process-topologies/neighborhood-collectives?utm_source=chatgpt.com "Cornell Virtual Workshop > MPI Advanced Topics > Process Topologies ..."), [CELS](https://web.cels.anl.gov/~thakur/papers/ijhpca-coll.pdf?utm_source=chatgpt.com "Optimization of Collective Communication Operations in MPICH"), [HLRS](https://fs.hlrs.de/projects/rabenseifner/publ/myreduce_iccs2004_2long.pdf?utm_source=chatgpt.com "myreduce_iccs2004_2long.dvi - HLRS"), [SPCL](https://spcl.inf.ethz.ch/Teaching/2019-dphpc/lectures/lecture12-comm-models.pdf?utm_source=chatgpt.com "Communication Models - ETH Z"), [People @ EECS](https://people.eecs.berkeley.edu/~kubitron/cs258/handouts/papers/logp.pdf?utm_source=chatgpt.com "LogP: Towards a Realistic Model of Parallel Computation"), [ACM Digital Library](https://dl.acm.org/doi/epdf/10.1145/215399.215427?utm_source=chatgpt.com "LogGP: Incorporating Long Messages into the LogP Model"), [Wikipedia](https://en.wikipedia.org/wiki/IBM_Blue_Gene?utm_source=chatgpt.com "IBM Blue Gene - Wikipedia"), [Argonne Leadership Computing Facility](https://www.alcf.anl.gov/files/CrayXCNetwork.pdf?utm_source=chatgpt.com "Cray XC Series Network - Argonne National Laboratory"), [NVIDIA](https://network.nvidia.com/sites/default/files/related-docs/solutions/hpc/paperieee_copyright.pdf?utm_source=chatgpt.com 
"Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware ..."), [NVIDIA Developer](https://developer.nvidia.com/nccl?utm_source=chatgpt.com "NVIDIA Collective Communications Library (NCCL)"), [arXiv](https://arxiv.org/abs/1802.05799?utm_source=chatgpt.com "Horovod: fast and easy distributed deep learning in TensorFlow"), [Open MPI Documentation](https://docs.open-mpi.org/en/v5.0.4/tuning-apps/coll-tuned.html?utm_source=chatgpt.com "11.10. Tuning Collectives — Open MPI 5.0.4 documentation"), [SPCL](https://spcl.ethz.ch/Publications/.pdf/hoefler-libnbc-design.pdf?utm_source=chatgpt.com "Design, Implementation, and Usage of LibNBC - ETH Z"), [Intel](https://www.intel.com/content/dam/www/public/us/en/ai/documents/Framework-for-Scalable-Intra-Node-Collective-Operations-using-Shared-Memory.pdf?utm_source=chatgpt.com "Framework for Scalable Intra-Node Collective Operations using Shared Memory"), [NIST](https://www.nist.gov/document/nre-2015-04-iakymchukpdf?utm_source=chatgpt.com "ExBLAS: Reproducible and Accurate BLAS Library"))