The Pile - yuuk1's Digital Garden

# The Pile

An 825 GB corpus of diverse English text, widely used for LLM pre-training. (Source: [[@2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]])

- To build a realistic workload, the GPUPerf paper processes The Pile with the [[GPT-NeoX]]-20B tokenizer (vocabulary size 50,257) and uses the result as its training and evaluation corpus.

## Related

- Source: [[@2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]]
- Related implementation: [[GPT-NeoX]]
- Concept: [[LLM分散学習]] (distributed LLM training)