fold-k4-from-2026-06-18-to-2026-06-19-n16

Level-4 fold of 16 log entries spanning 2026-06-18 to 2026-06-19. Dominant themes: the architectural evolution of the FlashAttention series over time, KV-cache-centric LLM inference optimization, and microservice failure diagnosis.

## Child Entries

| Date | Op | Title | Page | Summary (extractive) |
|---|---|---|---|---|
| 2026-06-19 | batch-ingest | Masahiko Inami, "Science, AI, and the Loop" essay trilogy | [[@2026__note.com__科学の終焉と、新しい科学の始まり]] | Generated 3 sources / 10 entities / 13 concepts in one batch. Develops the question "where do humans go once they are out of the loop?" through cybernetics and Human-out-of-the-loop. |
| 2026-06-19 | ingest-slides | Latency SLOs Done Right | [[@2019__SREcon19 EMEA__Latency SLOs Done Right]] | A latency SLO is not a percentile time series; it is a matter of counting the fraction of requests within a threshold over the set of all events in the period. |
| 2026-06-19 | ingest-paper | Failure Diagnosis in Microservice Systems | [[@2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]] | Systematizes, across 98 papers, the multimodal progression result fusion → model fusion → feature fusion. Names LLM + knowledge graph integration as the next direction. |
| 2026-06-19 | ingest-paper | Graph-based Incident Aggregation | [[@2021__ASE__Graph-based Incident Aggregation for Large-Scale Online Service Systems]] | GRLIA formalizes how to group textually dissimilar incidents and how to fill in silent nodes using KPI trends. |
| 2026-06-18 | ingest-slides | Measurement and observability of LLM training performance on AI supercomputers | [[@2025__SpeakerDeck__AIスーパーコンピュータにおけるLLM学習処理性能の計測と可観測性]] | The difficulty of AI supercomputer observability stems not only from resource instrumentation but also from the cloud provider's responsibility boundary. The open problem is carrying meaning from OTel + Grafana telemetry back to training-process spans. |
| 2026-06-18 | ingest-article | Introducing Contextual Retrieval | [[@2024__Anthropic Engineering Blog__Introducing Contextual Retrieval]] | The cumulative effect of Contextual Embeddings + BM25 + reranking cuts the retrieval failure rate by 67%. Baseline failure rate is 5.7% (a secondary source reports 5.0%; the primary source is taken as authoritative). |
| 2026-06-18 | ingest-slides | Scaling KV Caches for LLMs | [[@2025__PyTorchConference__Scaling KV Caches for LLMs - How LMCache + NIXL Handle Network and Storage Heterogeneity]] | KV cache optimization expands from in-GPU granularity to a design problem that includes the transfer control plane across DRAM/VRAM/BLK/FILE/OBJ. |
| 2026-06-18 | ingest-paper | FlashAttention | [[@2022__arXiv__FlashAttention - Fast and Memory-Efficient Exact Attention with IO-Awareness]] | Tiling + online softmax + recomputation give exact attention with IO complexity O(N²d²M⁻¹) and a 2-4x speedup. |
| 2026-06-18 | ingest-paper | FlashAttention-2 | [[@2023__arXiv__FlashAttention-2 - Faster Attention with Better Parallelism and Work Partitioning]] | Reducing non-MMA FLOPs and parallelizing over sequence length raises A100 utilization to 50-73% and reaches 225 TFLOP/s. |
| 2026-06-18 | ingest-paper | FlashAttention-3 | [[@2024__arXiv__FlashAttention-3 - Fast and Accurate Attention with Asynchrony and Low-precision]] | Hopper warp specialization + FP8 block quantization achieve 740 TFLOP/s (75% utilization) and a 2.6x reduction in numerical error. |
| 2026-06-18 | ingest-paper | FlashAttention-4 | [[@2026__arXiv__FlashAttention-4 - Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling]] | Identifies the asymmetric scaling problem on Blackwell and reaches 1613 TFLOP/s (71%) with a software-emulated exponential function + TMEM + CuTe-DSL. |
| 2026-06-18 | ingest-paper | AIBrix | [[@2025__arXiv__AIBrix - Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure]] | Hosts inference engines in a vendor-neutral way on a Kubernetes + Ray hybrid. A distributed KV cache yields 50% higher throughput and 70% lower latency. |
| 2026-06-18 | ingest-paper | GPT-4 Technical Report | [[@2023__arXiv__GPT-4 Technical Report]] | Quantifies predictable scaling (performance predicted in advance from 1/1,000 to 1/10,000 of the compute) and RLHF calibration degradation (ECE 0.007 → 0.074). |
| 2026-06-18 | wiki-query | TSFM, TS-MLLM, and Toto-Qwen3-VL comparison | [[TSFM-TSMLLM-TotoQwen3VL-比較と基礎]] | Organizes the two-stage TSFM → TS-MLLM stack structure. Toto-Qwen3-VL is positioned as a worked example. |
| 2026-06-18 | ingest-paper | Five KV cache and GPU cluster papers | [[@2026__arXiv__KVCache Cache in the Wild - Characterizing and Optimizing KVCache Cache at a Large Cloud Provider]] | Production KV cache hit rate is 54-62%. Non-prefix selective recomputation develops from CacheBlend to KVShare. SCBench confirms that sub-O(n) methods break down in multi-turn settings. |
| 2026-06-18 | enrich-source | Enrichment of From Attention to Disaggregation | [[@2025__arXiv__From Attention to Disaggregation - Tracing the Evolution of LLM Inference]] | A 22-page rescan added 6 optimization tables and a comparison of 3 prefill-decode (PD) disaggregation archetypes. The CAP interpretation is treated as design vocabulary. |

## Key Outcomes

- Ingested four generations of FlashAttention (v1-v4) in chronological order and structured the overall picture of GPU attention optimization in the wiki: IO-aware tiling (v1, 2-4x) → 225 TFLOP/s on A100 (v2) → 740 TFLOP/s on Hopper (v3) → 1613 TFLOP/s on Blackwell (v4) (2026-06-18, four FlashAttention entries)
- Production KV cache hit rates stay at 54-62%, against more than 80% on synthetic benchmarks, which quantifies the need for workload-aware eviction (2026-06-18, five KV cache papers entry)
- For RAG, stacking Contextual Embeddings + BM25 + reranking cuts the retrieval failure rate by 67%. The baseline failure rate is 5.7% according to the primary source (2026-06-18, Contextual Retrieval entry)
- Systematized microservice failure diagnosis across 98 papers, making explicit the multimodal progression result fusion → model fusion → feature fusion and the direction of LLM + knowledge graph integration (2026-06-19, Failure Diagnosis entry)
- Generated 3 sources / 10 entities / 13 concepts in one batch from the Masahiko Inami trilogy and introduced cybernetics and Human-out-of-the-loop into the wiki as a theoretical foundation (2026-06-19, batch-ingest entry)
- Quantified GPT-4's predictable scaling (performance predicted from 1/1,000 to 1/10,000 of the compute) and RLHF calibration degradation (ECE 0.007 → 0.074) (2026-06-18, GPT-4 entry)

## Cross-entry Themes

- **Architectural evolution of the FlashAttention series over time**: The GPU hardware bottleneck shifts from HBM bandwidth (v1) → non-MMA FLOPs (v2) → warp specialization (v3) → asymmetric scaling (v4), and algorithm and kernel design follow it (supported by: 2026-06-18 FlashAttention v1, v2, v3, v4 entries)
- **KV-cache-centric LLM inference optimization**: The management scope expands from in-GPU paging → tiered storage → transfer control plane → cloud-native control plane. Production workload characteristics and PD disaggregation architectures shape the design space from several angles (supported by: 2026-06-18 LMCache+NIXL, AIBrix, five KV cache papers, From Attention to Disaggregation entries)
- **Multimodal microservice failure diagnosis**: The progression laid out by the 98-paper survey and GRLIA's graph-based incident aggregation together fill in the map of failure diagnosis methods (supported by: 2026-06-19 Failure Diagnosis, Graph-based Incident Aggregation entries)

## Contradictions or Corrections

- Contextual Retrieval entry: the discrepancy in baseline failure rate between the primary source (5.7%) and a secondary source (5.0%) has already been reported. The entry itself states that the primary source is taken as authoritative.

## Child Pages

- [[@2026__note.com__科学の終焉と、新しい科学の始まり]]
- [[@2026__note.com__Out of the Blue]]
- [[@2026__note.com__ループのボトルネックは、人間だ]]
- [[@2019__SREcon19 EMEA__Latency SLOs Done Right]]
- [[@2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- [[@2021__ASE__Graph-based Incident Aggregation for Large-Scale Online Service Systems]]
- [[@2025__SpeakerDeck__AIスーパーコンピュータにおけるLLM学習処理性能の計測と可観測性]]
- [[@2024__Anthropic Engineering Blog__Introducing Contextual Retrieval]]
- [[@2025__PyTorchConference__Scaling KV Caches for LLMs - How LMCache + NIXL Handle Network and Storage Heterogeneity]]
- [[@2022__arXiv__FlashAttention - Fast and Memory-Efficient Exact Attention with IO-Awareness]]
- [[@2023__arXiv__FlashAttention-2 - Faster Attention with Better Parallelism and Work Partitioning]]
- [[@2024__arXiv__FlashAttention-3 - Fast and Accurate Attention with Asynchrony and Low-precision]]
- [[@2026__arXiv__FlashAttention-4 - Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling]]
- [[@2025__arXiv__AIBrix - Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure]]
- [[@2023__arXiv__GPT-4 Technical Report]]
- [[TSFM-TSMLLM-TotoQwen3VL-比較と基礎]]
- [[@2026__arXiv__KVCache Cache in the Wild - Characterizing and Optimizing KVCache Cache at a Large Cloud Provider]]
- [[@2022__NSDI__MLaaS in the Wild - Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters]]
- [[@2025__EuroSys__CacheBlend - Fast Large Language Model Serving for RAG with Cached Knowledge Fusion]]
- [[@2025__arXiv__KVShare - An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse]]
- [[@2025__ICLR__SCBench - A KV Cache-Centric Analysis of Long-Context Methods]]
- [[@2025__arXiv__From Attention to Disaggregation - Tracing the Evolution of LLM Inference]]
- [[GRLIA]]
- [[Daniel Ford]]
- [[Moein Khazraee]]
- [[Tri Dao]]
- [[Jay Shah]]
- [[NVIDIA Dynamo]]
- [[AIBrix]]

## Related

- [[DragonScale Memory]] - fold-operator spec
- [[log]] - source entries
- [[index]] - vault catalog
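
As a small illustration of the Latency SLOs Done Right summary in the Child Entries table, here is a minimal Python sketch. It contrasts counting the fraction of requests within a threshold over the whole period with reading a per-window percentile series. The sample latencies, the 100 ms threshold, the 90% target, and the function names are all assumptions made for illustration; none of them come from the talk.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Fraction of requests that finished within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in the evaluation period")
    good = sum(1 for x in latencies_ms if x <= threshold_ms)
    return good / len(latencies_ms)


def nearest_rank_percentile(latencies_ms, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(latencies_ms)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical traffic: one quiet window and one busy window (latencies in ms).
quiet = [50, 60, 70, 80]            # 4 requests, all fast
busy = [90] * 85 + [400] * 15       # 100 requests, 15 of them slow
windows = [quiet, busy]

# Percentile time series: one p90 value per window, with no request counts attached.
window_p90 = [nearest_rank_percentile(w, 90) for w in windows]

# Event-counting view: pool every request in the period, then count the good ones.
all_requests = [x for w in windows for x in w]
period_fraction = fraction_within(all_requests, threshold_ms=100)

target = 0.90                       # assumed SLO target
slo_met = period_fraction >= target

print("per-window p90 (ms):", window_p90)
print(f"fraction within 100 ms over the period: {period_fraction:.4f}")
print("SLO met:", slo_met)
```

The per-window series gives one number per window and drops the traffic volume behind each point, so it cannot be combined into a period-wide answer. The pooled count weights every request equally and yields a single attainment figure to compare against the target.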