2026__MLSys2026__Practical Adversarial Multi-Armed Bandits with Sublinear Runtime

# Practical Adversarial Multi-Armed Bandits with Sublinear Runtime > [!info] Talk metadata > - **会議:** [[MLSys2026]] Day 3 (May 20 / Wed)、Research Track Oral "LLM Serving 3 & ML for Systems" セッション（13:30 開始。個別の発表時刻は資料から特定できないため省略扱い） > - **登壇者:** Kasper Overgaard Mortensen（Aarhus University, Department of Computer Science、theoretical computer science 専門） > - **共著者:** Kasper Overgaard Mortensen・Ama Bembua Bainson・Mathias Ravn Tversted・Kristoffer Strube Græm・Andrea Paudice・Davide Mottin・Panagiotis Karras（Aarhus University）、Renata Borovica-Gajic（University of Melbourne）。Karras は University of Copenhagen も所属（論文 footnote）。 > - **フレームワーク:** QBL（**Q**ueuing **B**ehind the **L**eader）。combinatorial 版 QBL.M、単一選択用 Exp3 Fast / Exp3Light Fast。 > - **関連研究:** コード・データセット `https://github.com/AU-DIS/QBL` > [!abstract] 概要（論文 PDF アブストラクト・忠実日本語訳） > 我々は、loss 系列のシフトにより最適 arm の identity が時間とともに変化しうる**非定常な敵対的環境（nonstationary adversarial environments）**における Multi-Armed Bandit 問題を研究する。データベースシステムにおける physical design tuning のような応用に動機づけられ、**arm 数 k が非常に大きい**設定に焦点を当て、k に対して sublinear な runtime を持つ実用的アルゴリズムを求める。主要な貢献は新規アルゴリズム **Queuing Behind the Leader（QBL）** であり、各反復の計算量 $\mathcal{O}(m \log k)$ を達成する（m は各ステップで選択される arm 数）。QBL は priority queue を介した限定的な更新操作・定数時間のサンプリングオーバーヘッド・balanced な探索戦略を組み合わせる。最新ベンチマークで QBL を広範に評価し、time と quality の双方で既存手法を一貫して上回ることを示す。 > [!warning] 出典に関する注記 > 正式タイトル・著者・所属・フレームワーク名は論文 PDF およびスライド表紙を正とした。文字起こしは自動生成で固有名詞が崩れており（"Casper"→Kasper、"Argos University"→Aarhus University、"Xfree/XP"→Exp3、"Cube and the leader/Cuba"→QBL = Queuing Behind the Leader、"lock time"→log time）、narrative 補完にのみ用いた。**このトークの Q&A はセッション録音終了のため文字起こしに含まれない**（Q&A セクションは省略）。 ## 問題設定: 大規模・非定常で古典 bandit が破綻する - Multi-Armed Bandit（MAB）は逐次的意思決定の基本モデル。各ステップで k 本の arm の中から選び、reward を観測し、estimate を更新する（スライド「Classical MABs - Simple online learning models」: Pick / Observe / Update の 3 ステップ）。 - 本研究は **combinatorial adversarial MAB の非定常版**を扱う。各ラウンドで learner は ground set $[k]$ から **m 本の部分集合** $S_t$（$|S_t|=m$）を選び、adversary が報酬ベクトル $r_t \in [0,1]^k$ を選ぶ。性能は**動的リグレット（dynamic regret）**で評価（論文式(1)、各ラウンドで瞬時報酬を最大化する size-m 部分集合を比較対象とする）。$m=1$ で古典的な非定常 adversarial MAB に帰着。 - **動機は通常と逆**: 気にするのは結果の品質（reward）よりも**アルゴリズムの runtime**。古典 MAB は全 arm 上の確率分布を維持する必要があり、arm が非常に多い大規模設定ではスケールしない（スライド「Problem: Runtime scales with number of arms」: ① k arm 上の進化する確率分布の維持、② 毎反復で複数 weight を perturb する、の 2 つがボトルネック）。 - よくある対処は「設定を変えて問題を回避する」こと（context / oracles / linear rewards / continuous arm space / experts / sketching 等の新仮定の追加。スライド「The common approach: change setting!」）。しかし特化手法も**シンプルな古典 baseline を上回るべき**であり、その比較対象としてスケールする古典 bandit が要る（スライド「Complexity must add value.」: "This should beat that."）。 - 「そもそも k-armed bandit が使えるのか」という懸念も存在（スライド「Should we care?」に John Langford の引用: real-world k-armed bandit setting を context 化せずに見つけるのは難しい）。だが複雑な改造を正当化するための simple baseline として必要、というのが本研究の立場。 ## 動機の実例: database index tuning が「何もしない」より悪化する - bandit を大規模システム（agentic/system 系）に組み込みたい。例として **database index tuning** に bandit が適用されてきた（**DBABandit**（Perera et al., 2021）、**HMAB**（Perera et al., 2022））。これらは contextual stochastic MAB 手法で、想定仮定の内側では良く動くが extreme case で破綻する。 - 非定常ワークロード下では DBABandit/HMAB が**「No Index（何もしない）」より遅くなる**極端ケースが存在（論文 Figure 1・スライド「System integration」: Total Workload Time Per Round、TPC-H 10GB、index tuning・nonstationary workload。DBABandit/HMAB は 30–45s 付近で振動し、No Index の ~10s を大きく上回る）。シフトのたびに index を作り直すため逆効果になる。 - そこで「こうした extreme case をカバーできる古典 adversarial bandit を baseline にしたい。だがそれらは scale しない」というギャップが本研究の出発点（スライド「A classic adversarial bandit would be a nice baseline… but they don't scale.」）。 ## Exp3 を効率化する: log-time 化・数値安定化・combinatorial 拡張 - **Exp3**（Auer et al., 2003）は古典的 adversarial bandit だが runtime が arm 数 k に**線形**（スライド「Exp3 is such a bandit.」: $\mathcal{O}(k)$ Exp3、combinatorial 版 Exp3.M は $\mathcal{O}(k \log m)$）。 - **log-time 化は「実は 10 年前に解かれていた」が忘れられている**（スライド "Solved 10 years ago." / "Forgotten folklore?"）。論文は、sublinear 化が可能だという指摘は Tim Vieira のブログ（Vieira, 2016「Heaps for incremental computation」）の side-note に留まり、long time-horizon・large k での end-to-end な実用化までは示されていなかった、と整理する。本研究はこのギャップを埋め、Vieira の sum-heap 観察を **stable な log-weight 方式**と組み合わせて per-step $\mathcal{O}(\log k)$ を明示的に構成する。 - **数値安定性**: Exp3 の指数的 weight 更新は long horizon で overflow しうる。Log-Sum-Exp（LSE）trick を毎ラウンド線形パスせずに済むよう、**streaming 版の updatable Sum-Exp**（論文 Algorithm 1 `UpdateSumExp`）として導入し、weight を log scale で維持して定数時間オーバーヘッドで確率を計算する。 - **sum-heap サンプリング**: 部分和木（binary heap、葉に weight、根に総和）を使い、inversion method で uniform random probe から $\mathcal{O}(\log k)$ サンプリング。weight 更新も $\mathcal{O}(\log k)$（スライド「Sum-heap sampling solves the linear overhead」: $p = \mathrm{Uniform}(0,1)\cdot \mathrm{SUM}$、heap の sample/update が $O(\log n)$・init が $O(n)$）。sentinel 値で sampling without replacement を可能にし、combinatorial 設定にも適用。 - **combinatorial 拡張**: heap 強化版 Exp3.M（論文 Algorithm 4 `Exp3.M Heap`）により、Exp3.M の総時間計算量を $\mathcal{O}(k \log m)$ から $\mathcal{O}(m \log k)$ に改善（$m \ll k$ のとき有利）。単一選択ケースには `Exp3 Fast`（論文 Algorithm 6）・`Exp3Light Fast` を導出。 ## 提案フレームワーク QBL: Queuing Behind the Leader - Exp3 系の single-weight 更新は時間効率的だが、動的に更新される確率分布からのサンプリングがオーバーヘッドを生む。一方 **follow-the-leader（FTL）**（Kalai & Vempala, 2005）は全 weight を randomly perturb しつつ常に最良 arm を選ぶことで定数時間サンプリングを得る。**QBL は両者の良いとこ取り**: 各ラウンドで**実際に選んだ arm の weight だけ**を更新する（論文 §4 Overview）。 - QBL の 2 つの鍵: (1) 見かけ上良い選択を**conservative に評価**して reward を測る、(2) 各選択をその reward と独立に順序付ける**動的 priority queue** を維持する。adversarial/非定常設定での **policy overcommitment**（高 reward arm が weight を溜め込み将来選択を歪める selection bias）を防ぐ設計。 - **exploitation 戦略**: 現在の最良選択（**leader**）が過去比で改善し続けている間だけ leader を保持する。各 arm の selection counter $c[a]_t$ と累積 reward $r[a]_t$ から local mean $L_t[a] := r[a]_t / c[a]_t$ を計算し、global weighted average $G_t := R_t / C_t$ と比較。$L_t[a] > G_t$ なら leader を選び続け、探索しない。進捗が止まると leader を **demote（降格）**して他 arm の探索を再開する。 - demotion の決定論性は adversary に悪用されうるため**ランダム化を導入**: $G_t \ge \dfrac{r[a]_t}{c[a]_t}\cdot \mathrm{Uniform}(1-\gamma, 1+\gamma)$（$\gamma \in (0,1]$）で leader を demote するか判定し、leader がどれだけ長く残れるかを難読化する。leader が「良すぎて」global average のほぼ全部を占める場合も demotion のサイン（選びすぎ）。 - **plutocracy as a queue（demotion 実装）**: 通常 weight を維持する代わりに、**priority queue**（≒ cube）で各要素の優先度差に上限を設ける。leader を降格させるとき priority を最低 1 だけ下げ（$p[a]\!-\!1$ と $p[\mathrm{Top}(Q)]\!-\!k\!+\!c[a]$ の min）、**各 leader が高々 k 回の demotion 以内に必ず置き換わる**ことを保証。同じ arm を選び続ける限り update を skip でき、demotion / queue 維持時のみコストがかかる（論文 Algorithm 5 `QBL.M`）。 - **計算量**: QBL.M の 1 ラウンドは $\mathcal{O}(m \log k)$（priority queue を heap 実装、sampling が毎ラウンドのボトルネック）。$m=1$ では全体を $\mathcal{O}(\log k)$ に削減でき、leader が demote されない場合は $\mathcal{O}(1)$（論文 §4 Complexity）。**ユーザ調整パラメータは exploration factor $\gamma$ の 1 つのみ**。QBL は環境 dynamics に agnostic で、定常・非定常の双方で使える。 - QBL は理論的な regret 保証を**まだ持たない（preliminary step）**ことを明言。demotion interval を high probability で bound できるかが open question（論文 Contributions・Future directions）。 ## 実験: index tuning と模擬非定常環境 - **実装**: 全 bandit policy とシミュレーションを C++（`-O3`）で実装。weight は log scale 維持。コード・データは `https://github.com/AU-DIS/QBL`（論文 Appendix B）。index tuning 実験は Ubuntu 24.04.1（Xeon Silver 4316, 2.3GHz, 1TB RAM）、その他は Ubuntu 20.04.4（Core i7-10610U, 1.8GHz, 48GB RAM）。 - **database index tuning（§5.1）**: QBL.M を DBABandit・HMAB と比較（Python 実装フレームワーク上）。QBL.M は task に oblivious だが、公平性のため DBABandit の arm 選択戦略・memory budget（≥95% 充足）へ適応させた。TPC-H 10GB/50GB、各 100 ラウンド・5 query/ラウンドの business-oriented ad-hoc query。 - 結果（論文 Figure 2: Total Workload Time / Index Creation Time）: DBABandit/HMAB は非定常ワークロードでシフトのたびに reset し index を作り直すため struggle。**QBL.M はそうした index 生成を素早く回避し、no-index に近づいて競争優位**を得る。50GB の大規模データでは**約 40 ラウンド後により明確な優位**。 - index utilization（論文 Figure 3、10GB）: index 読み比率の平均が **QBL.M ≈ 60% / DBABandit ≈ 50% / HMAB ≈ 39%**。QBL.M はプラットフォーム非依存に約 10% 高い index 利用率を維持（ただし utilization 改善が必ずしも実行時間改善に直結しない点も注記）。 - **模擬非定常環境での dynamic regret（§5.2）**: 3 つの adversarial 環境生成器（**Mod2**: 偶奇 arm で reward を swap、**Stochastic constrained**: Beta(5,1)/Beta(4,20) で high/low reward 割当、**Tent map**: 確率 0.1 で binary reward が反転）。reward shift は $3, 3^2, 3^3, \dots$ ラウンドで progressively 遅くなる（Zimmert & Seldin, 2021 に倣う）。時間 horizon $T = 10^5$、10 instance 平均。 - single-choice（Figure 4、$k \in \{20, 100, 1000, 10000\}$）: 高 k で Exp3 系の確率推定がスケールせず random 寄りになる中、**QBL が広く良好**。ただし完全な定常環境では QBL は優位を失い、Mod2・$k{=}1000$ のように頻繁シフトでは QBL の cynicism が裏目に出る場面もある（Mod2・$k{=}10000$ では最終的に QBL が最良）。 - combinatorial（Figure 5、$k{=}20/200$、$m$ を変化）: QBL.M / Exp3.M Heap は Mod2・Stochastic constrained で良好。**Tent map のような「1 本だけ良い arm」環境は combinatorial policy には不向き**（残り $m{-}1$ が無意味な arm になる）で random が最善になりうる。 - **runtime（§5.3）**: heap ベース版が原版を上回り、**k が大きいほど差が顕著**（Figure 7: single-choice、$T{=}1000$、QBL は update を複数ラウンド skip できるため fast Exp3 変種より一貫して速く、update 時間はほぼ定数）。combinatorial（Figure 6）では $m/k$ が大きいと sum-heap の利点が薄れ Exp3.M Heap は $m$ 増で遅くなるが、**QBL.M は demote 時のみ queue を更新するため $m{=}k$ でも高速**を保つ。 ## 結論・takeaway - adversarial MAB を**計算の観点**から扱い、arm 数 k に sublinear な runtime を目指した。**QBL**: 選択的 weight 更新＋定数時間サンプリングで $\mathcal{O}(m \log k)$ を達成し、単一パラメータ $\gamma$ で定常・非定常の双方に対応（論文 §6）。 - core idea は「**update を skip できる潜在性**」（leader が良く見える間は維持コストを省く）。先行研究の「could-have-been（選ばれたかもしれない）として arm を分類する」発想にも通じる（文字起こし）。framework であり good/bad の判定基準は差し替え可能で、品質保証が必要なら別の選択もありうる（文字起こし）。 - **regret 保証は未解決のまま残る**が、広範な実験で実用的価値を示し、scalable MAB アルゴリズムと controlled regret への一歩とする（論文 Conclusion）。 > [!note] Q&A について > 本トークの後の Q&A はセッション録音終了のため文字起こしに含まれない。