## Proposals on SLIs and SLOs

- [[2024__SOSE__Diffusing High-level SLO in Microservice Pipelines]]
- [[2024__ApPLIED__DeepSLOs for the Computing Continuum]]
- [[2024__IEEE Internet Computing__On Causality in Distributed Continuum Systems]]
- [[2023__SSE__Towards a Prime Directive of SLOs]]
    - A review of SLO-related papers.
- [[2021__CLOUD__A Novel Middleware for Efficiently Implementing Complex Cloud-Native SLO]]
- [[2021__ICWS__SLO Script - A Novel Language for Implementing Complex Cloud-Native Elasticity-Driven SLOs]]
- [[2020__NSDI__Meaningful Availability|Meaningful Availability]]
    - Describes User Uptime, an SLI introduced in G Suite.
- [[2019__HotOS__Nines are Not Enough Meaningful Metrics for Clouds]]
    - Discusses why defining SLOs is difficult for cloud providers.
    - The authors are co-authors of Chapter 4 of the [[Site Reliability Engineering - Google|srebook]].
- [[2017__HotOS__Thinking about Availability in Large Service Infrastructures]]

## Root-cause diagnosis in incident response

Each of these follows a framework in which an SLO violation triggers a data-analysis algorithm for root-cause diagnosis.

- [[2021__CLOUD__Causal Modeling based Fault Localization in Cloud Systems using Golden Signals]]
- [[2021__ISSRE__Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems]]
- [[2020__WWW__AutoMAP - Diagnose Your Microservice-based Web Application]]
    - A paper from the top-tier web conference formerly known as WWW.
    - Site Reliability Engineers appear in the paper.
- [[2020__NOMS__MicroRCA - Root Cause Localization of Performance Issues in Microservices]]
- [[2020__Applied Science__A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications]]
- [[2018__CCGRID__CloudRanger―Root Cause Identification for Cloud Native Systems]]
- [[2018__ICSOC__Microscope―Pinpoint Performance Issues with Causal Graphs in Micro-service Environments|Microscope]]
- [[2014__INFOCOM__CauseInfer―Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems]]

## Automated control with SLIs as the objective function

- [[2025__arXiv__Tempo - Application-aware LLM Serving with Mixed SLO Requirements]]
- [[2025__CLOUD__SLO-Aware Container Orchestration on Kubernetes Clusters]]
- MSARS: [[2024__arXiv__MSARS - A Meta-Learning and Reinforcement Learning Framework for SLO Resource Allocation and Adaptive Scaling for Microservices]]
- Octopus: [[2024__CLOUD__Intent-Driven Multi-Engine Observability Dataflows for Heterogeneous Geo-Distributed Clouds]]
- [[2024__DSN__When Green Computing Meets Performance and Resilience SLOs]]
- [[2018__ATC__SLAOrchestrator - Reducing the cost of performance SLAs for cloud data analytics]]
    - Targets OLAP systems for data analytics such as Redshift. In a multi-tenant environment where queries are issued, it adjusts the number and size of instances while guaranteeing SLAs such as query execution time.