## Overview

- [[2015__NeurIPS__Hidden Technical Debt in Machine Learning Systems]]
- [ML Ops: Machine Learning Operations](https://ml-ops.org/)
- [[Introduction to MLOps]]

## Terminology

- [[MLOps]]
- <https://cyberagent.ai/blog/research/12898/>
- [[CACE]]
- [[ML for Systems]]

## Reliability

- [[2017__Big Data__The ML Test Score - A Rubric for ML Production Readiness and Technical Debt Reduction]]

## Hyperparameter Optimization

### Optuna

- [[Optuna多目的最適化]]
- [[OptunaのRDBチューニング]]
- [[2019__KDD__Optuna - A Next-generation Hyperparameter Optimization Framework]]

## [[データサイエンスの実験管理|Experiment Management for Data Science]]

### Hyperparameter Management

- [[Hydra]]

## Workflow

- [[Argo Workflows]]
- [[AirFlow]]
- [Metaflow](https://metaflow.org/)
- [[Yahoo!Japan AIPlatformとWorkflow管理 - k8sjp]]

## Books

- [[Reliable Machine Learning - Applying SRE Principles to ML in Production]]

## Papers

- [[2021__A Data Quality-Driven View of MLOps]]

## LLM

- [[大規模言語モデルの事前学習知見を振り返る - Turing tech blog]]
- [[2023__ISSREW__A Survey of Metrics to Enhance Training Dependability in Large Language Models]]
- [[Traceloop]]
- [[OpenLLMetry]]

## Monitoring

- [Symptom-based Alerting for Machine Learning - What I Learned from Monitoring More than 30 Machine Learning Use Cases](https://www.usenix.org/conference/srecon23emea/presentation/weichbrodt)
- [Reliable Data for Large ML Models: Principles and Practices](https://www.usenix.org/conference/srecon23emea/presentation/mcglohon)

## Others

- [[Reading notes] Introducing MLOps](https://zenn.dev/dhirooka/articles/4dec7966a97a16)
- [Improving Machine Learning Development Reliability](https://www.usenix.org/conference/srecon22apac/presentation/hansen)

## Related MOCs

- [[Systems for ML - MOC]]