## Overview
- [[2015__NeurIPS__Hidden Technical Debt in Machine Learning Systems]]
- [ML Ops: Machine Learning Operations](https://ml-ops.org/)
- [[Introduction to MLOps]]
## Terminology
- [[MLOps]]
- <https://cyberagent.ai/blog/research/12898/>
- [[CACE]]
- [[ML for Systems]]
## Reliability
- [[2017__Big Data__The ML Test Score - A Rubric for ML Production Readiness and Technical Debt Reduction]]
## Hyperparameter optimization
### Optuna
- [[Optuna多目的最適化|Optuna multi-objective optimization]]
- [[OptunaのRDBチューニング|Optuna RDB tuning]]
- [[2019__KDD__Optuna - A Next-generation Hyperparameter Optimization Framework]]
## [[データサイエンスの実験管理|Experiment management for data science]]
### Hyperparameter management
- [[Hydra]]
## Workflows
- [[Argo Workflows]]
- [[AirFlow|Airflow]]
- [Metaflow](https://metaflow.org/)
- [[Yahoo!Japan AIPlatformとWorkflow管理 - k8sjp|Yahoo! JAPAN AI Platform and workflow management - k8sjp]]
## Books
- [[Reliable Machine Learning - Applying SRE Principles to ML in Production]]
## Papers
- [[2021__A Data Quality-Driven View of MLOps]]
## LLM
- [[大規模言語モデルの事前学習知見を振り返る - Turing tech blog|Looking back on lessons from pre-training large language models - Turing tech blog]]
- [[2023__ISSREW__A Survey of Metrics to Enhance Training Dependability in Large Language Models]]
- [[Traceloop]]
- [[OpenLLMetry]]
## Monitoring
- [Symptom-based Alerting for Machine Learning - What I Learned from Monitoring More than 30 Machine Learning Use Cases](https://www.usenix.org/conference/srecon23emea/presentation/weichbrodt)
- [Reliable Data for Large ML Models: Principles and Practices](https://www.usenix.org/conference/srecon23emea/presentation/mcglohon)
## Others
- [Reading notes: Introducing MLOps](https://zenn.dev/dhirooka/articles/4dec7966a97a16)
- [Improving Machine Learning Development Reliability](https://www.usenix.org/conference/srecon22apac/presentation/hansen)
## Related MOCs
- [[Systems for ML - MOC]]