Dissertation Abstract - yuuk1's Digital Garden

Cloud computing has revolutionized the delivery of online services, leading to increasingly distributed and complex applications. Ensuring reliability in these systems requires extensive telemetry data, which provides systematic measurements of system behavior. These data, comprising time series, numerical values, strings, and topology information, underpin critical operational functions such as failure management, performance optimization, capacity planning, and security auditing. Telemetry processing consists of three primary layers: instrumentation for data collection, storage for data retention, and mining for data analysis. As applications scale, the volume of telemetry data grows, presenting significant challenges in managing and processing these data efficiently. Existing telemetry systems struggle to scale effectively while maintaining operational simplicity. The increasing data volume exacerbates challenges across all three layers. Instrumentation consumes additional system resources, storage systems face escalating ingestion and retention demands, and the effectiveness of machine learning-based analysis diminishes with expanding datasets. These issues are compounded by the operational burden on system administrators, who must balance application and telemetry system management. Current solutions often require extensive manual intervention, specialized expertise, or complex infrastructure, making large-scale deployment difficult. This dissertation addresses three key challenges in scaling telemetry workloads. First, instrumenting detailed telemetry data for continuous network topology discovery imposes significant resource overhead on target applications. Second, storing vast quantities of time series data strains computational resources and increases storage costs, especially for long retention periods. Third, mining large datasets for root cause analysis becomes increasingly difficult and time-consuming as data volumes grow. 
To overcome these challenges, we propose three novel approaches, each targeting a specific layer of telemetry processing while prioritizing practical operational requirements. At the instrumentation layer, we introduce an efficient network topology discovery method that operates transparently to applications. By bundling related network flows within the operating system kernel, this approach significantly reduces computational overhead without requiring application code modifications, maintaining minimal CPU usage even under high network loads. For the storage layer, we present a hierarchical data-intensive application architecture that balances high-throughput data ingestion with cost-effective long-term storage. This architecture integrates memory-based and disk-based database systems through automated data tiering, achieving higher ingestion rates while simplifying operational management. The solution leverages widely-used database technologies rather than custom implementations, making it practical for deployment in cloud environments. At the mining layer, we develop a feature reduction framework that improves automated fault localization by reducing the number of time series data points to analyze. This framework employs unsupervised learning techniques to identify failure-related patterns in system behavior, eliminating the need for manually labeled training data while maintaining high performance with minimal parameter tuning. We validate these approaches through experiments using real-world implementations and benchmark evaluations. The results demonstrate significant performance improvements over existing solutions while maintaining operational simplicity across all three layers. These contributions lay the foundation for building scalable and maintainable telemetry systems capable of supporting the reliability requirements of modern cloud applications. 
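The storage-layer idea above, writing to a memory tier and automatically moving older data to a disk tier, can be illustrated with a small sketch. This is not the dissertation's implementation: a Python dict stands in for the memory-based database, SQLite stands in for the disk-based one, and the class name `TieredStore`, the `hot_seconds` window, and the age-based tiering rule are all assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict


class TieredStore:
    """Toy two-tier time series store (hypothetical): a dict as the memory tier, SQLite as the disk tier."""

    def __init__(self, db_path=":memory:", hot_seconds=60):
        # hot_seconds is an assumed retention window for the memory tier.
        self.hot_seconds = hot_seconds
        # Hash table keyed by series name -> list of (timestamp, value).
        self.hot = defaultdict(list)
        # Stand-in for the disk-based database; pass a file path to persist on disk.
        self.cold = sqlite3.connect(db_path)
        self.cold.execute(
            "CREATE TABLE IF NOT EXISTS points (series TEXT, ts INTEGER, value REAL)"
        )

    def write(self, series, ts, value):
        # Ingestion only touches the in-memory hash table.
        self.hot[series].append((ts, value))

    def tier(self, now):
        """Move points older than the hot window to the disk tier; return how many moved."""
        cutoff = now - self.hot_seconds
        moved = 0
        for series in list(self.hot):
            old = [(series, ts, v) for ts, v in self.hot[series] if ts < cutoff]
            if not old:
                continue
            self.cold.executemany("INSERT INTO points VALUES (?, ?, ?)", old)
            self.hot[series] = [(ts, v) for ts, v in self.hot[series] if ts >= cutoff]
            if not self.hot[series]:
                del self.hot[series]
            moved += len(old)
        self.cold.commit()
        return moved

    def read(self, series, start, end):
        """Read points with start <= ts <= end from both tiers, ordered by timestamp."""
        rows = self.cold.execute(
            "SELECT ts, value FROM points WHERE series = ? AND ts BETWEEN ? AND ?",
            (series, start, end),
        ).fetchall()
        rows += [(ts, v) for ts, v in self.hot.get(series, []) if start <= ts <= end]
        return sorted(rows)


store = TieredStore(hot_seconds=60)
for ts in (0, 30, 90, 120):
    store.write("cpu.user", ts, ts / 10)
moved = store.tier(now=120)  # cutoff is 60, so the points at ts 0 and 30 move to SQLite
points = store.read("cpu.user", 0, 120)
```

In this sketch, writes never touch SQLite directly, which is the property the tiering is meant to provide. Reads merge both tiers, so callers do not need to know where a point currently lives.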
Cloud computing has transformed how online services are delivered, with applications becoming increasingly distributed and complex. To maintain reliability in these systems, operators rely on various telemetry data, which are systematic measurements of system behavior collected through instrumentation. These telemetry data, comprising time series, numeric, string, and topology information, serve as the foundation for critical operational functions including failure management, performance optimization, capacity planning, and security auditing. The telemetry process encompasses three fundamental layers: instrumentation for data collection, storage for data retention, and mining for data analysis. As applications scale, they generate large volumes of telemetry data across all these layers. However, existing telemetry systems have long struggled to scale telemetry data processing while keeping operational complexity manageable. The growing data volume creates cascading problems across the three layers: instrumentation consumes extra resources in application systems, storage systems face increasing ingestion and retention demands, and the effectiveness of machine learning-based analysis deteriorates as data volumes expand. These problems are further complicated by the need to minimize the operational burden on system operators, who must manage both the applications and their telemetry systems. Existing approaches to scaling telemetry workloads tend to require substantial manual intervention, specialized expertise, or complex custom infrastructure, making them difficult to operate at scale. This dissertation addresses three specific challenges in scaling telemetry workloads. First, instrumenting detailed telemetry data for continuous network topology discovery introduces significant resource overhead in target application systems.
Second, storing vast numbers of time series strains computational resources and increases storage costs, particularly when retention periods exceed one year. Third, mining large datasets to identify root causes of failures becomes increasingly difficult and time-consuming as the volume of time series data grows. To address these challenges, we present three approaches, each targeting a specific layer of telemetry processing while prioritizing practical operational requirements. For the instrumentation layer, we introduce an efficient network topology discovery method that operates transparently to applications. By bundling related network flows within the operating system kernel, our approach significantly reduces computational overhead without requiring application code modifications, maintaining minimal CPU usage even under high network loads. For the storage layer, we present a hierarchical data-intensive application architecture that addresses the conflicting demands of high-throughput data ingestion and cost-effective long-term storage. By combining memory-based and disk-based database systems through automated data tiering, our architecture achieves higher ingestion rates while simplifying operations. This solution builds on widely used database systems rather than custom implementations, making it practical to deploy and operate in cloud environments. For the mining layer, we develop a feature reduction framework that enhances automated fault localization by reducing the large number of time series to analyze. Our framework employs unsupervised learning techniques to identify failure-related patterns in system behavior, eliminating the need for manual training data preparation while maintaining robust performance with minimal parameter tuning. These approaches have been validated through experiments that run our implementations on real machines with benchmark tools.
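The results demonstrate performance improvements over existing approaches while preserving operational simplicity across all three layers. These contributions provide a foundation for building more scalable and maintainable telemetry systems that can effectively support the reliability requirements of modern cloud applications.
With the spread of cloud computing for online services, applications are becoming increasingly distributed and complex. Ensuring the reliability of these systems requires telemetry data, which systematically measure system behavior. Telemetry data include time series, numeric values, strings, and topology information, and they underpin critical operational functions such as failure management, performance optimization, capacity planning, and security auditing. Telemetry processing consists of three primary layers: an instrumentation layer that collects data, a storage layer that retains data, and a mining layer that analyzes data. As applications scale, the volume of telemetry data grows in all of these layers, and managing and processing it becomes a major challenge.
This dissertation proposes three scaling techniques for growing telemetry workloads and discusses their effectiveness; it consists of six chapters. Chapter 1 is the introduction: after an overview of the background, objectives, and related technologies of this research, the challenges in each layer of the telemetry system and ... Chapter 2 ... Chapter 6 concludes by summarizing the research on scaling telemetry workloads in cloud applications and discusses future prospects.
Existing telemetry systems struggle to cope with the challenges of growing scale while maintaining operational simplicity. The increase in data volume causes problems in all three layers: instrumentation consumes the application's system resources, storage systems face a growing burden of data ingestion and retention, and the effectiveness of machine-learning-based analysis declines as data volume grows. These challenges further increase the operational burden on system administrators. Many current solutions require extensive manual intervention, specialized knowledge, or complex infrastructure, which makes large-scale deployment difficult. This dissertation addresses three key challenges in scaling telemetry workloads. First, instrumenting detailed telemetry data for continuous network topology discovery imposes a large resource load on target applications. Second, storing large volumes of time series data strains computing resources, and storage costs grow especially for long retention periods. Third, identifying the root cause of a failure from large datasets becomes increasingly difficult and time-consuming as data volume grows. To solve these challenges, this research proposes three new approaches, one for each layer of telemetry processing. Each approach is designed with emphasis on practical operational requirements. For the instrumentation layer, we propose an efficient network topology discovery method that operates without affecting applications. By bundling related network flows inside the operating system kernel, the method greatly reduces computational overhead without requiring changes to application code. This keeps CPU usage minimal even under high network load. For the storage layer, we propose a hierarchical data-intensive architecture that reconciles high-throughput data ingestion with cost-efficient long-term retention. The architecture integrates memory-based and disk-based database systems and uses automated data tiering to achieve fast ingestion while simplifying operations. Moreover, because it uses widely used database technologies rather than custom implementations, it is easy to run in production cloud environments. For the mining layer, we develop a feature reduction framework that improves automated fault localization while reducing the amount of time series data. The framework uses unsupervised learning to identify anomalous patterns in the system and does not require manually labeled training data. This approach maintains high accuracy while keeping parameter tuning to a minimum. The effectiveness of the proposed methods was verified on real machines and through benchmark evaluations. The results confirmed that the methods achieve both substantial performance improvements and simpler operations compared with existing methods. These contributions provide a foundation for building scalable and maintainable telemetry systems that meet the reliability requirements of modern cloud applications.
---
Improving the reliability of online services delivered on the cloud requires telemetry data that measure system behavior. Processing these data involves three layers (instrumentation, storage, and mining), and the data volume grows as services scale. This dissertation proposes three techniques that solve the challenges arising from the growing workload in each layer of the telemetry system without increasing operational complexity, and discusses their effectiveness. It consists of six chapters. Chapter 1 gives an overview of the background and objectives of this research and explains the challenges addressed in this dissertation: growing computing resource usage in the instrumentation layer, growing ingestion load and storage space in the storage layer, and degraded accuracy and execution time of machine-learning-based analysis in the mining layer. Chapter 2 systematically organizes background knowledge on cloud application architecture and reliability and on telemetry systems, and describes the open problems in related research and technologies. Chapter 3 describes a technique for the instrumentation layer that efficiently measures network calls in kernel space without modifying applications. It shows that bundling related network flows inside the operating system kernel greatly reduces instrumentation overhead, even under high network load, without requiring changes to application code. Chapter 4 describes, for the storage layer, a hierarchical data-intensive application architecture that reconciles high-throughput data ingestion with cost-efficient long-term retention. It shows that integrating memory-based and disk-based database systems and tiering data automatically achieves efficient data ingestion within the scope of widely used database technologies. Chapter 5 describes, for the mining layer, a feature reduction framework that improves automated fault localization while reducing the number of time series. It shows that using unsupervised learning to automatically extract features concentrated in the time span in which anomalies occurred makes it possible to identify the failure period accurately and then efficiently remove time series unrelated to the failure. Chapter 6 concludes by summarizing the research on scaling telemetry workloads in cloud applications with operational complexity taken into account, and discusses future prospects.
In the mining layer, we work on improving fault localization accuracy through feature reduction of time series data. We develop a new framework based on feature reduction that improves the accuracy and processing speed of failure analysis using unsupervised machine learning. The method automatically extracts features concentrated in the time span in which anomalies occurred and achieves high accuracy without manual parameter tuning.
It describes a database architecture for reducing the computational load of ingesting telemetry data into storage and the cost of storing the data. In conventional methods, the lookup performed when inserting data becomes slow as the number of variables in the data grows and the index structures inside the database grow in size; in response to this,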
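1066 characters in the Japanese original.
This dissertation addresses the problem of growing data processing load in the instrumentation, storage, and mining layers of telemetry systems, which operators use to monitor and analyze application systems deployed on the cloud. It proposes techniques that respond scalably to this growing load without impairing the operability of the telemetry system, and discusses their effectiveness. The contributions of this dissertation are summarized mainly in the following three points.
The first contribution concerns a method for the instrumentation layer that efficiently measures network communication inside the OS kernel without requiring modification of application code. With conventional methods, in environments where many short-lived network connections occur, the overhead of transferring measurement data from kernel space to user space was significant. The proposed method consolidates related network connections, which greatly reduces the number of measurement data transfers compared with conventional methods, and it is shown to maintain a low instrumentation load even as the number of network connections grows.
The second contribution is a hierarchical storage architecture for the storage layer that loosely couples a memory-based database and a disk-based database to store time series telemetry data. Conventional architectures prioritize disk access efficiency, which constrains the efficiency of memory access. In the proposed architecture, new data are written to a hash-table-based data structure held in the memory-based database, and old data are migrated to the disk-based database. This is shown to achieve substantially higher ingestion performance than conventional architectures, particularly in environments with a large number of time series, without sacrificing the cost efficiency of long-term retention.
The third contribution concerns a method for the mining layer that automatically removes data unrelated to a failure, as preprocessing for the automated identification of failure causes from time series data. Conventional preprocessing methods remove too much or too little data, so fault localization accuracy is insufficient. The proposed method focuses on the change points that cause a failure and, based on the temporal distribution of change points, regards the range with the highest temporal proximity as the failure period, thereby identifying failure-unrelated data with near-ideal accuracy. Furthermore, because it is designed within an unsupervised learning framework, it requires no training data or labeling in advance, and even when the parameter settings for change point detection are not optimal, the subsequent processing stages prevent a marked drop in accuracy. It therefore also helps reduce the effort of parameter tuning. An evaluation that integrates the method with several existing fault localization algorithms shows that it improves fault localization accuracy and shortens processing time.
The academic contribution of this research lies in systematically organizing the challenges of growing computing resource consumption and growing system complexity caused by the increasing workload of operating telemetry systems, proposing a new technique to solve each challenge, and demonstrating its effectiveness. Its social contribution lies in the fact that every proposed technique has been implemented as software and has either been released publicly or been deployed at an actual service provider.
The academic contribution of this research lies in systematically organizing the complex challenges of computing resource consumption and operational complexity that accompany growing telemetry system workloads, and then proposing new techniques to solve each challenge and demonstrating their effectiveness. Its social contribution lies in the fact that each technique has been implemented and released as software, and some of them have been deployed at an actual service provider.
The novelty lies in presenting an integrated approach that combines the three component technologies above to improve the efficiency and scalability of the telemetry system as a whole. Each contribution also builds on existing, widely used technologies and is therefore easy to introduce and deploy in real systems, which gives it practical significance. In addition, the dissertation suggests the importance of context-aware data reduction as a design guideline for telemetry systems. Specifically, it presents guidelines for more effective data reduction that exploits context information in each layer (the instrumentation layer and the mining layer). These contributions are expected to help operators of large-scale cloud applications handle growing telemetry data efficiently, improving system reliability and optimizing operational cost.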
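The mining-layer preprocessing summarized above, treating the time range where change points cluster most tightly as the failure period and dropping series with no change point in it, can be sketched in a few lines. This is a simplified illustration, not the dissertation's algorithm: the threshold-based detector `first_change_point`, the fixed-width sliding window in `densest_window`, the parameters `warmup`, `k`, and `width`, and the sample metrics are all assumptions chosen to keep the example self-contained.

```python
from statistics import mean, pstdev


def first_change_point(values, warmup=5, k=3.0):
    """Naive detector (assumption): first index deviating from the warm-up baseline by more than k sigma."""
    baseline = values[:warmup]
    mu, sigma = mean(baseline), pstdev(baseline)
    threshold = k * max(sigma, 0.1)  # floor avoids a zero threshold on flat baselines
    for t in range(warmup, len(values)):
        if abs(values[t] - mu) > threshold:
            return t
    return None


def densest_window(points, width):
    """Return (start, end) of the half-open window of the given width that holds the most change points."""
    best_start, best_count = None, 0
    for start in sorted(set(points)):
        count = sum(1 for p in points if start <= p < start + width)
        if count > best_count:
            best_start, best_count = start, count
    if best_start is None:
        return None
    return (best_start, best_start + width)


def reduce_series(series, width=3):
    """Keep only the series whose change point falls inside the densest window."""
    cps = {name: first_change_point(vals) for name, vals in series.items()}
    detected = [t for t in cps.values() if t is not None]
    window = densest_window(detected, width)
    if window is None:
        return None, []
    kept = [n for n, t in cps.items() if t is not None and window[0] <= t < window[1]]
    return window, kept


# Hypothetical metrics: two shift near t=8..9, one is stable, one has an unrelated early spike.
metrics = {
    "api.latency": [10, 10, 11, 10, 10, 10, 10, 10, 30, 31, 30, 30],
    "db.errors": [0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 5],
    "cache.hits": [50, 51, 50, 49, 50, 50, 51, 50, 49, 50, 50, 51],
    "cron.cpu": [5, 5, 5, 5, 5, 20, 5, 5, 5, 5, 5, 5],
}
window, kept = reduce_series(metrics)
```

With this data, the change points at indices 8 and 9 form the densest window, so `api.latency` and `db.errors` are kept. `cache.hits` has no change point and `cron.cpu` changes outside the window, so both are dropped before any fault localization algorithm runs. Because the window is chosen from the distribution of all change points, a stray detection such as the `cron.cpu` spike does not by itself decide the result.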