2020__SAC__Black-box inter-application traffic monitoring for adaptive container placement

## Memo - [[Transtracer]]の関連研究 ## Abstract A key issue in the performance of modern containerized distributed systems, such as big data storage and processing stacks or micro- service based applications, is the placement of each container, or container pod, in virtual and physical servers. Although it has been shown that inter-application traffic is an important factor in placement decisions, as it directly indicates how components interact, it has not been possible to accurately monitor it in an application independent way, thus putting it out of reach of cloud platforms. ビッグデータのストレージと処理スタックやマイクロサービスベースのアプリケーションなど、最新のコンテナ化分散システムのパフォーマンスにおける重要な問題は、仮想サーバーと物理サーバーにおける各コンテナ（コンテナポッド）の配置です。アプリケーション間のトラフィックは、コンポーネントがどのように相互作用するかを直接示すため、配置の決定において重要な要素であることが示されていますが、アプリケーションに依存しない方法で正確に監視することはできず、クラウドプラットフォームの手の届かないところに置かれています。 In this paper we present an efficient black-box monitoring ap- proach for detecting and building a weighted communication graph of collaborating processes in a distributed system that can be queried for various purposes, including adaptive placement. The key to achieving high detail and low overhead without custom application instrumentation is to use a kernel-aided event driven strategy. We evaluate a prototype implementation with micro-benchmarks and demonstrate its usefulness for container placement in a distributed data storage and processing stack (i.e., Cassandra and Spark). 本論文では、分散システムにおける協調プロセスの重み付き通信グラフを検出して構築するための効率的なブラックボックスモニタリング手法を紹介します。カスタムアプリケーションのインストルメンテーションなしで、高い詳細度と低いオーバーヘッドを達成するための鍵は、カーネル支援イベント駆動戦略を使用することである。プロトタイプの実装をマイクロベンチマークで評価し、分散データストレージと処理スタック(すなわち、CassandraとSpark)におけるコンテナ配置に有用であることを実証する。 ## Introduction Distributed applications including multiple components and mul- tiple instances of each component are increasingly managed with containers, container pods, and container orchestrators. Containers provide a lightweight virtualization technology in which the op- erating system kernel and many resources can be shared between co-located components, reducing the resulting overhead. Container pods formalize this by making it easy to share specific resources and manage together tightly coupled components. Orchestrators such as Kubernetes take care of managing a pool of resources and placing component instances in them to compose complete services and applications. 複数のコンポーネントや各コンポーネントの複数のインスタンスを含む分散アプリケーションは、コンテナ、コンテナポッド、コンテナオーケストレータで管理されることが多くなってきています。コンテナは軽量な仮想化技術を提供します。この技術では、運用中のシステムカーネルと多くのリソースを、同居しているコンポーネント間で共有することができ、結果として生じるオーバーヘッドを減らすことができます。コンテナポッドは、特定のリソースを簡単に共有し、緊密に結合されたコンポーネントを一緒に管理できるようにすることで、これを形式化しています。Kubernetes のようなオーケストレータは、リソースのプールを管理し、その中にコンポーネントのインスタンスを配置して、完全なサービスやアプリケーションを構成します。 A key issue in determining the performance of an application is the amount of data exchanged between different components. This has been exploited at the VM level in cloud-based environments \[5\] and at the process level in NUMA servers \[14\]. Accurately monitor- ing inter-application traffic without instrumenting application, as needed to dynamically determine the best placement, is however hard to achieve. First, tracing tools that provide detailed informa- tion about data flow need changes to the application \[17\], thus making them unusable at the cloud platform level to be offered as a service. Intercepting and processing all network traffic would be possible in the platform, but would incur in a very large overhead. Kernel-based tracing as in WeaveWorks Scope \[20\] automatically detects processes, virtualized containers and hosts and established connections. Briefly, it captures new connections an closed connec- tions and updates a communication graph accordingly. However, it falls short in quantifying the amount of data exchanged, as it cap- tures only connection establishment and tear-down events.1 As we show, directly generalizing Scope’s approach imposes a significant overhead. アプリケーションのパフォーマンスを決定する上で重要な問題は、異なるコンポーネント間で交換されるデータ量です。これは、クラウドベースの環境では VM レベルで利用されており \[5\]、NUMA サーバではプロセスレベルで利用されています \[14\]。しかし、最適な配置を動的に決定するために必要なアプリケーションを計測することなく、アプリケーション間のトラフィックを正確に監視することは困難です。まず、データフローに関する詳細な情報を提供するトレースツールは、アプリケーションに変更を加える必要があります\[17\]。すべてのネットワークトラフィックを傍受して処理することはプラットフォームで可能ですが、非常に大きなオーバーヘッドが発生します。WeaveWorks Scope \[20\]のようなカーネルベースのトレースは、プロセス、仮想化コンテナ、ホスト、および確立された接続を自動的に検出します。簡単に言えば、新しい接続や閉じた接続をキャプチャし、それに応じて通信グラフを更新します。しかし、コネクションの確立とティアダウンイベントのみをキャプチャするため、交換されたデータ量を定量化するには不十分です1。 In this paper, we aim at efficient monitoring of distributed sys- tems without requiring application-specific instrumentation or knowledge. We achieve it by using eBPF (see Section 2) to intercept key Linux system calls and by judiciously aggregating information in the operating system kernel, within the confines of the limited ability of kernel probes. For purposes of evaluation, we implement a prototype monitoring agent and show that, even operating in a black-box fashion, it builds a weighted graph representation of the system reflecting the amount of data exchanged, and with negligi- ble overhead. As a second contribution, we present an extensive case study detailing how inter-application traffic can be used for container placement, both automatically by the cloud platform itself and by human operators. 本論文では、アプリケーション固有の計測器や知識を必要とせず、分散システムの効率的な監視を目指しています。我々は、カーネルプローブの限られた能力の範囲内で、主要なLinuxシステムコールを傍受するためにeBPF(セクション2参照)を使用し、オペレーティングシステムカーネル内の情報を慎重に集約することによって、それを達成しています。評価の目的で、プロトタイプのモニタリングエージェントを実装し、ブラックボックスで動作しても、交換されたデータ量を反映したシステムの重み付きグラフ表現を構築し、オーバーヘッドを最小限に抑えることを示します。2つ目の貢献として、アプリケーション間のトラフィックをどのようにしてクラウドプラットフォーム自体と人間のオペレータの両方が自動的にコンテナを配置するために使用できるかを詳細に説明する広範なケーススタディを紹介します。 The rest of this paper is organized as follows. Section 2 describes existing operating system tracing approaches. Section 3 describes the design and implementation of our approach. Section 4 validates the proposed approach and Section 5 applies it to a data process- ing system. Finally, Section 6 discusses related previous work and Section 7 concludes the paper. 本稿の残りの部分は以下のように構成されている。第2節では、既存のオペレーティングシステムのトレースアプローチについて説明する。第3節では、我々のアプローチの設計と実装について述べる。第4節では提案されたアプローチを検証し、第5節ではデータ処理システムに適用する。最後に、第6節では関連する過去の研究について議論し、第7節で本稿を締めくくる。 ## 2 BACKGROUND The observation of system calls provides useful insights on which and how components are interacting \[10\]. Our approach builds on using eBPF to intercept the desired information. In this section we provide a brief introduction to this technology and its limita- tions, and to previous usage of eBPF to monitor inter-application connections \[20\]. システムコールを観察することで、コンポーネントがどのように相互作用しているかを知ることができます\[10\]。我々のアプローチは、目的の情報を傍受するために eBPF を使用することに基づいています。このセクションでは、この技術とその限界、およびアプリケーション間接続を監視するための eBPF の以前の使用法 \[20\]について簡単に紹介します。 ## 2.1 Monitoring with eBPF Efficient interception of the execution path of system calls in Linux is currently performed through Extended Berkeley Packet Filter (eBPF), an increasingly popular technology for executing programs passed from user space to kernel. The fundamental idea of eBPF is to attach small custom programs to the available kernel tracepoints and to the entry and exit points of kernel routines. The attached pro- grams are compiled and then executed within a virtual machine in an event-driven fashion, namely at every moment a specific kernel routine is called or returns or when a tracepoint event is dispatched by the kernel. These programs are also capable of performing some sort of filtering, keeping state in data structures (e.g., hashes, arrays, etc.) throughout probe invocation and send events from kernel to user space using ring buffers, which are collected by a frontend program. Linux におけるシステムコールの実行パスの効率的な傍受は、現在、ユーザ空間からカーネルに渡されたプログラムを実行するための技術としてますます普及している Extended Berkeley Packet Filter (eBPF) を通して行われています。eBPF の基本的な考え方は、利用可能なカーネルのトレースポイントとカーネルルーチンのエントリポイントとエグジットポイントに小さなカスタムプログラムをアタッチすることです。アタッチされたプログラムはコンパイルされ、仮想マシン内でイベント駆動で実行されます。つまり、特定のカーネルルーチンが呼び出されたり戻ったり、カーネルによってトレースポイントイベントがディスパッチされるたびに実行されます。これらのプログラムはまた、何らかのフィルタリングを実行したり、プローブの呼び出し中にデータ構造（ハッシュ、配列など）に状態を保持したり、リングバッファを使用してカーネルからユーザ空間にイベントを送信したりすることができます（フロントエンドプログラムによって収集されます）。 When eBPF programs are installed, they become part of the exe- cution path of instrumented kernel routines. Aiming at avoiding kernel panics, data corruption, unbounded overhead and other dan- gerous consequences that compromise correctness, eBPF imposes several restrictions on what can be done in kernel. For instance, loops are not allowed, reading fields of kernel structures requires previously copying them and stack size is limited to 512 bytes. Such restrictive technology demands planning variables declaration, def- inition and instantiation of data structures. Additionally, choosing the kernel routines that provide direct access to data of interest is recommended, otherwise resulting copies of navigating throughout kernel structures for accessing fields would rapidly hit stack size limit. eBPF プログラムがインストールされると、それらはインスツルメンテーションされたカーネルルーチンの実行パスの一部となります。カーネルパニック、データ破損、無制限のオーバーヘッド、その他の正しさを損なうような重大な結果を回避することを目的として、eBPFはカーネル内でできることにいくつかの制限を課しています。例えば、ループは許可されておらず、カーネル構造体のフィールドを読むには事前にコピーする必要があり、スタックサイズは512バイトに制限されています。このような制限のある技術では、変数の宣言、定義、データ構造のインスタンス化を計画する必要があります。さらに、関心のあるデータへの直接アクセスを提供するカーネルルーチンを選択することが推奨され、そうでなければ、フィールドへのアクセスのためにカーネル構造体全体をナビゲートする結果のコピーは、スタックサイズの制限に急速にヒットしてしまいます。 ## 2.2 Monitoring Connections A new network connection is established, independently of the programming language, when system calls connect and accept com- plete and return, respectively, at the client and at the server. The connection is then identified by two IP:PORT pairs. One identifies the client’s endpoint and other the server’s one. In the Linux kernel, it results in a new struct sock data structure, which contains the local and remote addresses and ports. In order to match a client’s connect with a server’s accept, the local address field must match the remote field in the remote process and vice-versa. システムコールがクライアントとサーバでそれぞれ接続したり、接続を受け入れたりすると、プログラミング言語とは無関係に新しいネットワーク接続が確立されます。接続は、2 つの IP:PORT ペアによって識別されます。1 つはクライアントのエンドポイントを、もう 1 つはサーバのエンドポイントを識別します。Linux カーネルでは、ローカルとリモートのアドレスとポートを含む新しい struct sock データ構造が生成されます。クライアントの接続とサーバのアクセプトを一致させるためには、ローカルアドレスフィールドがリモートプロセスのリモートフィールドと一致しなければならず、その逆も同様です。 Instrumenting syscalls directly does not provide the required information to identify connected processes, as file descriptors are meaningless outside the process context. Consequently, instrumen- tation needs to be performed at a lower level in the call stack of each connect/accept syscall. The kernel routines that handle structures containing relevant information are tcp\_connect and inet\_csk\_accept. They can be intercepted to dispatche CONNECT events to the user space, containing details of both the process and the connection. ファイル記述子はプロセスコンテキストの外では無意味であるため、システムコールを直接インストルメンテーションしても、接続されたプロセスを識別するために必要な情報は提供されません。その結果、計測は各 connect/accept システムコールのコールスタックの下位レベルで実行される必要があります。関連情報を含む構造体を扱うカーネルルーチンは tcp\_connect と inet\_csk\_accept です。これらは、プロセスと接続の両方の詳細を含む CONNECT イベントをユーザ空間にディスパッチするためにインターセプトすることができます。 The logic for capturing closed communication channels is a bit distinct from the opening, since a connection may be closed for several reasons besides the explicit close of its corresponding file descriptor. The kernel routine that changes the state of a socket, i.e., tcp\_set\_state, is intercepted and it checks if the new state is set to closed. If so, a CLOSE event is dispatched to user space. 接続は、対応するファイルディスクリプタの明示的なクローズ以外にもいくつかの理由でクローズされる可能性があるため、クローズされた通信チャネルをキャプチャするロジックは、オープンとは少し異なります。ソケットの状態を変更するカーネルルーチン(tcp\_set\_state)を傍受し、新しい状態がclosedに設定されているかどうかをチェックします。もしそうであれば、CLOSE イベントがユーザ空間にディスパッチされる。 ## 3 CAPTURING NETWORK METRICS Our proposal is also to use eBPF to intercept key system calls, but aggregate information within the operating system kernel, within the confines of the limited ability of kernel probes, to limit the amount of data that has to be copied to user mode to assemble the weighted communication graph. 我々の提案はまた、キーシステムコールを傍受するためにeBPFを使用するが、カーネルプローブの限られた能力の範囲内で、オペレーティングシステムカーネル内の情報を集約し、重み付き通信グラフを組み立てるためにユーザモードにコピーされなければならないデータ量を制限することである。 ### 3.1 Monitoring Traffic To use the same approach for measuring traffic we need to consider that, for sending and received messages operations, there are several possible syscalls that can be executed by processes. Specifically, write, sendto and sendmsg can be used to write data to a network communication channel whereas read, recvfrom and recvmsg allows processes to read data from the network channel. トラフィックを測定するために同じアプローチを使用するには、メッセージの送信と受信の操作のために、プロセスが実行することができるいくつかの可能なシスコールがあることを考慮する必要があります。具体的には、write、sendto、sendmsgはネットワーク通信チャネルにデータを書き込むために使用することができ、read、recvfrom、recvmsgはプロセスがネットワークチャネルからデータを読み出すことを可能にします。 In order to measure connection traffic kernel routines with ar- guments containing connection details and size of data need to be instrumented. Specifically, the immediate kernel routines that are common execution paths of syscalls for writing to and reading from network channels are sock\_sendmsg and sock\_recvmsg, respectively. Intercepting the execution of such routines enables us to continu- ously aggregate the total amount of data transferred through each connection. 接続トラフィックを測定するためには、接続の詳細やデータのサイズを含む引数を持つカーネルルーチンを計測する必要があります。具体的には、ネットワークチャネルへの書き込みやネットワークチャネルからの読み込みのためのシスコールの共通の実行パスである即時カーネルルーチンは、それぞれ sock\_sendmsg と sock\_recvmsg です。このようなルーチンの実行を傍受することで、各接続を介して転送されたデータの総量を継続的に集計することができます。 The aggregation of traffic per connection can be performed in user and kernel space. The former triggers events on every send or receive call which are then collected and processed by a frontend program running in user space. This is easy to implement, as it is the default usage of eBPF programs. 接続ごとのトラフィックの集約は、ユーザ空間とカーネル空間で行うことができます。前者は、送信または受信コールごとにイベントをトリガーし、ユーザ空間で実行されるフロントエンドプログラムによって収集・処理されます。これは eBPF プログラムのデフォルトの使用法なので、実装は簡単です。 The latter approach is harder, as it requires information to be stored and correlated within the kernel using the limited resources provided by eBPF and decrease the amount of events passed to user space. 後者のアプローチは、eBPFによって提供される限られたリソースを使用してカーネル内に情報を格納して相関させる必要があり、ユーザ空間に渡されるイベントの量を減らす必要があるため、より困難です。 In detail, kernel probes are attachable at two moments: at the entry point of a kernel routine, i.e. kprobe, and when it returns, i.e. kretprobe. At the entry point, kprobes is able to inspect the arguments of the associated kernel routine. When the kernel routine returns, kretprobes can read the returned value, if any. To extract useful data from kernel probes, it is necessary to store temporary state between both points of instrumentation. For instance, to make a correspondence between returned values and arguments of a kernel routine. 詳細には、カーネルプローブは、カーネルルーチンのエントリポイント、つまり kprobe と、それが戻ってきたとき、つまり kretprobe の 2 つの瞬間にアタッチ可能です。エントリーポイントでは、kprobe は関連するカーネルルーチンの引数を検査することができます。カーネルルーチンが戻るとき、kretprobes は、返された値があれば、その値を読み取ることができます。カーネルプローブから有用なデータを抽出するためには、計測の両ポイント間の一時的な状態を格納することが必要である。例えば、返された値とカーネルルーチンの引数の間の対応を作るために。 In practice, as processes may be calling and returning from this routine interchangeably, it is mandatory to ensure that the returned amount of bytes is associated to the corresponding sock argument. To this end, and leveraging eBPF’s map structures, we keep <pid, sock> entries between kprobe and kretprobe calls, where pid is the kernel identifier of the process and sock the socket it is writing to. Therefore, by accessing pid, we are able to recover the related socket at the exit point and aggregate the amount of bytes. 実際には、プロセスがこのルーチンを交互に呼び出したり返したりすることがあるので、返されるバイト数が対応する sock 引数に関連付けられていることを確認することが必須です。この目的のために、eBPF のマップ構造を利用して、kprobe と kretprobe の呼び出しの間に <pid, sock> エントリを保持します。したがって、pidにアクセスすることで、終了点で関連するソケットを回収し、バイト数を集約することができます。 Moreover, we use a second map to store data transmitted so far for each connection. For probe, we update the total and generate new events only if an elapsed time or amount of data threshold is met. Finally, on connection close, the final traffic amount is also sent to user space. さらに、これまでに送信されたデータを各接続ごとに保存するために、第2のマップを使用します。プローブについては、経過時間やデータ量のしきい値を満たした場合にのみ、合計を更新し、新たなイベントを生成する。最後に、接続終了時には、最終的なトラフィック量をユーザ空間にも送信します。 ## 3.2 Building the Communication Graph Monitoring kernel routines at each endpoint is key for effectively extracting interactions of running processes. However, the data collected on each host only contains the view of one of the two ends in a network channel. 各エンドポイントでカーネルルーチンを監視することは、実行中のプロセスの相互作用を効果的に抽出するための鍵となります。しかし、各ホストで収集されたデータには、ネットワークチャネル内の2つのエンドポイントのうちの1つのビューしか含まれていません。 All events are routed to a central location for processing, that correlates them to establish an interaction between pairs of pro- cesses. It consists of pairing the local and remote address fields with the remote and local fields of other host and thus, it establishes an inter-process interaction. Established interactions can be stored in a graph database, which nodes represent processes and edges the established connections between a pair of processes. The graph representation of inter-process communication enables operators to query the graph for analyzing system’s composition and com- munication between components. For this paper, we chose Apache Kafka for message queuing and Neo4j for storing and querying the communication graph representation. すべてのイベントは、処理のために中央の場所にルーティングされ、プロセスのペア間の相互作用を確立するためにそれらを相関させます。これは、ローカルとリモートのアドレスフィールドを他のホストのリモートとローカルのフィールドとペアリングすることで構成され、プロセス間の相互作用を確立します。確立された相互作用は、ノードがプロセスを表し、エッジがプロセスのペア間の確立された接続を表すグラフデータベースに格納することができます。プロセス間通信をグラフで表現することで、オペレータはグラフを照会してシステムの構成やコンポーネント間の相互作用を分析することができる。本論文では、メッセージキューイングにApache Kafkaを、通信グラフ表現の格納・検索にNeo4jを採用した。 ## 4 EVALUATION To evaluate the proposed approach we first show how eBPF instru- mentation behaves when generating a large number of events, as would result from directly measuring network traffic without any in-kernel aggregation. Then, we evaluate the overhead introduced by our approach to measuring network traffic. 提案アプローチを評価するために、まず、多数のイベントを生成したときに、カーネル内のアグリゲーションを行わずにネットワークトラフィックを直接測定した結果として、eBPFのインストラクションがどのように振る舞うかを示します。次に、ネットワークトラフィックを測定するための我々のアプローチによって導入されるオーバーヘッドを評価します。 For both experiments, we use iperf as a worst-case workload. It is a syscall-intensive network I/O tool for measuring the max- imum achievable bandwidth on IP networks. We set up a single n1-standard-2 instance in Google Cloud Platform equipped with 2 vCPUs and 7.5GB RAM. Within the same instance, we instantiated a localhost iperf server and a client that sends, for six minutes, fixed size messages to the server in order to measure the maximum bandwidth. どちらの実験でも、ワーストケースのワークロードとして iperf を使用しています。iperf は、IP ネットワーク上で最大達成可能な帯域幅を測定するためのシステムコール集約的なネットワーク I/O ツールです。Google Cloud Platformに、2つのvCPUと7.5GBのRAMを搭載した単一のn1-standard-2インスタンスをセットアップしました。同じインスタンス内で、ローカルホストの iperf サーバとクライアントをインスタンス化し、最大帯域幅を測定するためにサーバに固定サイズのメッセージを 6 分間送信しました。 ### 4.1 eBPF Scalability To push an event from kernel to user space, an eBPF program builds a data structure and writes it in a ring buffer connecting both in- kernel program and a frontend program running in user space. The size of the ring buffer is the same as set by default by Scope \[20\], i.e., 256 pages of 4KB. When the ring buffer hits the maximum size, new events overwrite the oldest ones, leading to an increased loss rate. カーネルからユーザ空間にイベントをプッシュするために、eBPFプログラムはデータ構造を構築し、カーネル内プログラムとユーザ空間で動作するフロントエンドプログラムの両方を接続するリングバッファに書き込みます。リングバッファのサイズは、Scope\[20\]でデフォルトで設定されているものと同じで、4KBの256ページです。リングバッファが最大サイズに達すると、新しいイベントが古いイベントを上書きしてしまい、損失率が増大します。 Figure 1 depicts how the loss rate evolves when throughput increases. In iperf, as the size of each message sent from client to server is reduced, the amount of executed syscalls for the same amount of data increases and so the amount of events pushed from kernel to user space. 図 1 は、スループットが増加すると損失率がどのように変化するかを示しています。iperf では、クライアントからサーバに送信される各メッセージのサイズが小さくなると、同じ量のデータに対して実行されるシスコールの量が増え、カーネルからユーザ空間にプッシュされるイベントの量が増えます。 High loss rates mean the kernel program is producing events faster than the frontend program is consuming them. Even allo- cating more pages for ring buffer, which maintain events stored for a longer time and possibly contributing for lower loss rate, the frontend program still needs to process all pushed events, which contributes to resource consumption, as we will show next. This clearly shows why the naive extension of the technique used in Scope \[20\] to monitor inter-application traffic is not viable. 損失率が高いということは、カーネルプログラムがフロントエンドプログラムがイベントを消費するよりも速くイベントを生成していることを意味します。リングバッファのページ数を増やすことでイベントの保存時間を長くし、損失率を下げることができますが、フロントエンドプログラムはプッシュされたイベントをすべて処理する必要があるため、次に示すようにリソースを消費します。このことは、Scope \[20\] でアプリケーション間のトラフィックを監視するために使用されている技術のナイーブな拡張が実行可能ではない理由を明確に示しています。 ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/51cbb24e-d24d-4cb8-8ecf-fd3b00f0c81a/ScreenShot_2021-02-28_at_16.43.56.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/51cbb24e-d24d-4cb8-8ecf-fd3b00f0c81a/ScreenShot_2021-02-28_at_16.43.56.png) ### 4.2 Performance Impact In this second experiment, we aim at measuring the impact of capturing network metrics using eBPF, including on CPU usage. We set iperf ’s message size to 128KB, which caused zero loss rate in the eBPF scalability evaluation. For monitoring CPU usage, we enabled dstat to collect data second by second and discarded the first and the last thirty seconds of measurements. 第2回目の実験では、eBPFを用いてネットワークメトリクスを取得した場合のCPU使用率を含めた影響を測定することを目的としています。iperfのメッセージサイズを128KBに設定し、eBPFのスケーラビリティ評価において損失率がゼロになるようにした。CPU使用率の監視には、dstatで秒単位でデータを収集し、最初と最後の30秒の測定値を破棄するようにしました。 We start by instrumenting kernel routines for capturing events related to open and closed connections, much the same as in Scope. We then extend it to collect network traffic metrics, namely the amount of traffic per connection. This is done in two different ways: First, by pushing events for each send/receive operation and aggregating them in user space (i.e., UserAgg). Second, by aggregating measurements in kernel structures and then pushing events of traffic statistics (i.e., KernelAgg). Scope と同様に、オープン接続とクローズ接続に関連したイベントをキャプチャするためのカーネルルーチンをインスツルメンテーションすることから始めます。次にこれを拡張して、ネットワークトラフィックのメトリクス、すなわち接続ごとのトラフィック量を収集します。これは二つの異なる方法で行われます。1 つ目は、各送受信操作のイベントをプッシュし、ユーザ空間（UserAgg）に集約することです。第二に、カーネル構造内の測定値を集約し、トラフィック統計のイベントをプッシュすることです（すなわち、KernelAgg）。 Figure 2 shows the impact of both alternatives on iperf through- put, as well as on CPU usage. Intercepting only connection estab- lishment (i.e., Conn.) introduces negligible 1% overhead over the baseline, since it only pushes events when a connection opens or closes. In fact, the behavior of iperf supports such results since it only creates a single connection, which is closed in the end of benchmark. 図 2 は、両方の代替案の iperf スループットと CPU 使用率への影響を示しています。接続の確立(すなわち、Conn.)のみを傍受すると、接続が開いたり閉じたりしたときにイベントをプッシュするだけなので、ベースラインよりも無視できる1%のオーバーヘッドしか発生しません。実際、iperfの動作は、ベンチマークの最後に閉じられた単一の接続を作成するだけなので、このような結果をサポートしています。 ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/a5c33e82-dcf3-479f-aafe-8b03d08ca73a/ScreenShot_2021-02-28_at_16.44.23.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/a5c33e82-dcf3-479f-aafe-8b03d08ca73a/ScreenShot_2021-02-28_at_16.44.23.png) In its turn, the Traffic (UserAgg) alternative enables instrumenta- tion on send and receive operations and pushes a single event for each executed operation. The results clearly show a high impact on iperf performance, about 68%, mainly due to CPU usage in user mode. The increased CPU usage in user mode is associated to the significant amount of events pushed from kernel to user space and then consumed by the frontend program, as previously depicted in Figure 1. Again, this shows that a naive extension of Scope is not viable for monitoring inter-application traffic. その代わりに、Traffic (UserAgg) の代替機能は、送受信操作の計測を可能にし、実行された操作ごとに単一のイベントをプッシュします。結果は、主にユーザモードでのCPU使用率が原因で、約68%という高い影響をiperfのパフォーマンスに与えることを明らかに示しています。ユーザモードでのCPU使用量の増加は、図1に示されているように、カーネルからユーザ空間にプッシュされ、フロントエンドプログラムによって消費される大量のイベントに関連しています。このことからもわかるように、Scope のナイーブな拡張はアプリケーション間のトラフィックを監視するためには有効ではないことがわかります。 Our proposal, the Traffic (KernelAgg) alternative, shows that when moving from pushing an event for each operation to period- ically pushing aggregated statistics, the overhead is significantly lower, from 68% to 9%. In detail, and contrarily to user aggregation, the impact on CPU usage is mainly on time spent in sys mode, which is expected since the aggregation is performed in kernel mode. 我々の提案であるTraffic(KernelAgg)の代替案では、各操作ごとにイベントをプッシュする方法から、定期的に集約された統計情報をプッシュする方法に移行すると、オーバーヘッドが68%から9%へと大幅に低下することがわかります。詳細には、ユーザアグリゲーションとは対照的に、CPU使用率への影響は主にsysモードでの使用時間にあり、これはアグリゲーションがカーネルモードで実行されるために予想されることです。 ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/c86e156a-8376-4426-8d92-2353935bdb62/ScreenShot_2021-02-28_at_16.47.27.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/c86e156a-8376-4426-8d92-2353935bdb62/ScreenShot_2021-02-28_at_16.47.27.png) ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/af0a743f-0913-4e16-bd37-cd2103622074/ScreenShot_2021-03-06_at_14.18.10.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/af0a743f-0913-4e16-bd37-cd2103622074/ScreenShot_2021-03-06_at_14.18.10.png) ## 5 CASE STUDY We now validate that the detail obtained by monitoring data ex- change is valuable to determine the best placement of components in physical resources. このように、変化をモニタリングすることで得られる詳細なデータは、物理資源の中でコンポーネントの最適な配置を決定する上で貴重なものであることが確認されました。 ### 5.1 Application scenario We use a layered storage and processing system as the application scenario. For the data storage layer we use the Cassandra NoSQL data store. For the data processing layer we use Spark through its SQL interface. Spark accesses data in Cassandra using the standard spark-cassandra-connector. アプリケーション・シナリオとして、ストレージと処理システムをレイヤー化したものを使用しています。データストレージ層では、Cassandra NoSQLデータストアを使用します。データ処理層には、SQLインターフェイスを介してSparkを使用します。Sparkは標準のspark-cassandra-connectorを使用してCassandraのデータにアクセスします。 This is an interesting case study as this is a typical architecture for Big Data processing in the cloud and its performance depends on various factors such as: data movement between the storage and the processing layer, as Spark scans Cassandra tables; data movement within Spark, as interim results are shuffled between various processing stages needed; memory available to cache origi- nal and interim data; and available CPU time, mainly in the Spark processing layer. これは、クラウドにおけるビッグデータ処理の典型的なアーキテクチャであり、そのパフォーマンスは、SparkがCassandraテーブルをスキャンする際のストレージと処理層の間のデータ移動、必要とされる様々な処理段階の間で中間結果がシャッフルされる際のSpark内のデータ移動、中間データや中間データをキャッシュするために利用可能なメモリ、主にSparkの処理層で利用可能なCPU時間など、様々な要因に左右されるため、興味深い事例です。 For experiments, we populate the Cassandra database with 2 million randomly generated rows, each containing ten bigint and four text fields (500 characters each in average). Each row is thus approximately 2 KiB in size. We use two SQL queries (Q1 and Q2) that we describe later, but that we design to minimize computation, which at the same time reduces the time needed to run tests and highlights monitoring overhead. 実験のために、ランダムに生成された200万行のCassandraデータベースに、それぞれ10個のbigintと4個のテキストフィールド（平均でそれぞれ500文字）を含むデータを入力します。したがって、各行のサイズは約2KiBです。後述する2つのSQLクエリ（Q1とQ2）を使用していますが、計算を最小限に抑えるように設計しているため、テストの実行に必要な時間が短縮され、モニタリングのオーバーヘッドが強調されます。 We use standard unmodified Docker containers for both Spark and Cassandra. The test environment is composed of four n1- standard-4 Google Compute Engine (GCE) instances (4 vCPUs and 15GB RAM), all in the same region and zone. All instances are running the standard Ubuntu 16.04 LTS (xenial) GCE image. Kuber- netes is installed with kubeadm and configured with the Flannel network fabric.2 Each instance is equipped with a local SSD scratch disk used for all container image and volume storage. SparkとCassandraの両方に標準の無修正Dockerコンテナを使用しています。テスト環境は、4つのn1-standard-4 Google Compute Engine (GCE)インスタンス(4 vCPU、15GB RAM)で構成されており、すべて同じリージョンとゾーンに配置されています。すべてのインスタンスは標準のUbuntu 16.04 LTS (xenial) GCEイメージを実行しています。Kuber-netesはkubeadmでインストールされ、Flannelネットワークファブリックで構成されています。2 各インスタンスには、すべてのコンテナイメージとボリュームストレージに使用されるローカルSSDスクラッチディスクが装備されています。 ### 5.2 Limitations of resource monitoring We now establish a baseline by discussing what can be achieved with traditional resource monitoring tools. Table 1 shows resource usage metrics for server instance running each of the queries. ここで、従来のリソース監視ツールで何が達成できるかを議論することで、ベースラインを確立します。表 1 は、各クエリを実行しているサーバーインスタンスのリソース使用量のメトリクスを示しています。 From these results we can observe that query Q2 requires more CPU time than Q1, while transferring less data across the network. Also, data transfer in Q1 is almost balanced while with Q2 instance-4 sends significantly more data than others. これらの結果から、クエリQ2はQ1よりも多くのCPU時間を必要とする一方で、ネットワーク上でのデータ転送量が少ないことがわかります。また、Q1のデータ転送はほぼ均衡していますが、Q2のインスタンス4では他のデータよりもかなり多くのデータを送信しています。 We might thus speculate that workers in Q1 should be con- centrated on a smaller number of server instances, to minimize data exchange. Or that, in contrast, Q2 is amenable to using more worker instances to leverage more CPUs as data exchange seems to be much smaller. But we really do not know for sure, as we don not know whether the data exchange observed is done mainly when scanning data from Cassandra or when shuffling interim results within Spark. したがって、Q1 のワーカーは、データ交換を最小限に抑えるために、より少ない数のサーバーインスタンスに集中しているのではないかと推測される。あるいは、対照的に、Q2ではデータ交換がはるかに少ないように見えるので、より多くのワーカーインスタンスを使用してより多くのCPUを活用することが可能であると考えられる。しかし、観測されたデータ交換が主にCassandraからのデータをスキャンするときに行われているのか、Spark内で中間結果をシャッフルするときに行われているのかはわからないので、本当に確かなことはわかりません。 Actually, this is such an important issue that the Spark interface exposes this information very clearly to allow the operator or an automated optimizer to make decision. This does however assume that the application is known and that the custom interface can be used. Moreover, it assumes that such custom interface is exposed to the cloud provider and that the cloud provider is able to understand the needs of all potential applications. 実際、これは重要な問題であり、Sparkインターフェイスはこの情報を非常に明確に公開し、オペレータや自動化されたオプティマイザが判断できるようにしています。しかし、これはアプリケーションが既知であり、カスタムインターフェースを利用できることを前提としている。さらに、そのようなカスタムインターフェースがクラウドプロバイダーに公開され、クラウドプロバイダーがすべての潜在的なアプリケーションのニーズを理解していることも前提としている。 The challenge is thus: How to make resource allocation and job placement decisions based solely on data that can be obtained with- out knowledge of the application, i.e., while treating the application as a black-box? 課題はこのようになっています。アプリケーションをブラックボックスとして扱いながら、アプリケーションの知識がなくても得られるデータだけに基づいて、どのようにしてリソースの割り当てや就職先の決定を行うのでしょうか? ### 5.3 Using data exchange monitoring Taking advantage of information about data exchanged in the sys- tem represented as a connection graph as described in Section 3, we improve our understanding of host affinity. Figure 3 shows heat- maps of inter-server traffic computed by projecting raw connection graphs on physical server identifiers. In contrast to information collected by cloud provider monitoring in the hypervisor, we also get information about data exchanged within the server, among the various processes. This points out a clear different between Q1 and Q2: Data exchange in Q1 is actually higher than Table 1 has shown. The majority of traffic in Q1 is within servers, as shown by the main diagonal. Based on our knowledge of Spark and Cassandra, we can derive this happens because Q1 is focused on scanning data as, with each instance has one Spark worker and one Cassandra server, this is the only way to get intra-host data exchange. セクション３で述べた接続グラフとして表現されたシステム内で交換されるデータの情報を利用して、ホストのアフィニティの理解を深める。図 3 は、生の接続グラフを物理的なサーバ識別子に投影して計算したサーバ間トラフィックのヒートマップである。クラウドプロバイダがハイパーバイザ内で監視して収集した情報とは対照的に、サーバ内で様々なプロセス間で交換されたデータに関する情報も得られます。このことから、第1四半期と第2四半期の間に明らかな違いがあることがわかります。主な対角線で示されているように、Q1のトラフィックの大部分はサーバー内で行われています。SparkとCassandraの知識に基づいて、各インスタンスには1つのSparkワーカーと1つのCassandraサーバーがあり、これがホスト内のデータ交換を得るための唯一の方法であるため、Q1はデータのスキャンに焦点を当てているため、このようなことが起こると推測することができます。 In contrast, Q2 has little intra-host traffic that is mostly uniform except for instance-4 server that sends substantially more data to all other hosts. In fact, instance-4 is where Spark master container, hence also spark-submit, is running. We can speculate that the traffic is related to shuffling in Spark, but we are not really sure, as Cassandra servers also exchange some data. These conclusions are however still not good enough, even if we had to use our knowledge of the application derive them and they cannot easily be automated for general workloads. 対照的に、Q2では、他のすべてのホストへのデータ送信が実質的に多いインスタンス-4サーバを除いて、ほとんど一様なホスト内トラフィックはほとんどありません。実際、インスタンス-4はSparkマスターコンテナ、したがってspark-submitも実行しています。このトラフィックはSparkのシャッフルに関連していると推測できますが、Cassandraサーバーもデータを交換しているので、はっきりとはわかりません。しかし、これらの結論は、アプリケーションの知識を使用してそれらを導出しなければならなかった場合でも、まだ十分ではなく、一般的なワークロードのために簡単に自動化することはできません。 ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/4ac8000b-e181-4a60-8eb9-48b55ae2c74d/ScreenShot_2021-03-06_at_14.21.10.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/4ac8000b-e181-4a60-8eb9-48b55ae2c74d/ScreenShot_2021-03-06_at_14.21.10.png) We can get actual evidence to confirm this speculation and avoid making use of application-specific knowledge by projecting the raw connection graphs on process types (i.e., their command lines), as shown in Figure 4. Although this is achieved with a similar query, it provides substantially different information, as it aggre- gates information from different server instances as long as they are running processes with the same command line. This information confirms that Q1 is transferring data from Cassandra servers to Spark CoarseGrainedExecutorBackend processes, which are part of the Spark worker container. In addition, it confirms that most data transferred in Q2 is between Spark workers. 図4に示すように、生の接続グラフをプロセスタイプ(すなわち、そのコマンドライン)に投影することで、この推測を確認し、アプリケーション固有の知識を利用しないようにするための実際の証拠を得ることができる。これは似たようなクエリで実現されているが、同じコマンドラインでプロセスを実行している限り、異なるサーバ・インスタンスからの情報を無視しているため、実質的に異なる情報を提供することになる。この情報は、Q1がCassandraサーバーからSparkワーカーコンテナの一部であるSpark CoarseGrainedExecutorBackendプロセスにデータを転送していることを確認しています。さらに、Q2で転送されたデータのほとんどがSparkワーカー間であることを確認しています。 A better understanding of what is happening can be obtained by observing Figure 5 that shows how processes map to server instances, as rendered by Graphviz. In detail, we set: node height from used RAM; node line width from average CPU used; edge width from amount of bytes exchanged (both directions) between processes; and colors from commands and command-pairs for nodes and edges, respectively. We also remove all edges that correspond to less than 10% of traffic, to improve clarity. From colors, we can easily glance that most of the data exchanged in each query is between different processes. From node line widths, we can see that CPU usage by Cassandra is more relevant in Q1. In short, it is clear that Q1 is performing a scan of a large amount of data and that Q2 is reading little data from Cassandra but performing a computation that requires shuffling. The next challenge is how to take advantage of these conclusions in both an automated application independent way and manually by exploiting all possibilities in the application. 具体的には、ノードの高さは使用RAMから、ノードの線幅は平均CPU使用量から、エッジの幅はプロセス間でやり取りされたバイト数（双方向）から、ノードとエッジのコマンドとコマンドペアの色をそれぞれ設定しています。また、わかりやすくするために、トラフィックの10%未満に該当するエッジはすべて削除しています。色から、各クエリで交換されるデータのほとんどが異なるプロセス間であることが容易にわかります。ノードの線幅から、CassandraによるCPU使用率がQ1の方が関連性が高いことがわかります。つまり、Q1は大量のデータのスキャンを実行しており、Q2はCassandraからほとんどデータを読み込んでいないが、シャッフルを必要とする計算を実行していることがわかります。次の課題は、アプリケーションに依存しない自動化された方法と、アプリケーションのあらゆる可能性を活用した手動の両方で、これらの結論をどのように活用するかということです。 ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/4856bd1a-edce-4a01-8039-e603b9e98f82/ScreenShot_2021-03-06_at_14.21.20.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/4856bd1a-edce-4a01-8039-e603b9e98f82/ScreenShot_2021-03-06_at_14.21.20.png) ### 5.4 Automatic placement In this section we show how the presented approach is compatible with techniques for improving the placement of the workload in an automated way without any application-specific knowledge. As this is something that can be done by the cloud provider, we set as the goal to pack the workload in the minimal number of servers and, while doing it, to minimize network traffic with minimal impact in user visible performance. This would allow, for instance, the cloud provider to reduce the number of physical servers that need to be powered on and to reduce a source of congestion. このセクションでは、アプリケーション固有の知識を必要とせず、ワークロードの配置を自動化して改善する技術との互換性を示す。これはクラウド・プロバイダーが行うことができることであるため、我々はワークロードを最小限のサーバー数に詰め込むことを目標とし、その一方でネットワーク・トラヒックを最小限に抑え、ユーザーの目に見えるパフォーマンスへの影響を最小限に抑えることを目標としている。これにより、例えば、クラウドプロバイダーは、電源を入れる必要のある物理サーバーの数を減らし、混雑の原因を減らすことができるようになります。 To this end, we resort to Pyevolve, a genetic algorithm frame- work written in pure Python, to build a simple optimizer that takes an initial placement of containers in servers, each of them corre- sponding to a set of processes, and outputs an optimized placement as the end result. The genome is thus a simple vector, with one element for each placeable component (container), and an integer value identifying each possible location (server), which is supported natively by Pyevolve. The fitness function for each individual out- puts the product of three factors: optimal result for each server where CPU cores are expected to be fully used, with a penalty for each underused server and a (larger) penalty for each overused server; optimal result for each server where RAM is expected to be fully used, with a penalty for each underused server and a (larger) penalty for each overused server; optimal result for no cross-server communication, with a penalty for all data transferred. この目的のために、純粋なPythonで書かれた遺伝的アルゴリズムのフレームワークであるPyevolveを利用して、単純なオプティマイザを構築しました。これは、サーバ内のコンテナの初期配置を受け取り、それぞれのコンテナは一連のプロセスに対応しており、最終的な結果として最適化された配置を出力します。ゲノムは、配置可能なコンポーネント（コンテナ）ごとに1つの要素を持ち、それぞれの配置可能な場所（サーバ）を識別する整数値を持つ単純なベクトルであり、Pyevolveでネイティブにサポートされています。個々のアウトプットに対するフィットネス関数は、3つの要素の積です。CPUコアが完全に使用されると予想される各サーバに最適な結果が得られ、使用されていないサーバにはペナルティが、使用されすぎたサーバには（より大きな）ペナルティが課せられます。 Using this optimization strategy, we produce placements con- sidering each of the benchmark queries. For query Q1 (Scan Opt.), the result is to place two Spark workers and two Cassandra servers in each server instance. For query Q2 (Shuffle Opt.), the result is to place three Cassandra servers in one server instance, and the remaining Cassandra together with all Spark workers in a second instance. In both cases, this corresponds to using only 50% of the resources initially allocated. この最適化戦略を使用して、ベンチマーククエリのそれぞれを考慮した配置を作成します。クエリQ1(Scan Opt.)では、結果は、各サーバーインスタンスに2台のSparkワーカーと2台のCassandraサーバーを配置する。クエリQ2(Shuffle Opt.)の場合、結果は、1つのサーバーインスタンスに3台のCassandraサーバーを配置し、残りのCassandraはSparkワーカーと一緒に2つ目のインスタンスに配置する。どちらの場合も、これは最初に割り当てられたリソースの50%しか使用しないことに相当します。 We then translate these results into placement constraints in Kubernetes manifests and redeploy and retest both queries with both placements to compare the results with the initial values. The results in terms of exchanged network traffic are shown in Table 2. It can be observed that although both strategies reduce inter- host traffic, which is expected as the number of hosts is reduced, using the strategy that is informed by inter-process data exchanged obtained from instrumentation of read and write operations results in greater reduction. Query runtime is degraded between 9% and 14%, always less with the Scan Opt. strategy based on monitoring of Q1, which results in a more balanced CPU distribution. 次に、これらの結果をKubernetesマニフェスト内の配置制約に変換し、両方のクエリを両方の配置で再配置して再テストし、初期値との比較を行います。交換されたネットワークトラフィックの観点からの結果を表2に示します。ホスト数の減少に伴って予想されるホスト間トラフィックの削減は、どちらの戦略でも行われていますが、読み書き操作のインストルメンテーションから得られるプロセス間のデータ交換に基づいた戦略の方が、より大きな削減効果が得られることがわかります。クエリ実行時間は、Q1の監視に基づいたScan Opt.戦略では9%から14%の間で低下しますが、常に低下しており、CPU分布のバランスが取れています。 ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6bc0ee45-5aed-4d88-baa5-b8a8c40ef733/ScreenShot_2021-03-06_at_14.29.43.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6bc0ee45-5aed-4d88-baa5-b8a8c40ef733/ScreenShot_2021-03-06_at_14.29.43.png) ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/97b59575-20e3-49e3-93e6-f621a79b0e1d/ScreenShot_2021-03-06_at_14.30.16.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/97b59575-20e3-49e3-93e6-f621a79b0e1d/ScreenShot_2021-03-06_at_14.30.16.png) ### 5.5 Manual optimization In this section we focus on improving application-specific configu- ration but using only results obtained from black-box monitoring. This is something that currently cannot be done easily by a cloud provider and the orchestration system, but is interesting to the ap- plication owner and operator. This way the operator can overcome the lack of appropriate application specific monitoring tools and, most interestingly, can use a single monitoring tool for complex systems assembled from a variety of components, as is typical in Big Data storage and processing. このセクションでは、ブラックボックス監視から得られた結果のみを使用して、アプリケーション固有の設定を改善することに焦点を当てる。これは、現在のところクラウドプロバイダーやオーケストレーションシステムでは簡単にはできないことですが、アプリケーションの所有者や運用者にとっては興味深いことです。このようにして、オペレータは適切なアプリケーション固有の監視ツールの不足を克服することができ、最も興味深いことに、ビッグデータのストレージと処理に典型的なように、様々なコンポーネントから構成された複雑なシステムに対して単一の監視ツールを使用することができます。 First, aiming at optimizing for the Q1 workload, we modify Ku- bernetes manifests to deploy each Spark worker together with a Cassandra server within the same pod (Scan Opt.). This makes them have the same IP address and allows Cassandra servers to be recognized as local in the default topology detector. Second, targeting the workload of Q2, we create an alternative deployment configuration that uses only one worker with four times as much resources assigned (Shuffle Opt.) in terms of CPU cores and RAM. まず、Q1ワークロードの最適化を目的として、Ku-bernetesのマニフェストを修正し、各SparkワーカーをCassandraサーバーと一緒に同じポッド内に配置します（Scan Opt. これにより、同じIPアドレスを持つようになり、デフォルトのトポロジー検出器でCassandraサーバをローカルとして認識できるようになります。次に、Q2のワークロードをターゲットにして、CPUコアとRAMの点で4倍のリソースが割り当てられた1つのワーカーのみを使用する代替のデプロイ構成を作成します（Shuffle Opt. We then run each of the workloads in the corresponding config- uration. Figure 6 shows that with Q1 almost all data exchanged is within the same server instance. In contrast, in Q2 data is sent to only a single server instance that is running the worker process. Although in terms of inter-host network traffic this seems so differ- ent, Figure 7 shows that actually they both correspond to the same thing: Data is only sent almost exclusively from Cassandra servers to Spark workers. 次に、それぞれのワークロードを対応する設定で実行します。図 6 によると、Q1 では、交換されるデータのほとんどすべてが同じサーバーインスタンス内で行われていることがわかります。対照的に、Q2 では、データはワーカープロセスを実行している単一のサーバーインスタンスにのみ送信されます。ホスト間のネットワークトラフィックという点では、これは非常に異なっているように見えますが、図7によると、実際には両者は同じことを示しています。 A better understanding of what has changed can be obtained by observing Figure 8 that shows how processes map to server instances as rendered by Graphviz. This figure directly compares to the original Figure 5, where the same colors are used for the same process types and inter-process links. The consequence of this change is shown in Table 3a, showing that the first optimization reduces runtime of Q1 by 12% and the second reduces the runtime of Q2 by 29%. Table 3b shows the impact in network traffic, which is particularly dramatic with the scan optimization and the Q1 workload. Graphvizでレンダリングされたサーバーインスタンスへのプロセスのマッピングを示す図8を見れば、何が変わったかをよりよく理解することができる。この図は、元の図5と直接比較すると、同じプロセスタイプとプロセス間のリンクに同じ色が使用されている。この変更の結果を表3aに示します。最初の最適化によってQ1のランタイムが12%短縮され、2番目の最適化によってQ2のランタイムが29%短縮されることがわかります。表3bはネットワークトラフィックの影響を示しており、特にスキャン最適化とQ1の作業負荷で劇的に変化しています。 The difference between these results and those of Section 5.4 lead to an interesting conclusion. As long as one needs application specific tools to monitor data exchange within distributed data storage and processing systems, it makes sense that configuration such as needed for the optimization of Q2, by trading workers processes for additional resources, are also performed in application specific ways. In fact, this configuration step for Spark needs to be done twice: One by setting the number of cores for each worker in Spark configuration and the other in Kubernetes manifests to make resources available. これらの結果と 5.4 節の結果との違いから、興味深い結論が導き出された。分散データストレージと処理システム内のデータ交換を監視するためのアプリケーション固有のツールが必要である限り、ワーカープロセスを追加リソースと交換することでQ2の最適化を行うために必要な設定も、アプリケーション固有の方法で行われることは理にかなっている。実際、Sparkのこの設定ステップは、Sparkの設定で各ワーカーのコア数を設定することと、Kubernetesのマニフェストでリソースを利用可能にするための設定の2回行う必要があります。 However, now that we can infer the need for this optimization in an application-independent way, it would be interesting to have more standard and automatic ways to do this configuration. This would allow, for example, a Kubernetes controller that can perform such trade-off when informed by a monitoring system capable of collecting connection traffic network metrics. This would nicely complement the ability of Spark to detect and adapt to data locality. しかし、アプリケーションに依存しない方法でこの最適化の必要性を推論できるようになった今、この設定を行うためのより標準的で自動化された方法があれば面白いと思います。例えば、接続トラフィックのネットワークメトリクスを収集するモニタリングシステムから情報を得たときに、Kubernetesコントローラがこのようなトレードオフを実行することができるようになる。これはSparkのデータロカリティを検出して適応する能力を補完するものになるだろう。 ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/777a962f-1265-42bb-a868-3c2e04dffca7/ScreenShot_2021-03-06_at_14.32.17.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/777a962f-1265-42bb-a868-3c2e04dffca7/ScreenShot_2021-03-06_at_14.32.17.png) ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/4a35c727-698d-4a15-b094-86bcf37ddb27/ScreenShot_2021-03-06_at_14.32.52.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/4a35c727-698d-4a15-b094-86bcf37ddb27/ScreenShot_2021-03-06_at_14.32.52.png) ![https://s3-us-west-2.amazonaws.com/secure.notion-static.com/7e646f36-8cfa-4549-9187-d53025f84ba7/ScreenShot_2021-03-06_at_14.33.04.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/7e646f36-8cfa-4549-9187-d53025f84ba7/ScreenShot_2021-03-06_at_14.33.04.png) ## 6 RELATED WORK Monitoring is a critical part of distributed systems deployments and therefore, not only there are several monitoring tools available, as there is an increasing research effort to build more capable monitor- ing systems. Depending on how components are instrumented and what metrics can be observed, current monitoring for distributed systems operate at system and application levels. モニタリングは分散システムの導入において重要な部分であり、利用可能なモニタリングツールがいくつかあるだけでなく、より高性能なモニタリングシステムを構築するための研究努力も増えています。コンポーネントがどのように計測され、どのようなメトリクスが観察できるかによって、現在の分散システムのモニタリングはシステムレベルとアプリケーションレベルで動作します。 The goal of system monitoring is to collect physical and virtual infrastructure metrics, such as containers and virtual machines, or, in other words, to monitor computational resource metrics. These metrics are part of a wide set of processor, memory, network and disk metrics. Popular tools like Ganglia \[9\], Nagios \[19\], Zabbix \[21\], Riemann \[16\], Prometheus \[18\] and MonALISA \[11\] provide an agent that periodically collects metrics during its execution and push them into a monitor server. However, they are unable to present resource utilization by collaborating processes within a distributed system. システム監視の目的は、コンテナや仮想マシンなどの物理・仮想インフラストラクチャのメトリクス、言い換えれば計算リソースのメトリクスを収集することです。これらのメトリクスは、プロセッサ、メモリ、ネットワーク、ディスクなどの幅広いメトリクスの一部です。Ganglia \[9\]、Nagios \[19\]、Zabbix \[21\]、Riemann \[16\]、Prometheus \[18\]、MonALISA \[11\]のような一般的なツールは、実行中に定期的にメトリクスを収集してモニタサーバにプッシュするエージェントを提供しています。しかし、これらのエージェントは分散システム内のプロセスを連携させてリソース利用率を提示することはできない。 On a higher level, application-level monitoring relies on custom agents, typically one per programming language, for instrument- ing libraries and application’s source code in order to trace the execution path of requests. Instrumentation is useful to identify unusual behavior patterns that may help to identify the root-cause of a given malfunction or misconfiguration. Google’s Dapper \[15\] is a tracing infrastructure used in Google services that provides more details about the behavior of Google’s infrastructure. Specifically, it records request flow by annotating messages that are sent through standard communication protocols such as Remote Procedure Calls (RPC) or HTTP, which can be used to diagnosis latency in multi- tier black-box services \[13\]. Other approaches such as D-Trace \[2\], Magpie \[1\], X-Trace \[4\] or its variant Pivot Tracing \[8\] try to cap- ture causality between distributed events to accurately pinpoint the root-cause of software anomalies. より高いレベルでは、アプリケーションレベルのモニタリングは、リクエストの実行パスを追跡するために、ライブラリやアプリケーションのソースコードを計測するために、通常はプログラミング言語ごとに1つのカスタムエージェントに依存しています。計測は、特定の誤動作や誤設定の根本原因を特定するのに役立つかもしれない異常な動作パターンを特定するのに有用である。GoogleのDapper \[15\]は、Googleのサービスで使用されているトレースインフラストラクチャで、Googleのインフラストラクチャの動作についてより詳細な情報を提供します。具体的には、リモートプロシージャコール(RPC)やHTTPなどの標準的な通信プロトコルを介して送信されるメッセージにアノテーションを付けることでリクエストフローを記録し、多階層ブラックボックスサービス\[13\]の遅延診断に使用することができます。D-Trace \[2\]、Magpie \[1\]、X-Trace \[4\]、またはその変種であるPivot Tracing \[8\]のような他のアプローチは、ソフトウェアの異常の根本原因を正確に特定するために、分散イベント間の因果関係を捕捉しようとしています。 All these approaches tackle take advantage from any prior knowl- edge about the system, namely when administrators have easy access to the infrastructure where the system is running, or instru- ment libraries or application’s source code in order to trace the execution path of requests. In order to alleviate this pain, there is an ongoing effort on separation of concerns and standards for context propagation for distributed systems. Canopy \[6\] is the Face- book’s end-to-end performance tracing infrastructure and identifies challenges and addresses them by decoupling aspects of context propagation, instrumentation and trace representation. In \[7\] the authors propose a layered architecture to separate the concerns of system developers and tool developers, enabling independent instrumentation of systems, and the deployment and evolution of multiple tools. OpenTracing \[17\] is an ongoing effort to standardize the APIs and instrumentation for distributed tracing. これらのアプローチはすべて、管理者がシステムが実行されているインフラストラクチャに簡単にアクセスできたり、リクエストの実行パスを追跡するためにライブラリやアプリケーションのソースコードを指示したりすることで、システムに関する事前の知識を活用することに取り組んでいます。この痛みを軽減するために、分散システムのための懸念の分離とコンテキスト伝搬のための標準化に向けた継続的な努力が行われている。Canopy \[6\] は Face book のエンドツーエンドのパフォーマンストレースインフラストラクチャであり、コンテキスト伝搬、計測、トレース表現の側面をデカップリングすることで課題を特定し、それに対処しています。\[7\]では、著者らはシステム開発者とツール開発者の関心事を分離し、システムの独立した計測を可能にし、複数のツールの展開と進化を可能にするレイヤードアーキテクチャを提案しています。OpenTracing \[17\]は、分散トレースのためのAPIとインストルメンテーションを標準化するための進行中の取り組みである。 When targeting the increasingly popular scale-out distributed systems (e.g., for Big Data) or loosely-coupled micro-services de- signs, it is hard to monitor process interactions in detail, for example, for placement decisions, without either instrumenting each applica- tion or incurring in excessive tracing overhead. Several approaches present different strategies to diagnose distributed systems. One proposal is an online and scalable method to infer the influence between components, in order to understand how a change in a component X can affect the other components \[12\]. This approach converts log time-stamped entries with raw measurements into signals and correlates them, allowing the administrator to answer queries regarding the influences on other components. Besides the requirement to log time-stamped entries with raw measurements, it relies on the influence between components which does not con- tribute to map their direct communication. Similarly, Iprof \[22\] is a request flow profiling tool that, from the statistically analysis of bi- naries and log parsing, infers the execution flow from runtime logs in black-box distributed systems. The approach in \[3\] generates models for black-box embedded systems based on a timestamped sequence of events, which result is a dependency graph of the sys- tem. However, the used algorithm is exponential to the number of events. To run this algorithm in polynomial time, some heuristics are considered, compromising the accuracy of the output model. ますます普及しているスケールアウトした分散システム（ビッグデータなど）や、疎結合のマイクロサービスを対象とした場合、各アプリケーションを計測したり、過度のトレースオーバーヘッドを発生させたりすることなく、プロセスの相互作用を詳細に監視することは困難です。分散システムを診断するための異なる戦略がいくつかのアプローチで提案されています。1つの提案は、コンポーネント間の影響を推論するためのオンラインでスケーラブルな方法で、コンポーネントXの変化が他のコンポーネントにどのように影響を与えるかを理解するためのものです\[12\]。このアプローチでは、生の測定値を含むログのタイムスタンプ付きエントリを信号に変換し、それらを相関させることで、管理者が他のコンポーネントへの影響に関する問い合わせに答えることができるようにします。生の測定値でタイムスタンプのついたエントリをログに記録するという要件に加えて、それはコンポーネント間の影響に依存しており、コンポーネント間の直接の通信をマッピングすることはできません。同様に、Iprof \[22\]はリクエスト・フロー・プロファイリング・ツールであり、バイナリの統計的分析とログ解析から、ブラックボックス分散システムのランタイム・ログから実行フローを推測する。\[3\]のアプローチでは、タイムスタンプの付いた一連のイベントに基づいてブラックボックス型の組み込みシステムのモデルを生成し、その結果としてシステムの依存性グラフを生成しています。しかし，このアルゴリズムはイベントの数に対して指数関数的である．このアルゴリズムを多項式時間で実行するために、いくつかのヒューリスティックが考慮されていますが、出力モデルの精度を損なうことになります。****