ISSRE2024 Logs - yuuk1's Digital Garden

[ISSRE 2024](https://issre.github.io/2024/index.html)

- [[ISSRE 2024トレンド]]

# Day1

## WDMD

https://issre-wdmd.github.io/#/program

### Keynote 3: Towards building fault-tolerant and compression-accelerated HPC systems

Guanpeng Li

Jun Ai or AI-powered software reliability engineering and its application

- Device Circuit level
- Architecture level
- OS level
- Application level

Error Propagation in programs

The goal is to pay the minimum overhead to reach the target reliability.

Selective Instruction Duplication

SDC Rate

- Developing Fault-Tolerant Applications
  - HPC applications source code level
  - Evaluate program SDC rate -> (not acceptable) Measure instruction SDC rates -> Selective Protection -> return to the first step
  - New Release (Acceptable)
- Our approach: Trident
  - Three-level modeling
    - Register-communication
    - Control-flow
    - Memory dependency
  - Days to minutes

Is your approach specific to HPC systems?

- HPC fault tolerance.
- mitigating soft errors
- fault injections

The size of cause trident program how

- training to inference
- data flows
- data flow programs
- different approach

### Keynote 4 Reliability Analysis and Evaluation of Computing Network

> Xing Pan is director of the Security Center for Systems and Intelligent Systems, head of the Department of Safety Science and Engineering, quality management expert of AVIC Group, and editorial board member of Systems Engineering and Electronic Technology. His research interests include reliability and risk analysis, system/system engineering theory and method, human-machine system safety analysis and human-factor reliability analysis.
> The rapid progress of cloud computing, high-performance computing, and artificial intelligence (AI) technology, and especially the emergence of large language models ([[LLM]]), has not only increased the demand for powerful computing capability within individual devices but also highlighted the need for computing networks that can support data transmission and communication consistently, stably, and efficiently. Unlike traditional networks, computing networks are large in scale, have long operational life cycles, and have complex service profiles, so they pose unique challenges for reliability evaluation and enhancement. There is an urgent need for methods to evaluate the reliability of computing networks quickly and accurately and to take improvement measures. This work therefore presents a reliability analysis and verification method for computing networks based on the V-model of [[システム工学|systems engineering]]. It targets three hierarchical levels (connectivity, performance, and service) and aims to analyze both single failure modes and coupled failure modes. It further proposes a method for generating the service profiles and flows of a computing network. Finally, it carries out a reliability simulation evaluation, providing an evaluation basis and guidelines for optimizing the reliability design of computing networks.

- Founded by Huawei
  - [What Is a Computing Network? Why Do We Need Computing Power and Computing Networks? - Huawei](https://info.support.huawei.com/info-finder/encyclopedia/en/Computing+network.html)
- 1. Background
  - HPC
  - Network reliability
- 2. Concept
  - Collective communications (Logical)
    - Scatter, Broadcast, Reduce, Broadcast
    - distributed training
  - interconnected topology (Physical)
    - Spine-leaf
    - Torus
    - Dragonfly
  - Computing Network
    - Mapping logical network to physical network
    - Three layers
  - Concept of network reliability
    - Service/Logic > Performance > Topology/Physics
    - Fault-oriented
      - Reliability concerns about faults
      - The bathtub curve
  - Outline for research
    - Gap: characteristics vs. concerns of reliability
    - How to analyze?
    - How to evaluate?
    - How to design/optimize?
    - Systems Engineering Vee-model
      - Analysis: fault analysis, service analysis
      - Evaluation: level model and test method
- 4.
  - Failure cause
    - Working stress, environment stress
  - Failure modes
    - Service failure, performance failure, connectivity failure
  - Test profile
    - service profile <-> fault sampling <-> environment profile
- 5.
  - Probability that the network meets performance
  - Weighted performance reliability
  - Service completion probability
  - Simulation based on NS3
    - Network traffic saturation, T=3.4s, 9.4s
- 6.
- we need to know the language models to walk i need show the process of language models use some process to show the walks process,
- Next step, language model NS3 training how to lock. 10 minutes

### Keynote 4: Parallelism in LLMs: Beyond Data, Tensor, and Pipeline Parallelism

> Large language models ([[LLM]]) require enormous computational resources for training and effective deployment. Techniques such as data parallelism, tensor parallelism, and pipeline parallelism have become the standard approaches for distributing this workload, but the next frontier of parallelism promises to push the limits of model scalability and efficiency further. This talk explores new methods and strategies for parallelization beyond the current paradigms, focusing on optimizing memory use, improving inter-node communication, and exploiting hardware advances. It also discusses the challenges in scaling LLMs and how future innovations in parallel processing could deliver unprecedented performance gains.

- AI for Science
  - Tackling Scientific Challenges that are Intractable w/o AI
  - AI4Science -> HPC
  - RIKEN
- Parallelism
  - Used in production
    - Data parallel
    - Tensor parallel
    - Pipeline parallel
    - Sequential/Context parallel (CP)
    - Fully Sharded data parallel (FSDP) -> not really parallelism
  - Research
    - Expert parallel
    - Computational graph partitioning
    - Hybrid fully sharded tensor-data parallel
- 4D Parallelism (Llama 3.1)
  - TP+CP+PP+DP
  - [\[2407.21783\] The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783)
- Data parallel
  - Model replicated at each worker
- Tensor parallel
  - Each block divided among workers
  - COMM per iteration
  - Z = XAB
    - backward gradient
    - B A
    - distributed A and B into row and column
  - Transformer block
- PP
  - different layers on different workers
  - COMM per iteration: per layer, several rounds of P2P
  - Many schemes for pipeline
    - SOTA is the almost-zero pipeline scheme (ICLR 2024)
- FSDP
  - Primary advantage is memory efficiency rather than outright speed improvement
  - hybrid
    - GPT3-175B
    - TP inside the node
- CP
  - distributing long sequences among GPUs as short contiguous segments
  - Communication overhead due to token inter-dependence
  - Nvidia sequence parallelism
  - MS DeepSpeed-Ulysses
- Expert Parallel [EP]
  - MoEs is NOT different experts combined together
  - Approaches to represent different experts: blocks, heads, MLPs/FFN
  - Misconception: "tokens of the same topic go to the same expert"
  - Tokens routed to different FFN experts
  - COMM per iteration: per layer, 2 rounds of AlltoAll
- Hybrid fully sharded tensor-data parallel
- Future of GPUs
  - Your AI benchmark now only needs data-parallelism
- Inference Scaling Laws

### Panel: Reliability Technologies for AI training/Inference Systems

Chair: Zheng zhen

## HFSD

https://hfsdworkshop.github.io/index.html

Paper: Failing and Learning: A Study of What is Learned about Reliability from Software Incidents

Jonathan Sillito (Brigham Young University, USA) and Matt Pope (Brigham Young University, USA)

# Day1

## Opening Remarks

About 100 participants. Papers can be submitted to the EMSE journal.

- Industry
  - Total 33
  - Accepted 15
  - rate 45.5%
  - 26 reviewers
  - 81 reviews
  - Mostly from China, Germany, the US, and Japan.
  - reviewers
    - JAXA Tutomu Kobayashi
    - Apple, Meta, Huawei, etc.

## KEYNOTE TALK 1 Quality Assurance of AI-based Systems

https://issre.github.io/2024/program_keynotes.html#hiroshimaruyama

Hiroshi Maruyama

> Abstract: The rapid progress of AI technology in recent years poses major challenges for the quality assurance (QA) of systems that adopt it, because its internal workings are opaque. This talk discusses various challenges in the quality assurance of AI-based systems and the efforts to address them. It argues that these challenges have parallels in other fields such as engineering and the social sciences, and that a fundamentally interdisciplinary discussion is needed.

> Worked at the IBM Tokyo Research Laboratory for 26 years, conducting research in various areas of computer science, including artificial intelligence, natural language processing, machine translation, handwriting recognition, multimedia, XML, Web services, and security. Director of the IBM Tokyo Research Laboratory from 2006 to 2009. From 2011 to 2016, professor at the Institute of Statistical Mathematics, working on projects on big data, statistics, and their impact on society. Joined Preferred Networks, Inc. in April 2016 as Chief Strategy Officer. Current research topics are the practical application of machine learning, the social significance of information technology and machine learning, and computer science and statistics in general. Currently Executive Fellow at Kao Corporation, senior researcher at the Research into Artifacts, Center for Engineering, School of Engineering, the University of Tokyo, and senior advisor at Preferred Networks, Inc.

- Statistical Machine Learning
  - Model must be known in advance, and Algorithm must be constructible
  - Data-Driven, Inductive Programming
  - Pseudo Turing-complete
    - G. Cybenko. Approximations by superpositions of sigmoidal functions. Mathematics of Control, Signals, and Systems, 2(4):303–314, 1989
  - https://karpathy.medium.com/software-2-0-a64152b37c35
  - Statistical Machine Learning works only if the future is similar to the past
- Challenge: Quality Assurance
  - Cat or Dog?
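- Hard to test, and hard to debug and maintain.
- Engineering as a form of agreement between engineers and the society
  - Machine learning engineering [[MLOps]]
- [[Quality ISO9000の定義]]
- Case Studies
  - Dominguez, Gonzalo Aguirre, Keigo Kawaai, and Hiroshi Maruyama. "Quality Assurance for ML Devices A Risk-Based Approach." 2023 30th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 2023.
- Q. How do we bridge the gap between the system and the stakeholders?

```
Technical summary of the presentation on quality assurance of AI systems.

Key technical points:

1. Fundamentals of machine learning:
   - Machine learning is treated as a (stateless) function: Y = f(X)
   - X is very high-dimensional, a mix of continuous and categorical variables
   - Deep neural networks act as a pseudo Turing-complete universal computation mechanism
   - Main difference from conventional programming: data-driven, inductive programming that needs no explicit algorithm

2. Fundamental limitations of machine learning:
   - Statistical nature: valid only when future patterns match the past training data
   - Poor performance in unknown regions and rare cases
   - Inherently probabilistic: with i.i.d. sampling, 100% accuracy is not guaranteed
   - Large language models (LLMs) are characterized as probabilistic models trained on diverse language data

3. Quality assurance challenges:
   - Testing is hard because of the open-world problem
   - Deciding the correct output is complex
   - Limited explainability
   - Unpredictable behavior on change
   - Need for a new engineering discipline specific to ML systems

4. Quality assurance frameworks:
   - AIST "ML Quality Management Guideline": 3 external quality characteristics and 8 internal quality characteristics
   - QA4AI: a five-axis approach focused on system-level quality and expectation control
   - Quality measurement approaches:
     - Product metrics: direct measurement of the product
     - Process metrics: measurement of the development process
     - Performance metrics: outcomes observed by stakeholders

5. Case studies:
   a) Industrial machinery:
      - Target: parameter setting for NC machine tools
      - Framework: FMEA (Failure Mode and Effects Analysis)
      - Used for hazard identification
   b) Chemical plant operation:
      - Domain: operation of an atmospheric distillation unit
      - Framework: HAZOP (Hazard and Operability Study)
      - Implementation of the "guardrail" design pattern
   c) Virtual human generative model (VITA NAVI®):
      - High-dimensional healthcare data generation
      - Multi-stakeholder governance structure
      - Joint distribution model: P(x1, x2, ..., x2000)
      - Quality control by comparison with the empirical distribution

6. Current state and future directions:
   - Knowledge of quality assurance for AI systems is still developing
   - Quality characteristics and stakeholder considerations differ by domain
   - Need for domain-specific best practices rather than generic guidelines
   - Importance of expectation control and multi-stakeholder communication
   - Proposed path: best practices → patterns → guidelines → handbooks → regulation

This presentation stresses the need for a new engineering approach specific to ML systems, while acknowledging the fundamental limitations and challenges in the quality assurance of AI-based systems.
```

- I could not follow.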
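
The "guardrail" design pattern named in the chemical-plant case study was not shown in detail, so the following is only my own minimal sketch of the general idea: keep the ML model as a stateless function Y = f(X) and put a plain, testable rule between the model and the actuator. The names `SafeRange`, `with_guardrail`, and `toy_model`, and all numeric values, are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SafeRange:
    """Hypothetical safe operating envelope for one control variable."""
    low: float
    high: float


def with_guardrail(
    model: Callable[[float], float],
    safe: SafeRange,
    fallback: float,
) -> Callable[[float], float]:
    """Wrap a model so that outputs outside the envelope are replaced.

    The model itself is untouched; only its output is checked by a
    rule that can be reviewed and tested like conventional software.
    """
    def guarded(x: float) -> float:
        y = model(x)
        # NaN fails both comparisons, so it also ends up at the fallback.
        if safe.low <= y <= safe.high:
            return y
        return fallback
    return guarded


def toy_model(x: float) -> float:
    """Stand-in for a learned model (assumption): fine near the
    training range, but it extrapolates without any limit."""
    return 100.0 + 2.0 * x


guarded_model = with_guardrail(toy_model, SafeRange(90.0, 130.0), fallback=110.0)

print(guarded_model(5.0))   # inside the envelope, model output is used
print(guarded_model(50.0))  # outside the envelope, fallback is used
```

The point of the design is that the guardrail can be verified exhaustively even though the model cannot, which matches the note that ML output alone gives no 100% guarantee.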
## Best paper Candidates

### API2Beh: Learning Behavior Inclination of APIs for Malware Classification

Lei Cui, Yiran Zhu, Junnan Yin, Zhiyu Hao, Wei Wang, Peng Liu, Ziqi Yang and Xiaochun Yun

- Severity: 6.06 billion malware were observed (2023)
- Existing
  - Static Analysis
    - Cons: fail to handle obfuscation, packing, encryption
  - Dynamic Analysis
    - Cons: Need specific env, evasion, ...
- Research Challenges
  - High overhead
  - Poor real-world performance
  - Interpretability
- Basic Idea
  - API execution -> BI embedding
- Findings
  - Key Finding 1, 2, 3
  - 3: similar behaviors

### LiScopeLens: An Open-Source License Incompatibility Analysis Tool Based on Scope Representation of License Terms

Ziang Liu, Xin Liu, Yingli Zhang, Zihao Zhang, Song Li, Weina Niu, Qingguo Zhou, Rui Zhou and Xiaokang Zhou

- License risk
  - LLMs make this more challenging.
- Heatmap of consistent distribution of license compatibility.
- Proposes an algorithm (Algo 2) that checks license compatibility based on set theory.
- OpenHarmony 5.0 beta, 10 conflict patterns, 80% were confirmed as real
- Future work
  - Automatic extraction with LLMs

### Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers

Yuan Yuan, Tongqing Zhou, Xiuhong Tan, Yongqian Sun, Yuqi Li, Zhixing Li, Zhiping Cai and Tiejun Li

- National University of Defense Technology, Nankai Univ
- a huge amount of out-of-band alerts (IPMI alerts)
  - alert reporting
  - thousands of alerts per day
- Alert overload
  - Unlike occasional alert bursts from online services
- Challenges
  - 1: shape-based aggregation does not work
    - clustering and similarity-based
    - sensor alert line: jumpy and not smooth
  - 2: causal relationships do not work
    - sensor names are very complex and not readable
- Design
  - Observation 1: Many frequent bursts of different patterns
    - redundancy
  - Observation 2: show similar trends with a causal relationship
    - byproduct
  - motivate burst patterns
- Superagg
  - offline stage:
    - learning knowledge hierarchically from historical alerts
    - tier1: sensor-tier
      - contrastive learning
      - Time2State, a best SOTA
      - Human in the loop
    - system-tier
      - causal pattern modeling
      - Apriori method
      - Generate rule
  - online
    - Step1: strategy-based
      - silent awaiting (fake and wander patterns)
      - see&suppression (jitter patterns)
    - Step2: rule-based aggregation
- Eval
  - two datasets
    - 2 months alerts 1552k
    - Sentinel alerts 600
  - Metrics
    - aggregation rate
    - Accuracy
  - Results
    - Improvement of about 0.9% to 3.83%
    - At least 83.8% lift on dataset A, 43.2% on dataset B
    - 1350 to 9180 => 57 to 470 per day
- Conclusion
  - first work to solve the alert overload problem for supercomputers
  - aggregate the burst pattern

## Research Track 2: Anomaly Detection I

### Detection Latencies of Anomaly Detectors: An Overlooked Perspective?

Tommaso Puccetti and Andrea Ceccarelli

https://arxiv.org/abs/2402.09082

- FP and detection latency, a clear relation
  - not being monotonic, increasing the FPR would lead to a decrease in latency

### Self-Evolutionary Group-wise Log Parsing Based on Large Language Model

Changhua Pei, **Zihan Liu**, Jianhui Li, Erhan Zhang, Le Zhang, Haiming Zhang, Wei Chen, Dan Pei and Gaogang Xie

[[2024__ISSRE__Self-Evolutionary Group-wise Log Parsing Based on Large Language Mode]]

- Traditional parser
  - Frequent pattern mining: Logram
  - Heuristic-based log parser such as Drain
- LLM-based log parser
  - Log source -> Manually labeled log
  - Input log -> prompt -> LLM
  - (Few-shot learning)
- Concern 1: one-by-one parsing
  - network latency
  - real: 10000+ logs/s
  - inefficient
  - Point-wise parsing
- Solution 1: group-wise parsing
  - call the LLM by group
  - not as accurate
- Concern 2: dependent on manually annotated logs
  - Need to manually pre-label data
- Solution 2:
  - Update template
- Design goal
  - Faster parsing speed
  - Higher log parsing accuracy
- Overview of SelfLog
  - N-gram based grouper
    - Clustering
  - Log hitter
    - based on template
    - (if not hit) call LLM parser
  - Prompt structure
    - task desc + human knowledge + input logs + few shots + output format
  - Tree-based Merge
    - log template merge
    - token tree
- Performance
  - Metrics: Parsing accuracy, Precision template accuracy, ...
  - Baselines
    - LenMa, Logram, Drain, Spell, LogPPT, DivLog
  - Only the Mac dataset scores low, but the averages are 0.975 and 0.942
    - Proxifier
  - Param sensitivity
    - threshold
  - Parsing speed
    - safe zone:
    - scales up to about 10^4 logs/s
  - Llama2, Mistral, GPT-3.5
    - GPT-3.5 is the best
- Conclusion

(Question) LLM

### TimeSeriesBench: An Industrial-Grade Benchmark for Time Series Anomaly Detection Models

**Haotian Si**, Jianhui Li, Changhua Pei, Hang Cui, Jingwen Yang, Yongqian Sun, Shenglin Zhang, Jingjing Li, Haiming Zhang, Jing Han, Dan Pei and Gaogang Xie

[[2024__ISSRE__TimeSeriesBench - An Industrial-Grade Benchmark for Time Series Anomaly Detection Models]]

- why
  - AD pipeline
    - Offline: unaffordable storage & maintenance cost
    - online: no performance reference on brand new services
    - eval: metrics are detached from practical demands
- anomaly weights revision
- AR is the best
  - simple structure

https://adeval.cstcloud.cn/content/leaderboard

### Detecting Numerical Deviations in Deep Learning Models Introduced by the TVM Compiler

Xia Zichao, Chen Yuting, Nie Pengbo and Wang Zihan

https://github.com/apache/tvm

## Research Track 4: Anomaly Detection II

### LLMeLog: An Approach for Anomaly Detection based on LLM-enriched Log Events

Minghua He, Tong Jia, Chiming Duan, Huaqian Cai, Ying Li and Gang Huang

[[2024__ISSRE__LLMeLog - An Approach for Anomaly Detection based on LLM-enriched Log Events]]

### LogCAE: An Approach for Log-based Anomaly Detection with Active Learning and Contrastive Learning

Pei Xiao, Tong Jia, Chiming Duan, Huaqian Cai, Ying Li and Gang Huang

[[2024__ISSRE__LogCAE - An Approach for Log-based Anomaly Detection with Active Learning and Contrastive Learning]]

### VCRLog: Variable Contents Relationship Perception for Log-based Anomaly Detection

Jinyuan Wang, Tong Li, Runzi Zhang, Zifang Tang, Di Wu and Zhen Yang

[[2024__ISSRE__VCRLog - Variable Contents Relationship Perception for Log-based Anomaly Detection]]

### Leveraging RAG-Enhanced Large Language Model for Semi-Supervised Log Anomaly Detection

Wanhao Zhang, Qianli Zhang, Enyu Yu, Yuxiang Ren, Yeqing Meng, Mingxi Qiu and Jilong Wang

# Day3

## KEYNOTE TALK 2 Quantum Circuit Compilation and Compression

Kae Nemoto

> Abstract: Over the past few years, the technologies needed to realize quantum computers have developed rapidly. The number of qubits has exceeded 1,000, and the field is moving beyond the physical-qubit regime by building logical qubits. Logical qubits can be implemented fault-tolerantly, and it is believed that noise on logical qubits can be kept under control until the end of an algorithm. This talk outlines the difference between logical and physical qubits before introducing the fault-tolerant quantum computer (FTQC) technology stack. In the FTQC technology stack, the quantum computer architecture sits in the middle and connects the technology layers below (hardware) with those above (middleware/software). In fault-tolerant quantum computation, compressing gate circuits is important for speeding up fault-tolerant quantum computers. Compilation and representation (language) of gate circuits are therefore needed to reduce circuit depth safely. This talk explains how these three elements work together in a fault-tolerant quantum computer.

> Bio: Professor at the Okinawa Institute of Science and Technology Graduate University and director of the OIST Center for Quantum Technologies. Professor at the National Institute of Informatics (NII), director of the Global Research Center for Quantum Information Science, and co-director of the Japanese-French Laboratory for Informatics (JFLI). Research covers quantum computer applications, quantum machine learning, quantum computer architecture, quantum middleware, quantum networks, the quantum internet, and complex systems. Also leads the academic education consortium "Quantum Academy for Science and Technology", which provides high-quality lectures and teaching materials for undergraduate and graduate students in the field. Fellow of the IoP (UK) and the APS (US).

- Quantum computer development
  - ![[IMG_3200.jpg]]

## Industry Track 1: Best Industry Paper Candidates

### Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction

Yudong Liu, Minghua Ma, Pu Zhao, Tianci Li, Bo Qiao, Shuo Li, **Ze Li**, Murali Chintalapati, Yingnong Dang, Chetan Bansal, Saravan Rajmohan, Qingwei Lin and Dongmei Zhang

Microsoft Asia

[[2024__ISSRE__Early Bird - Ensuring Reliability of Cloud Systems Through Early Failure Prediction]]

Thank you for your great presentation. Your evaluation focuses only on the F1-score, but false positives are also important because the engineers' effort spent on alert response should be reduced. My question is: how many false positives are there? How sensitive is the method to the ahead interval? My feeling is that increasing the ahead interval leads to a decrease in false positives.
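
To make the concern behind this question concrete for myself, here is a small sketch showing that two failure predictors can share the same F1-score while producing very different numbers of false alerts. The confusion counts are made up for illustration and are not from the Early Bird paper; `alert_metrics` is a hypothetical helper.

```python
def alert_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from raw confusion counts.

    F1 is computed as 2*TP / (2*TP + FP + FN), which is algebraically
    the harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "false_alerts": fp}


# Illustrative counts only (assumption): both predictors face 100 real failures.
balanced = alert_metrics(tp=80, fp=20, fn=20)
aggressive = alert_metrics(tp=96, fp=44, fn=4)

for name, m in (("balanced", balanced), ("aggressive", aggressive)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Both rows report an F1 of 0.8, yet the second predictor raises more than twice as many false alerts, which is why I would like to see the false positive count next to the F1-score.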
### An Exploration of Fuzzing for Discovering Use-After-Free Vulnerabilities

Zeyu Chen, Jidong Xiao, Angelos Stavrou and Haining Wang

### Auto-PIP: Real-time Identification of Critical Performance Inflection Points in Software Stress Testing

Shenglin Zhang, Xiao Xiong, Mengyao Li, Yongqian Sun, Yongxin Zhao, Xia Chen, Bowen Deng and Dan Pei

[[2024__ISSRE__Auto-PIP - Real-time Identification of Critical Performance Inflection Points in Software Stress Testing]]

Thank you for your great presentation. I liked that your method is designed around several traditional statistical approaches rather than deep learning. But I cannot understand the real scenario for automatically identifying inflection points. Why not identify them manually?

## Test-of-Time Award Session

Predicting Vulnerable Components: Software Metrics vs Text Mining, ISSRE14

https://ieeexplore.ieee.org/document/6982351

Building secure software is difficult, time-consuming, and expensive. Prediction models that identify vulnerability-prone software components can help focus security efforts and reduce the time and effort needed to secure software. Several kinds of vulnerability prediction models have been proposed over the past decade. However, these models were evaluated with different methodologies and datasets, which makes it hard to judge the relative strengths and weaknesses of the modeling techniques. To address this, the paper provides a high-quality public dataset containing 223 vulnerabilities found in three web applications. Using this dataset, it compares vulnerability prediction models based on text mining with models that use software metrics as predictors. The text mining models showed higher recall than the software-metrics models for all three applications.

- Vulnerability prediction system (VPS)
  - In 2014, VPS were rare and new.
    - so, we had to explain the need
    - 6 reasons, fewer vulnerabilities, fewer projects track vulnerabilities
  - studies of VPS are increasing
- Contributions
  - Demonstrated NLP could work better than software metrics.
  - A novel dataset.
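- Missing
  - Exploration of different model types
  - hyperparameter tuning

## Keynote talk 3: Generative AI Applications and Trustworthiness

Abstract: This talk presents examples of GAI applications in Taiwan. However, applications that use GAI technology can raise several trustworthiness issues. The talk describes these issues and the techniques we are developing to make GAI applications more trustworthy.

Bio: Minister for digital affairs of Taiwan's Executive Yuan. Received a Ph.D. in computer science from the University of Maryland. Joined AT&T Bell Labs in 1989. His research on Software Implemented Fault Tolerance (SwiFT) tools was applied to dozens of AT&T communication systems and was selected as one of Bell Labs' top ten technology breakthroughs in 1992. Became a Distinguished Member of Technical Staff at Bell Labs in 1996. In 1999, founded the Dependable Computing Research department at AT&T and served as its head, responsible for ensuring high dependability of all AT&T services. In 2004, became Executive Director of the Dependable Distributed Computing and Communication Research Department, directing AT&T's dependability research program. In 2007, became vice president of the Institute for Information Industry, Taiwan. From 2010 to 2015, also served as deputy executive secretary of the Science and Technology Advisory Group of the Executive Yuan, supporting the Taiwanese government's ICT development policy and funding allocation. From 2015 to 2024, director of the Research Center for Information Technology Innovation (CITI), Academia Sinica, Taiwan. Dr. Huang holds more than 20 US patents and has published more than 150 papers in leading journals and conferences. His software rejuvenation paper received the Jean-Claude Laprie Award in 2019. IEEE Fellow.

- Duolingo increased developer speed with Copilot

## Huawei Session

https://issre.github.io/2024/ss_program_huawei.html

Reliability Challenges and Progress for Huawei Cloud in AI era

Speaker: Zhenli Sheng

Bio: Dr. Zhenli Sheng joined Huawei in 2015 after completing his Ph.D. He currently serves as the Director of the Cloud Availability Engineering Lab, where he leads efforts in technical innovation and the application of reliability and availability solutions for Huawei Cloud. Dr. Sheng has authored over 10 papers in top-tier journals and holds approximately 15 patents.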
> Huawei Cloud is one of the world's leading cloud computing providers, offering about 200 services and operating nearly one million servers. Along the way, we have faced a variety of reliability challenges related to both hardware and software, some of which have become especially important in the AI era. This talk explores the main reliability issues in AI clusters and shows how we address these risks through hardware failure prediction, silent data corruption detection, and resilience during large language model (LLM) training. Finally, it presents some remaining challenges that still need attention.

- Huawei Cloud
  - 240+ cloud services, 6+ million developers
- reliability challenges
  - Hardware Fault
  - Silent data corruption
  - LLM training
- Hardware Failure
  - memory devices
    - DRAM speed and density increase <-> decrease reliability
  - How does memory failure
  - How to predict memory failure and reduce VM interruptions with the fetched error logs.
  - Memory failures are complex and difficult to predict
    - lack of fine-grained telemetry.
    - heterogeneous data sources.
    - Imbalanced and evolving data.
  - Bit-level failure characterization and prediction across CPU arch
  - HBM exhibits different error patterns from conventional DRAM
  - Summary
    - Hardware fault analysis
    - MLOps-enabled hardware failure prediction system
  - HPCA 2025
    - Hardware [Hw\_RAS 2025](https://hyxie2023.github.io/Hw_RAS-2025.github.io/)
- Silent Data Corruption in AI system
  - SDC is difficult to locate
  - NVidia proposes redundant computation
  - Google analyzed the propagation
    - [Understanding and Mitigating Hardware Failures in Deep Learning Training Systems | Proceedings of the 50th Annual International Symposium on Computer Architecture](https://dl.acm.org/doi/abs/10.1145/3579371.3589105)
  - SDC in AI cluster
- Resilient training
  - An alternative approach is to continue training with the available subset
  - DLRover
    - [dlrover/docs/tutorial/torch\_elasticjob\_on\_k8s.md at master · intelligent-machine-learning/dlrover · GitHub](https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/tutorial/torch_elasticjob_on_k8s.md)
  - [ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation](https://arxiv.org/html/2405.14009v2)
  - [Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates | Proceedings of the 29th Symposium on Operating Systems Principles](https://dl.acm.org/doi/10.1145/3600006.3613152)
- cfp resilient training for AI foundation model

## Research Track 6: Tools and Artifacts

### LabelEase: A Semi-Automatic Tool for Efficient and Accurate Trace Labeling in Microservices

Shenglin Zhang, Zeyu Che, Zhongjie Pan, Xiaohui Nie, Yongqian Sun, Lemeng Pan and Dan Pei

[[2024__ISSRE__LabelEase - A Semi-Automatic Tool for Efficient and Accurate Trace Labeling in Microservices]]

## DOCTORAL SYMPOSIUM AND PANEL

Panel: Generative AI and a New Academic Reality: Challenges and Opportunities for PhD Students

### Search-Based White-Box Fuzzing of Web Frontend Applications

Iva Kertusha

- automatic execution vs automatic generation
- white-box: examine internal logic and structure of the code
- Crawljax
- Evomaster
  - [GitHub - WebFuzzing/EvoMaster: The first open-source AI-driven tool for automatically generating system-level test cases (also known as fuzzing) for web/enterprise applications. Currently targeting whitebox and blackbox testing of Web APIs, like REST, GraphQL and RPC (e.g., gRPC and Thrift).](https://github.com/WebFuzzing/EvoMaster)
- Systematic literature review
  - 158 papers
  - 14 research questions
  - SBST with a focus on backend
    - information depends on server response
  - SBST with a focus on frontend
  - ours: SBST + whitebox
- Extend Evomaster
- Case studies
- PhD plan
  - 2024 March ~
  - working on SLR

### Reliable Online Log Parsing Using Large Language Models with Retrieval-Augmented Generation

Hansae Ju

- LLM-based limitations
  - template unseen in training or prompts
  - high computation cost
- Semantic Log Parsing
- RAGLogParser
  - similar examples from a DB
  - Semantic template labeling

### Panel: Generative AI and a New Academic Reality: Challenges and Opportunities for PhD Students

# Day 4

## Keynote Talk 4: Software Reliability in the Era of Large Language Models: A Dual Perspective

> Abstract: Much software engineering research has been devoted to building reliable software systems. Over the past two decades, the growing availability of software engineering data has spurred many AI-driven automation solutions. The last few years have seen rapid growth in specialized solutions based on large language models ([[LLM]]) that assist software engineers with many tasks, including improving software reliability. However, LLMs bring unique challenges, and there are new reliability concerns to manage. This highlights two compelling and complementary research trajectories: large language models for software reliability (LLM4SR) and software reliability for large language models (SR4LLM). The talk presents promising LLM4SR solutions, focusing on vulnerability repair and runtime error recovery. It then discusses reliability issues that affect LLMs and preliminary solutions for managing them, stressing that SR4LLM needs much more research. The talk closes with a discussion of future directions: how SR and LLMs can change software engineering going forward.

> Bio: OUB Chair Professor of Computer Science at Singapore Management University and director of the Center for Research on Intelligent Software Engineering (RISE). Since the mid-2000s he has advanced the field of AI for software engineering (AI4SE), demonstrating that AI, including data mining, machine learning, information retrieval, natural language processing, and search-based algorithms, can turn software engineering data into automation and insights. His contributions have received more than 20 awards, including two Test-of-Time awards, an award for work at ISSRE 2012, and ten ACM SIGSOFT / IEEE TCSE Distinguished Paper awards, and have gathered more than 30,000 citations. An ACM Fellow, IEEE Fellow, ASE Fellow, and National Research Foundation Investigator (Senior Fellow), Lo has also served as PC co-chair of ASE'20, FSE'24, and ICSE'25. Details: http://www.mysmu.edu/faculty/davidlo/

- Singapore's third university
- RISE
- AI for Software Engineering
  - SMArTIC FSE'06
  - Efficient mining of iterative patterns
    - KDD'09
  - ICSE'10
  - MSR'13
  - SANER'16
- Roadmap Software engineering
- LLMs
  - ICSME 2020
    - Sentiment Analysis
  - ICSE 2024
    - Out of Sight, out of mind
  - TOSEM 2024
    - LLM for software engineering: A Systematic Literature
- Many Open Problems
  - Robustness, Security, Privacy, ...
- Binary star
  - LLM4SRE
  - SR4LLM
- RQ
  - How can we effectively leverage these additional inputs?
  - multi-LLM collaboration
- VulMaster
  - Diverse inputs into multi-LLM
- Future
  - complex vulnerabilities
  - large code contexts
  - trust and synergy
- Self-Healing Software System
  - They are rather rigid
  - Efficacy
- Future of Software Engineering. ICSE 2023
  - Vision 2033.
- FORGE24
  - New ACM conference

## Research Track 10: Root Cause Analysis and Program Repair

### SparseRCA: Efficient Root Cause Analysis in Sparse Microservice Testing Trace

Zhenhe Yao, Haowei Ye, Changhua Pei, Guang Cheng, Guangpei Wang, Zhiwei Liu, Hongwei Chen, Hang Cui, Zeyan Li, Jianhui Li, Gaogang Xie and Dan Pei

[[2024__ISSRE__SparseRCA - Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces]]

### KPIRoot: Efficient Monitoring Metric-based Root Cause Localization in Large-scale Cloud Systems

Wenwei Gu, Xinying Sun, Jinyang Liu, Yintong Huo, Zhuangbin Chen, Jianping Zhang, Jiazhen Gu, Yongqiang Yang and Michael Lyu

[[2024__ISSRE__KPIRoot - Efficient Monitoring Metric-based Root Cause Localization in Large-scale Cloud Systems]]

- Alert KPIs, VM KPIs
- Migration: VM migration throttling
- Thank you for your interesting presentation.
  - How sensitive is your method to the two parameters: the sampling interval of the time series and the time window size?
  - How many time series are in your experiment?
- Practical requirements
  - Pearson correlation: high computational cost
  - DL: high computation cost and lack of interpretability.
- SAX, Granger Causality and Jaccard similarity

### FaaSRCA: Full Lifecycle Root Cause Analysis for Serverless Applications

### RATCHET: Retrieval Augmented Transformer for Program Repair

## Research Track 12: Fault Monitoring, Prediction and Diagnosis

### DRLFailureMonitor: A Dynamic Failure Monitoring Approach for Deep Reinforcement Learning Systems

Cai Yi, Zheng Zheng, Wan Xiaohui and Liu Zhihao

[[2024__ISSRE__DRLFailureMonitor - A Dynamic Failure Monitoring Approach for Deep Reinforcement Learning System]]

- Existing work
  - Testing: via search or fuzzing
  - limit
    - offline,
    - Lack of Generalization
- How to monitor
  - x in a single state
  - :check: sequential decision-making

### Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning

Haozhe Li, Minghua Ma, Yudong Liu, Pu Zhao, Shuo Li, Lingling Zheng, Ze Li, Murali Chintalapati, Yingnong Dang, Chetan Bansal, Saravan Rajmohan, Qingwei Lin and Dongmei Zhang

[[2024__ISSRE__Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning]]

### Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis

[[2024__ISSRE__Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis]]

### Large Language Models Can Provide Accurate and Interpretable Incident Triage

[[2024__ISSRE__Large Language Models Can Provide Accurate and Interpretable Incident Triage]]

## Research Track 14: Performance and Reliability Analysis and Prediction

### Understanding Atomics and Memory Ordering Issues in Real-World Rust Software

Cheng Wang, Tengfei Tu, Sujuan Qin, Guangjun Wu, Fei Gao and Mingchao Wan

### A Compositional Approach to Coordinated Software Rejuvenation of Component-Based Systems

Tommaso Botarelli, Laura Carnevali, Leonardo Paroli and Enrico Vicario

### Feedback-Directed Cross-Layer Optimization of Cloud-Based Functional Actor Applications

Andrea Cappelletti and Mark Grechanik

### Exact Computation of Network Reliability with Sentential Decision Diagram

Delong Li, Jiayu Zeng, Liangda Fang, Chaonan Wang, Lin Cui and Quanlong Guan

## Closing

- DSN 2025 Italy