TVDiag: A Task-oriented and View-invariant Failure Diagnosis Framework with Multimodal Data

Shuaiyu Xie; Jian Wang; Hanbin He; Zhihao Wang; Yuqi Zhao; Neng Zhang; Bing Li

arXiv:2407.19711 · cs.SE · March 24, 2025 · 1 citation

Open Access · 1 repository

TL;DR

TVDiag is a novel multimodal failure diagnosis framework for microservice systems that uses task-oriented and view-invariant learning to improve accuracy in locating failures and identifying failure types.

Contribution

It introduces a task-oriented, view-invariant multimodal failure diagnosis framework with contrastive learning and graph data augmentation, addressing limitations of previous methods.

Findings

01. Outperforms state-of-the-art methods in failure diagnosis accuracy.

02. Achieves at least 55.94% higher HR@1 accuracy.

03. Increases F1-score by over 4.08%.
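
The findings above report HR@1 without defining it. As a minimal sketch, assuming the common definition of hit ratio at k (the fraction of failure cases whose true root cause appears among the top-k ranked candidates), the metric can be computed as follows. The function name, the service names, and the sample rankings are hypothetical illustrations, not taken from the TVDiag paper or its repository.

```python
from typing import Sequence


def hit_ratio_at_k(
    ranked_candidates: Sequence[Sequence[str]],
    true_root_causes: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of cases whose true root cause is in the top-k ranked candidates.

    Assumed definition of HR@k; each inner sequence is ordered from most to
    least suspicious.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_candidates) != len(true_root_causes):
        raise ValueError("need one ranking per failure case")
    if not true_root_causes:
        return 0.0
    hits = sum(
        1
        for ranking, truth in zip(ranked_candidates, true_root_causes)
        if truth in ranking[:k]
    )
    return hits / len(true_root_causes)


# Hypothetical rankings for four failure cases in a three-service system.
rankings = [
    ["cart", "db", "web"],
    ["db", "cart", "web"],
    ["web", "db", "cart"],
    ["cart", "web", "db"],
]
truths = ["cart", "cart", "web", "db"]

hr1 = hit_ratio_at_k(rankings, truths, k=1)
hr2 = hit_ratio_at_k(rankings, truths, k=2)
hr3 = hit_ratio_at_k(rankings, truths, k=3)
```

Under this definition HR@1 is the strictest setting: a case counts as a hit only when the top-ranked candidate is the true root cause.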

Abstract

Microservice-based systems often suffer from reliability issues due to their intricate interactions and expanding scale. With the rapid growth of observability techniques, various methods have been proposed to achieve failure diagnosis, including root cause localization and failure type identification, by leveraging diverse monitoring data such as logs, metrics, or traces. However, traditional failure diagnosis methods that use single-modal data can hardly cover all failure scenarios due to the restricted information. Several failure diagnosis methods have been recently proposed to integrate multimodal data based on deep learning. These methods, however, tend to combine modalities indiscriminately and treat them equally in failure diagnosis, ignoring the relationship between specific modalities and different diagnostic tasks. This oversight hinders the effective utilization of the…

Code & Models

Repositories

whu-aise/tvdiag (PyTorch, official)

Taxonomy

Topics: Software System Performance and Reliability