LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis

Lecheng Zheng; Zhengzhang Chen; Dongjie Wang; Chengyuan Deng; Reon Matsuoka; Haifeng Chen

arXiv:2406.05375·cs.AI·May 20, 2025·2 cites

Open Access · 4 Reviews

TL;DR

LEMMA-RCA is a comprehensive, large-scale multi-modal dataset for root cause analysis across diverse real-world systems, facilitating research and development in this critical field.

Contribution

We introduce LEMMA-RCA, the first large, open-source dataset for multi-domain, multi-modal root cause analysis, enabling new research opportunities.

Findings

01

Eight baseline methods tested, showing dataset quality

02

High performance across offline and online modes

03

Effective across multiple modalities

Abstract

Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap, we introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. LEMMA-RCA features various real-world fault scenarios from IT and OT operation systems, encompassing microservices, water distribution, and water treatment systems, with hundreds of system entities involved. We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset under various settings, including offline and online modes as well as single and multiple modalities. Our experimental results demonstrate the high quality of LEMMA-RCA. The dataset is publicly available at…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. Releases a useful data resource for the community on root-cause analysis, which is distinguished from existing datasets by being multi-modal (textual logs and time series) and real. 2. Evaluates six existing methods on the released datasets.

Weaknesses

1. It is difficult to justify an ICLR paper solely on the basis of releasing a dataset. There are conferences with special tracks on datasets and benchmarking, and the paper would be better submitted to such a track. 2. Most of the methods evaluated are not mainstream to the AI/ML/DL community, so relevance to ICLR is in question. Here are some papers that are missed: 2a: Root cause analysis of outliers with missing structural knowledge, N Okati, SHG Mejia, WR Orchard, P Blöbaum… - NeurIPS 2025

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper successfully identifies and fills a major gap in the RCA research landscape by providing a large-scale, public benchmark. 2. The dataset contains real system faults (or realistic induced faults) across different domains (IT and OT), which is a major step up from datasets with purely synthetic or simplistic faults. 3. The inclusion of both time-series metrics and textual logs allows for the development and evaluation of multi-modal RCA methods. The data is structured to support both

Weaknesses

1. The evaluated baselines are primarily traditional causal discovery or statistical methods. Given the recent surge of interest in using LLMs for diagnostics and RCA, the absence of any LLM-based baseline is a notable omission. Including even a simple zero-shot LLM baseline would have provided a valuable modern reference point. 2. The feature extraction pipeline for logs is quite specific and multi-faceted (combining template frequency, keyword signals, and TF-IDF). This introduces a potential

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper constructs a multimodal RCA dataset, which is meaningful given the data scarcity in this field.
- The proposed dataset is collected from multiple systems (IT + OT) and supports both offline and online RCA evaluation.
- The experimental section compares multiple baselines in both online and offline settings.

Weaknesses

- The microservice failures are injected, while the paper states that the dataset contains "real faults". In Appendix D, the authors describe in detail the steps used to generate failures in microservices systems. If my understanding is correct, this indicates that the faults in the microservices are artificially injected rather than collected from real-world cases. However, in Table 1, the authors state that their dataset contains real faults. These two statements appear contradictory to me. Th

Reviewer 04 · Rating 4 · Confidence 2

Strengths

1. The paper makes a strong case that RCA lacks large, open, realistic datasets across domains and modalities, then directly addresses this gap with IT and OT data at second-level granularity and millions of log events. 2. The dataset enables metric-only, log-only, and multi-modal settings, and provides a concrete online protocol with streaming snapshots. This is timely because most RCA works are offline and single-modal. 3. Six public baselines are run with fixed hyperparameters. Results show

Weaknesses

1. Many IT scenarios are induced on in-house platforms. OT segments are standardized to two-hour windows and may concatenate normal data around attacks. These choices aid benchmarking but can shift distributions and simplify temporal context, which could bias methods tuned to the benchmark. 2. Root cause labels are described at the entity level, but the paper does not detail labeling procedures, annotator reliability, or ambiguity handling when multiple entities co-cause failures. 3. Using d

Taxonomy

Topics: Seismology and Earthquake Studies · Drilling and Well Engineering