Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis
Junjie Huang, Zhihan Jiang, Jinyang Liu, Yintong Huo, Jiazhen Gu, Zhuangbin Chen, Cong Feng, Hui Dong, Zengyin Yang, Michael R. Lyu

TL;DR
This paper introduces LoFI, a novel approach that automatically extracts fault-indicating information from logs to improve failure diagnosis in online service systems, outperforming existing methods.
Contribution
LoFI is a new two-stage method combining semantic filtering and prompt-based language model tuning for extracting fault-related log information.
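To make the two-stage shape concrete, here is a minimal sketch in Python. It is not LoFI's implementation: plain keyword matching stands in for the semantic filtering stage, and regular expressions stand in for the prompt-tuned language model. The names `FAULT_HINTS`, `PARAM_PATTERN`, `FaultInfo`, `coarse_filter`, `extract`, and `pipeline` are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumption: a hand-written keyword list stands in for a learned semantic filter.
FAULT_HINTS = ("fail", "error", "timeout", "refused", "unreachable", "exception")

# Assumption: a few regexes stand in for model-based parameter extraction.
PARAM_PATTERN = re.compile(
    r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"  # IPv4 address with optional port
    r"|/[\w./-]+"                             # file-system path
    r"|\b0x[0-9a-fA-F]+\b"                    # hexadecimal identifier
)


@dataclass
class FaultInfo:
    description: str       # the abnormal event, with parameters masked as <*>
    parameters: List[str]  # the entities the event refers to


def coarse_filter(lines: List[str]) -> List[str]:
    """Stage 1 stand-in: keep only lines that mention a fault hint."""
    return [ln for ln in lines if any(h in ln.lower() for h in FAULT_HINTS)]


def extract(line: str) -> FaultInfo:
    """Stage 2 stand-in: split one line into a description and its parameters."""
    params = PARAM_PATTERN.findall(line)
    # Mask each parameter, then collapse any repeated whitespace.
    description = " ".join(PARAM_PATTERN.sub("<*>", line).split())
    return FaultInfo(description, params)


def pipeline(lines: List[str]) -> List[FaultInfo]:
    """Run the filter, then extract from every surviving line."""
    return [extract(ln) for ln in coarse_filter(lines)]
```

Calling `pipeline` on a list of raw log lines returns one `FaultInfo` per line that survives the filter. The point of the sketch is only the separation of concerns: a cheap first pass narrows the log volume, and a second pass separates what happened from which entities were involved.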
Findings
LoFI achieves an F1 score 25.8 to 37.9 higher than the baselines.
LoFI effectively identifies fault-indicating descriptions and parameters.
Deployment and user studies confirm LoFI's practical utility.
Abstract
Logs are imperative in the maintenance of online service systems, which often encompass important information for effective failure mitigation. While existing anomaly detection methodologies facilitate the identification of anomalous logs within extensive runtime data, manual investigation of log messages by engineers remains essential to comprehend faults, which is labor-intensive and error-prone. Upon examining the log-based troubleshooting practices at CloudA, we find that engineers typically prioritize two categories of log information for diagnosis. These include fault-indicating descriptions, which record abnormal system events, and fault-indicating parameters, which specify the associated entities. Motivated by this finding, we propose an approach to automatically extract such fault-indicating information from logs for fault diagnosis, named LoFI. LoFI comprises two key stages. In…
Taxonomy
Topics: Fault Detection and Control Systems · Software System Performance and Reliability · Machine Fault Diagnosis Techniques
