ContraLog: Log File Anomaly Detection with Contrastive Learning and Masked Language Modeling
Simon Dietz, Kai Klede, An Nguyen, Bjoern M Eskofier

TL;DR
ContraLog introduces a novel parser-free, self-supervised approach for log anomaly detection that leverages contrastive learning and masked language modeling to generate meaningful message embeddings, improving detection accuracy on complex datasets.
Contribution
The paper presents ContraLog, a method that predicts continuous message embeddings without relying on log parsers, combining masked language modeling and contrastive learning for improved anomaly detection.
Findings
Effective on diverse benchmark datasets
Message embeddings are informative for anomaly prediction
Embedding-level prediction offers a new approach for log analysis
Abstract
Log files record computational events that reflect system state and behavior, making them a primary source of operational insights in modern computer systems. Automated anomaly detection on logs is therefore critical, yet most established methods rely on log parsers that collapse messages into discrete templates, discarding variable values and semantic content. We propose ContraLog, a parser-free and self-supervised method that reframes log anomaly detection as predicting continuous message embeddings rather than discrete template IDs. ContraLog combines a message encoder that produces rich embeddings for individual log messages with a sequence encoder to model temporal dependencies within sequences. The model is trained with a combination of masked language modeling and contrastive learning to predict masked message embeddings based on the surrounding context. Experiments on the HDFS,…
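As a rough illustration of the training objective the abstract describes, the sketch below computes an InfoNCE-style contrastive loss between predicted and actual embeddings of masked messages. It is a minimal NumPy sketch, not the authors' implementation: the function names, the temperature value, and the cosine-distance anomaly score are all assumptions made for illustration.

```python
import numpy as np


def _l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def info_nce_loss(predicted, targets, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    predicted, targets: arrays of shape (n_masked, dim). Row i of `predicted`
    should match row i of `targets`; all other rows act as negatives.
    """
    p = _l2_normalize(predicted)
    t = _l2_normalize(targets)
    logits = p @ t.T / temperature
    # Subtract the row-wise max for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct pairing sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def anomaly_scores(predicted, targets):
    """Hypothetical score: cosine distance between prediction and actual embedding."""
    p = _l2_normalize(predicted)
    t = _l2_normalize(targets)
    return 1.0 - np.sum(p * t, axis=1)
```

Under these assumptions, a masked message whose actual embedding lies far from the embedding predicted from its context receives a high score, which is the intuition behind predicting embeddings rather than template IDs.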
Peer Reviews
Decision: Submitted to ICLR 2026
1. The novelty of the proposed method is good. The core idea of this research is reframing log anomaly detection from predicting discrete tokens to predicting continuous embeddings of raw messages, which distinguishes ContraLog from existing approaches. The combination of masked language modeling and contrastive learning is also novel and effective. 2. The paper is clearly written and well structured. The research motivation is well explained in the introduction and related work section. The de
1. The description of the masked language modeling part is not very clear. Please see the questions below. The sequence encoder in Section 3.1 appears to take all the message representations as input, but only the masked messages are used for computing the contrastive loss. 2. The experiment section in the main text is very short and contains only a single group of results. I think an ablation study about the essential parts of the proposed method like message masking and contra
1. The central concept of predicting continuous embeddings instead of discrete log keys is a strong contribution. This directly addresses the information loss problem inherent in parser-based methods, which often discard important details contained in variable parameters. 2. The dual-pronged anomaly scoring mechanism is a practical and well-thought-out design. The point anomaly score provides a safety net for cases where contextual information is weak or misleading, such as a sequence of identic
1. The "parser-free" claim feels a bit overstated. The method uses a custom Byte-Pair Encoding (BPE) tokenizer trained on each specific dataset. This is still a data-dependent preprocessing step that learns the statistical patterns of the log text, which is conceptually not so different from what a parser aims to achieve, albeit at a different level of granularity. "Template-free" might be a more accurate descriptor. 2. The training process seems to require a fair amount of dataset-specific tuni
W1. The core idea of shifting from discrete template prediction to continuous embedding prediction is a significant strength. This directly addresses the well-known limitations of parser-based methods, namely the loss of information from variable parameters and the inability to capture semantic similarity between different templates. W2. Comparing actual and predicted values is a common approach in time-series anomaly detection. This work innovatively applies this method to anomaly detection in
S1. In Table 2, the F1-score calculations for the ContraLog model across the three datasets appear to be incorrect. For example, on the HDFS dataset, the precision is 85.52 and the recall is 83.58, yielding an F1-score of 84.54, whereas Table 2 reports 83.35. Similar discrepancies are found for the BGL and Thunderbird datasets. The authors should provide an explanation for these differences. S2. The Introduction section states, “Parsing Errors: Parsers require dataset-specific rules and frequent
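The F1 discrepancy raised in S1 can be checked directly, assuming the standard definition of F1 as the harmonic mean of precision and recall. The helper name below is illustrative; the HDFS precision and recall values are the ones quoted in the review.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall (both on the same scale, here percent).
    return 2 * precision * recall / (precision + recall)


# HDFS values quoted in the review: precision 85.52, recall 83.58.
hdfs_f1 = f1_score(85.52, 83.58)
print(round(hdfs_f1, 2))  # 84.54, versus 83.35 reported in Table 2
```

One possible explanation, which the listing does not confirm, is that the reported metrics are averaged over several runs: the mean of per-run F1 scores generally differs from the F1 computed from mean precision and mean recall.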
Taxonomy
Topics: Software System Performance and Reliability · Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications
