REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment
Priyanka Mudgal

TL;DR
REFLEX is a novel reference-free evaluation metric for log summarization that leverages large language models to assess quality without needing reference summaries.
Contribution
It introduces a scalable, LLM-based evaluation method that outperforms traditional metrics and does not require gold-standard references or human annotations.
Findings
REFLEX provides stable and interpretable evaluations.
It more effectively distinguishes between different model outputs.
It works well across multiple log summarization datasets.
Abstract
Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization dataset, and more effectively distinguishes model outputs than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
