When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation
Chaitanya Vilas Garware and Sharif Noor Zisad

TL;DR
This paper shows that the choice of output parser can dominate the measured accuracy of LLM-based security log classifiers: on the same model outputs, a fuzzy parser recovered 76% threat accuracy where a strict regex parser reported 0%, pointing to a flaw in evaluation methodology rather than in the model.
Contribution
It identifies parsing-induced suppression as a systematic evaluation error and introduces SOC-Bench v0, a benchmark framework to standardize threat classification and improve evaluation reliability.
Findings
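A strict regex parser reported 0% threat accuracy; a fuzzy parser recovered 76% on the same model outputs and evaluation set.
Severity accuracy remained at 58% under both parsers, acting as a built-in control that points to field name format mismatch, not model behavior, as the source of the gap.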
Residual errors mainly involved reconnaissance, brute force, and credential stuffing logs.
Abstract
LLM-based SOC log classifiers are commonly evaluated using regular-expression pipelines that extract structured fields from free-form model output. We demonstrate that this practice introduces a class of silent, systematic evaluation errors, which we term parsing-induced suppression, that can cause a fully functional model to appear completely non-functional. Using OpenSOC-AI, a LoRA fine-tuned TinyLlama-1.1B system for security log threat classification, as a reproducible case study, we show that a strict regex parser reported 0% threat accuracy while a corrected fuzzy parser recovered 76% threat accuracy on the same model outputs and the same evaluation set: a gap of 76 percentage points attributable entirely to evaluation methodology. Severity accuracy remained constant at 58% under both parsers, providing a built-in control that isolates field name format mismatch as the causal…
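To make the failure mode concrete, here is a minimal sketch of how a strict parser and a more tolerant one can disagree on identical model output. It is not the paper's parser or evaluation code: the field names `threat_type` and `severity`, the helper names, the matching cutoff, and the two toy outputs are all assumptions chosen for illustration, and the fuzzy matching uses Python's standard `difflib` rather than whatever the authors used.

```python
import re
from difflib import get_close_matches

# Hypothetical canonical field names; the paper's actual schema may differ.
FIELDS = ("threat_type", "severity")

# Strict parser: one case-sensitive pattern per exact field name.
STRICT = {f: re.compile(rf"^{f}:\s*(.+)$", re.MULTILINE) for f in FIELDS}


def strict_parse(text):
    """Extract a field only when its name appears exactly as expected."""
    out = {}
    for field, pattern in STRICT.items():
        m = pattern.search(text)
        if m:
            out[field] = m.group(1).strip().lower()
    return out


def normalize(key):
    """Lowercase a key and collapse non-letter runs into underscores."""
    return re.sub(r"[^a-z]+", "_", key.lower()).strip("_")


def fuzzy_parse(text, cutoff=0.6):
    """Map each 'key: value' line to the closest canonical field name."""
    out = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        match = get_close_matches(normalize(key), FIELDS, n=1, cutoff=cutoff)
        if match and match[0] not in out:
            out[match[0]] = value.strip().lower()
    return out


def accuracy(outputs, labels, parser, field):
    """Fraction of outputs whose parsed field equals the gold label."""
    hits = sum(parser(o).get(field) == gold[field]
               for o, gold in zip(outputs, labels))
    return hits / len(labels)


# Toy data invented for this sketch: the model is right both times, but it
# writes the threat field name differently from what the strict parser expects.
outputs = [
    "Threat Type: Brute Force\nseverity: high",
    "Threat: Port Scan\nseverity: low",
]
labels = [
    {"threat_type": "brute force", "severity": "high"},
    {"threat_type": "port scan", "severity": "low"},
]

for name, parser in (("strict", strict_parse), ("fuzzy", fuzzy_parse)):
    print(name,
          "threat:", accuracy(outputs, labels, parser, "threat_type"),
          "severity:", accuracy(outputs, labels, parser, "severity"))
```

On this toy data the severity score is identical under both parsers while the threat score changes, which is the same control pattern the abstract describes: a field whose name already matches the expected format is unaffected by the parser swap, so the difference on the other field is attributable to parsing. A looser cutoff tolerates more naming variation but raises the risk of mapping an unrelated key onto a field, so any real evaluation would need to validate it.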
