Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification

Yahya Masri; Emily Ma; Zifu Wang; Joseph Rogers; Chaowei Yang

arXiv:2601.07790·cs.AI·January 13, 2026


Open Access

TL;DR

This paper evaluates small language and reasoning models on system log severity classification, proposing it as a benchmark for real-time log comprehension and model efficiency in digital twin systems.

Contribution

It introduces a benchmark for assessing small models' log understanding, highlighting the impact of architecture, training, and retrieval integration on performance.

Findings

01 Qwen3-4B achieves 95.64% accuracy with RAG.

02 Gemma3-1B improves from 20.25% to 85.28% with RAG.

03 Model efficiency varies, with most completing inference under 1.2 seconds.
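
The Gemma3-1B finding can be restated as an absolute and a relative gain. The following is a quick arithmetic check that uses only the two accuracy figures quoted above; the helper name `rag_gain` is hypothetical and not taken from the paper.

```python
def rag_gain(baseline_pct, rag_pct):
    """Return (absolute gain in percentage points, ratio of RAG to baseline accuracy)."""
    absolute = round(rag_pct - baseline_pct, 2)
    relative = round(rag_pct / baseline_pct, 2)
    return absolute, relative

# Gemma3-1B accuracies as reported in the findings: 20.25% -> 85.28% with RAG.
gemma_gain = rag_gain(20.25, 85.28)
print(gemma_gain)  # (65.03, 4.21)
```

That is a gain of about 65 percentage points, or roughly a fourfold increase in accuracy.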

Abstract

System logs are crucial for monitoring and diagnosing modern computing infrastructure, but their scale and complexity require reliable and efficient automated interpretation. Since severity levels are predefined metadata in system log messages, having a model merely classify them offers limited standalone practical value, revealing little about its underlying ability to interpret system logs. We argue that severity classification is more informative when treated as a benchmark for probing runtime log comprehension rather than as an end task. Using real-world journalctl data from Linux production servers, we evaluate nine small language models (SLMs) and small reasoning language models (SRLMs) under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. The results reveal strong stratification. Qwen3-4B achieves the highest accuracy at 95.64% with RAG, while Gemma3-1B…
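
To make the three prompting settings concrete, here is a minimal sketch of how zero-shot, few-shot, and retrieval-augmented prompts for severity classification might be assembled. It is not the authors' pipeline. Everything in it is an assumption for illustration: the label set (the standard syslog severity names), the prompt wording, the sample log lines and their labels, and the token-overlap retriever, which stands in for whatever retrieval method the paper actually uses.

```python
from collections import Counter

# Standard syslog severity names; the paper's exact label set is an assumption here.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
INSTRUCTION = ("Classify the severity of this log message. "
               "Answer with one of: " + ", ".join(SEVERITIES) + ".")

def zero_shot_prompt(message):
    """Instruction plus the target log line, with no labeled examples."""
    return f"{INSTRUCTION}\n\nLog: {message}\nSeverity:"

def few_shot_prompt(message, examples):
    """Instruction, then labeled (message, severity) examples, then the target."""
    shots = "\n\n".join(f"Log: {m}\nSeverity: {s}" for m, s in examples)
    return f"{INSTRUCTION}\n\n{shots}\n\nLog: {message}\nSeverity:"

def token_overlap(a, b):
    """Count whitespace-separated tokens shared by two messages (toy similarity)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    return sum((ca & cb).values())

def rag_prompt(message, corpus, k=2):
    """Retrieve the k most similar labeled logs and use them as the examples."""
    nearest = sorted(corpus, key=lambda ex: token_overlap(message, ex[0]),
                     reverse=True)[:k]
    return few_shot_prompt(message, nearest)

def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the reference labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical labeled logs, invented for this example (not from the paper's data).
corpus = [
    ("sshd: Failed password for root from 10.0.0.5", "warning"),
    ("kernel: Out of memory: Killed process 4211", "err"),
    ("systemd: Started Daily apt download activities", "info"),
    ("sshd: Accepted password for admin from 10.0.0.9", "info"),
]
query = "sshd: Failed password for admin from 10.0.0.7"
prompt = rag_prompt(query, corpus, k=2)
print(prompt)
```

In this sketch the only difference between the few-shot and RAG settings is where the examples come from: fixed in advance for few-shot, selected per query by similarity for RAG. The model's reply to each prompt would then be compared against the logged severity with `accuracy`.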

Taxonomy

Topics: Software System Performance and Reliability · Software-Defined Networks and 5G · Cloud Computing and Resource Management