TL;DR
TeleCom-Bench is a comprehensive benchmark for evaluating large language models in telecommunications. It shows that models are strong at understanding telecom knowledge but have significant gaps in procedural application tasks.
Contribution
The paper presents a new benchmark of 12 evaluation sets (22,678 curated samples) for assessing LLMs in telecom, covering both knowledge comprehension and end-to-end application tasks, and it reports where current models fall short.
Findings
Models achieve 90% accuracy on linguistic tasks.
Accuracy drops to ~30% on procedural tasks.
Current LLMs are effective diagnosticians but not field engineers.
Abstract
While Large Language Models have achieved remarkable integration in various vertical scenarios, their deployment in the telecommunications domain remains exploratory due to the lack of a standardized evaluation framework. Current telecom benchmarks primarily focus on static, foundational knowledge and isolated atomic skills, neglecting the equipment-specific documentation and end-to-end industrial workflows essential for real-world production systems. To bridge this gap, we present TeleCom-Bench, a comprehensive benchmark comprising 12 evaluation sets with 22,678 curated samples, which evaluates LLMs across a synergistic hierarchy: (1) Multi-dimensional Knowledge Comprehension, which integrates telecommunication fundamentals, 3GPP protocols, and 5G network architecture with proprietary product knowledge across wired, core, and wireless networks via knowledge graph-driven synthesis; and…
