TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications
Pranshav Gajjar, Vijay K Shah

TL;DR
TeleEmbedBench is a specialized multi-corpus benchmark for evaluating embedding models in telecommunications, highlighting the superior performance of LLM-based embedders over traditional sentence-transformers.
Contribution
Introduces TeleEmbedBench, the first large-scale, multi-corpus telecommunications embedding benchmark, with an automated query generation pipeline and comprehensive evaluation of embedding models.
Findings
LLM-based embedders outperform traditional sentence-transformers in accuracy and robustness.
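Domain-specific instructions improve code retrieval but degrade retrieval over natural-language specifications.
The benchmark spans three telecommunications corpora (O-RAN Alliance specifications, 3GPP release documents, and the srsRAN open-source codebase), totaling 9,000 question-chunk pairs across chunk sizes of 512, 1024, and 2048 tokens.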
Abstract
Large language models (LLMs) are increasingly deployed in the telecommunications domain for critical tasks, relying heavily on Retrieval-Augmented Generation (RAG) to adapt general-purpose models to continuously evolving standards. However, a significant gap exists in evaluating the embedding models that power these RAG pipelines, as general-purpose benchmarks fail to capture the dense, acronym-heavy, and highly cross-referential nature of telecommunications corpora. To address this, we introduce TeleEmbedBench, the first large-scale, multi-corpus embedding benchmark designed specifically for telecommunications. The benchmark spans three heterogeneous corpora: O-RAN Alliance specifications, 3GPP release documents, and the srsRAN open-source codebase, comprising 9,000 question-chunk pairs across three standard chunk sizes (512, 1024, and 2048 tokens). To construct this dataset at scale…
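As a rough illustration of how question-chunk pairs like those described above could be used to score an embedding model, the sketch below computes a top-k hit rate with cosine similarity. The metric choice, the function names (`cosine`, `recall_at_k`), and the toy two-dimensional vectors are all assumptions made for illustration; the summary does not state which metrics or tooling TeleEmbedBench actually uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(query_vecs, chunk_vecs, gold_indices, k):
    """Fraction of queries whose paired (gold) chunk ranks among the k most similar chunks.

    Hypothetical helper: query_vecs[i] is the embedding of question i and
    gold_indices[i] is the index in chunk_vecs of the chunk it was paired with.
    """
    hits = 0
    for query, gold in zip(query_vecs, gold_indices):
        # Rank every chunk index by similarity to the query, most similar first.
        ranked = sorted(
            range(len(chunk_vecs)),
            key=lambda i: cosine(query, chunk_vecs[i]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy stand-ins for embeddings; a real run would embed each question and chunk
# with the model under test.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vecs = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
gold_indices = [0, 1, 2]

recall_1 = recall_at_k(query_vecs, chunk_vecs, gold_indices, k=1)
recall_2 = recall_at_k(query_vecs, chunk_vecs, gold_indices, k=2)
```

In this toy setup the third query sits closer to the second chunk than to its own gold chunk, so it counts as a miss at k=1 and a hit at k=2. Repeating such a computation per corpus and per chunk size would give one comparable score for each embedder and setting.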
