TeleResilienceBench: Quantifying Resilience for LLM Reasoning in Telecommunications

Pranshav Gajjar, Emmanuel Ojo, and Vijay K Shah

arXiv:2605.09929 · cs.LG · May 12, 2026

TL;DR

This paper introduces TeleResilienceBench, a benchmark for measuring reasoning resilience in large language models within telecommunications, revealing current models' limited ability to recover from reasoning errors.

Contribution

The paper presents a new benchmark, TeleResilienceBench, and a new metric, the Correct Flip Rate (CFR), for assessing whether a model can recover from a flawed reasoning trace in telecom tasks. The evaluation shows that current models recover in only a minority of cases.

Findings

01 Even the strongest model achieves a Correct Flip Rate (CFR) of only 29.1%.

02 Scale does not reliably improve resilience within a model family.

03 Nemotron-3-nano 4b outperforms larger Qwen models in resilience-to-cost ratio.

Abstract

Deploying large language models in telecommunications requires more than task accuracy. In realistic workflows, a model may inherit partially completed reasoning from a prior step, an upstream agent, or its own earlier generation, and must continue that reasoning even when it is already going wrong. We introduce TeleResilienceBench, a benchmark that quantifies this capability, which we term reasoning resilience, across seven telecom sub-domains drawn from the GSMA Open-Telco LLM suite. Instances are constructed by collecting failures from a weak generator model, truncating the flawed reasoning trace at its midpoint, and asking a target model to continue and correct it. We propose the Correct Flip Rate (CFR) as a direct measure of successful recovery and evaluate eight models spanning the Qwen3.5, Gemma4, and Nemotron-3 families. Our results show that even the strongest model achieves a…
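The construction and scoring steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Instance` fields, the word-level midpoint, the prompt wording, and the exact-match comparison are all assumptions, and CFR is taken here to mean the fraction of inherited-failure instances on which the target model's final answer is correct.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One benchmark item (field names are hypothetical)."""
    question: str
    flawed_trace: str   # weak generator's reasoning that led to a wrong answer
    gold_answer: str


def truncate_at_midpoint(trace: str) -> str:
    # Assumption: the midpoint is measured in whitespace-separated words.
    words = trace.split()
    return " ".join(words[: len(words) // 2])


def build_prompt(inst: Instance) -> str:
    # Assumption: the target model sees the question plus the truncated trace.
    prefix = truncate_at_midpoint(inst.flawed_trace)
    return (
        f"Question: {inst.question}\n"
        f"Partial reasoning: {prefix}\n"
        "Continue the reasoning and give the final answer."
    )


def correct_flip_rate(predictions: list[str], instances: list[Instance]) -> float:
    # Every instance starts from a failed trace, so each correct final answer
    # counts as one "flip" from wrong to right.
    if not instances:
        return 0.0
    flips = sum(
        pred.strip().lower() == inst.gold_answer.strip().lower()
        for pred, inst in zip(predictions, instances)
    )
    return flips / len(instances)
```

In use, each prompt from `build_prompt` would be sent to the target model, and the extracted final answers passed to `correct_flip_rate` together with the instances. A real evaluation would likely need a more tolerant answer matcher than exact string comparison.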
