TeleResilienceBench: Quantifying Resilience for LLM Reasoning in Telecommunications
Pranshav Gajjar, Emmanuel Ojo, and Vijay K Shah

TL;DR
This paper introduces TeleResilienceBench, a benchmark for measuring reasoning resilience in large language models within telecommunications, revealing current models' limited ability to recover from reasoning errors.
Contribution
The paper presents a new benchmark, TeleResilienceBench, and a new metric, the Correct Flip Rate (CFR), for assessing whether a model can recover from a flawed reasoning trace in telecom tasks.
Findings
Even the strongest model achieves a Correct Flip Rate (CFR) of only 29.1%.
Scale does not reliably improve resilience within model families.
Nemotron-3-nano 4b outperforms larger Qwen models in resilience-to-cost ratio.
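
The CFR figures above can be read as a simple proportion. The sketch below is a minimal illustration, assuming CFR is the fraction of benchmark instances (each starting from a trace that originally ended in a wrong answer) for which the target model's continuation reaches the correct answer. The exact definition and answer-matching rule used by the authors are not stated on this page, and the names `Outcome` and `correct_flip_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Outcome:
    """One benchmark instance after the target model has continued the trace."""
    reference: str  # gold answer for the question
    continued: str  # final answer produced by the target model


def correct_flip_rate(outcomes: Sequence[Outcome]) -> float:
    """Fraction of instances where the continuation ends on the gold answer.

    Assumption: every instance began from a failed trace, so a correct
    final answer counts as a successful "flip". Matching here is a plain
    case-insensitive string comparison, chosen only for illustration.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    flips = sum(
        1
        for o in outcomes
        if o.continued.strip().lower() == o.reference.strip().lower()
    )
    return flips / len(outcomes)


if __name__ == "__main__":
    demo = [
        Outcome(reference="B", continued="b"),
        Outcome(reference="C", continued="A"),
        Outcome(reference="D", continued="D"),
        Outcome(reference="A", continued="C"),
    ]
    print(f"CFR = {correct_flip_rate(demo):.1%}")
```

Under this reading, a CFR of 29.1% would mean that roughly 29 of every 100 inherited failures are turned into correct answers.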
Abstract
Deploying large language models in telecommunications requires more than task accuracy. In realistic workflows, a model may inherit partially completed reasoning from a prior step, an upstream agent, or its own earlier generation, and must continue that reasoning even when it is already going wrong. We introduce TeleResilienceBench, a benchmark that quantifies this capability, which we term reasoning resilience, across seven telecom sub-domains drawn from the GSMA Open-Telco LLM suite. Instances are constructed by collecting failures from a weak generator model, truncating the flawed reasoning trace at its midpoint, and asking a target model to continue and correct it. We propose the Correct Flip Rate (CFR) as a direct measure of successful recovery and evaluate eight models spanning the Qwen3.5, Gemma4, and Nemotron-3 families. Our results show that even the strongest model achieves a…
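
The instance construction described in the abstract (take a failed trace, cut it at its midpoint, ask a target model to continue and correct it) can be sketched as follows. This is an illustrative sketch only: the abstract does not say whether the midpoint is measured in tokens, characters, or reasoning steps, so the code assumes newline-separated steps, and the prompt wording and the function names `truncate_at_midpoint` and `build_continuation_prompt` are hypothetical.

```python
def truncate_at_midpoint(trace: str) -> str:
    """Keep the first half of a reasoning trace.

    Assumption: each non-empty line is one reasoning step, and the
    midpoint is the step count divided by two (rounded down, but at
    least one step is kept when the trace is non-empty).
    """
    steps = [line for line in trace.splitlines() if line.strip()]
    keep = max(1, len(steps) // 2)
    return "\n".join(steps[:keep])


def build_continuation_prompt(question: str, flawed_trace: str) -> str:
    """Assemble a prompt asking a target model to continue a flawed trace."""
    prefix = truncate_at_midpoint(flawed_trace)
    return (
        f"Question: {question}\n\n"
        f"Partial reasoning:\n{prefix}\n\n"
        "Continue the reasoning above, correcting any mistakes, "
        "and give the final answer."
    )


if __name__ == "__main__":
    # Made-up placeholder trace, not taken from the benchmark.
    trace = "step 1\nstep 2\nstep 3\nstep 4"
    print(build_continuation_prompt("Example question?", trace))
```

The target model's answer to such a prompt would then be compared with the reference answer to decide whether the inherited failure was recovered.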
