Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
Kishan Maharaj, Nandakishore Menon, Ashita Saxena, Srikanth Tamilselvam

TL;DR
This paper systematically evaluates the robustness of large language models in long-context code question answering, revealing significant vulnerabilities to input variations and highlighting the need for improved reasoning fidelity.
Contribution
It extends the LongCodeBench Python dataset with new COBOL and Java question-answer sets and uses controlled ablations to assess model robustness in long-context code reasoning tasks.
Findings
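Performance drops substantially when multiple-choice options are shuffled and when questions are posed in open-ended form.
Models behave brittlely in needle-in-a-haystack contexts that mix relevant information with adversarially irrelevant information.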
The study highlights limitations of current evaluation methods.
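The brittleness finding concerns contexts in which a relevant snippet is surrounded by irrelevant material. As a minimal sketch of how such a context could be assembled (this is not the authors' pipeline; the function name `build_haystack`, the `depth` parameter, and the blank-line separator are all assumptions made for illustration):

```python
def build_haystack(needle, distractors, depth):
    """Place one relevant snippet (the needle) among irrelevant snippets.

    depth is a hypothetical knob in [0, 1]: 0 puts the needle first,
    1 puts it last, 0.5 puts it near the middle of the context.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    # Convert the relative depth into an insertion index.
    position = round(depth * len(distractors))
    chunks = distractors[:position] + [needle] + distractors[position:]
    # Join snippets with a blank line; a real harness would likely use
    # file headers or other delimiters instead.
    return "\n\n".join(chunks)
```

Sweeping `depth` and the number of distractors while holding the question fixed is one way to separate the effect of needle position from the effect of context scale.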
Abstract
Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending the LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions, and (iii) needle-in-a-haystack contexts containing relevant and adversarially irrelevant information. Results show substantial performance drops in both shuffled multiple-choice options and open-ended questions, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context…
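The shuffled-options setting can be illustrated with a short sketch. It permutes the answer choices of one multiple-choice item and tracks where the correct answer ends up, so accuracy can be compared before and after the permutation. The function names, the seeding scheme, and the letter-based prompt layout are assumptions for illustration, not details taken from the paper.

```python
import random


def shuffle_options(options, answer_index, seed=0):
    """Return (shuffled_options, new_answer_index) for one item.

    A dedicated random.Random instance keeps the permutation
    reproducible without touching global random state.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[k] is the original index now shown at position k.
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


def format_question(question, options):
    """Render a question with lettered options (hypothetical layout)."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)
```

A model that answers from the content of the options should score the same under any seed; a gap between the original and shuffled orderings points to sensitivity to option position or label.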
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Artificial Intelligence in Healthcare and Education
