Context-Length Robustness in Question Answering Models: A Comparative Empirical Study
Trishita Dhara, Siddhesh Sheth

TL;DR
This study systematically evaluates how the question answering accuracy of large language models declines as context length increases. It finds that robustness is task-dependent, with multi-hop reasoning tasks affected most.
Contribution
It provides the first controlled empirical comparison of context-length robustness across QA tasks, highlighting the vulnerability of multi-hop reasoning to context dilution.
Findings
Accuracy degrades consistently as context length increases.
Multi-hop reasoning tasks are more affected than single-span extraction.
HotpotQA shows nearly twice the accuracy loss of SQuAD under long contexts.
Abstract
Large language models are increasingly deployed in settings where relevant information is embedded within long and noisy contexts. Despite this, robustness to growing context length remains poorly understood across different question answering tasks. In this work, we present a controlled empirical study of context-length robustness in large language models using two widely used benchmarks: SQuAD and HotpotQA. We evaluate model accuracy as a function of total context length by systematically increasing the amount of irrelevant context while preserving the answer-bearing signal. This allows us to isolate the effect of context length from changes in task difficulty. Our results show a consistent degradation in performance as context length increases, with substantially larger drops observed on multi-hop reasoning tasks compared to single-span extraction tasks. In particular, HotpotQA…
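The padding procedure described in the abstract (adding irrelevant context while keeping the answer-bearing passages intact) can be illustrated with a minimal sketch. This is not the authors' code: the function name `pad_context`, the use of whitespace-separated word counts as a stand-in for token length, and the random placement of passages are all assumptions made for illustration.

```python
import random


def pad_context(gold_passages, distractor_pool, target_words, seed=0):
    """Build a context that keeps every gold passage and adds distractors.

    Hypothetical helper: length is measured in whitespace-separated words,
    which only approximates a model's token count.
    """
    rng = random.Random(seed)

    # Always keep the answer-bearing passages, even if they alone
    # exceed the target length.
    passages = list(gold_passages)
    length = sum(len(p.split()) for p in passages)

    # Candidate distractors: anything in the pool that is not a gold passage.
    pool = [d for d in distractor_pool if d not in gold_passages]
    rng.shuffle(pool)

    # Greedily add distractors that still fit within the word budget.
    for distractor in pool:
        n_words = len(distractor.split())
        if length + n_words > target_words:
            continue
        passages.append(distractor)
        length += n_words

    # Shuffle so the gold passages do not always appear first.
    rng.shuffle(passages)
    return "\n\n".join(passages)
```

Calling such a helper with increasing values of `target_words` for the same question and gold passages gives a family of contexts that differ only in the amount of irrelevant text, which is the kind of comparison the study describes.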
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems
