Expect the Unexpected: FailSafe Long Context QA for Finance
Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh

TL;DR
This paper introduces FailSafeQA, a benchmark for testing the robustness and context-awareness of large language models in financial question-answering, highlighting their strengths and weaknesses in handling perturbations and irrelevant information.
Contribution
The paper presents a new financial benchmark, FailSafeQA, for evaluating LLM robustness and introduces a comprehensive analysis of model performance under various perturbations.
Findings
Some models effectively mitigate input perturbations.
Models struggle to balance robustness with hallucination avoidance.
High-performing models still have significant room for improvement.
Abstract
We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from…
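The six perturbation cases and the judge-based scoring described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Perturbation`, `JudgedAnswer`, `compliance_rate`, and `rate_by_perturbation` are hypothetical, and the integer rating scale and the threshold of 4 are assumptions made only for the example. The abstract does not specify how Robustness, Context Grounding, or Compliance are computed, so the sketch only shows a per-case pass rate.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class Perturbation(Enum):
    # Query Failure: the query is altered, the document is left intact.
    OUT_OF_DOMAIN_QUERY = "query varied in domain expertise"
    INCOMPLETE_QUERY = "query varied in completeness"
    MISSPELLED_QUERY = "query varied in linguistic accuracy"
    # Context Failure: the uploaded document is altered, the query is left intact.
    DEGRADED_CONTEXT = "degraded document"
    IRRELEVANT_CONTEXT = "irrelevant document"
    EMPTY_CONTEXT = "empty document"


@dataclass
class JudgedAnswer:
    perturbation: Perturbation
    rating: int  # hypothetical LLM-as-a-Judge rating; the scale is an assumption


def compliance_rate(answers, threshold=4):
    """Fraction of answers whose rating meets the (assumed) threshold."""
    if not answers:
        return 0.0
    return mean(1.0 if a.rating >= threshold else 0.0 for a in answers)


def rate_by_perturbation(answers, threshold=4):
    """Group judged answers by perturbation case and compute a pass rate for each."""
    groups = {}
    for a in answers:
        groups.setdefault(a.perturbation, []).append(a)
    return {p: compliance_rate(g, threshold) for p, g in groups.items()}
```

Reporting one rate per case, rather than a single average, keeps the two failure families separate: a model can score well when the query is noisy and still answer confidently when the document is empty or irrelevant.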
Taxonomy
Topics: Scientific Computing and Data Management · Big Data and Business Intelligence · Reservoir Engineering and Simulation Methods
