When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?
Xinyu Zhou, Chang Jin, Carsten Eickhoff, Zhijiang Guo, Seyed Ali Bahrainian

TL;DR
This paper explores training large language models to abstain from answering in temporal question answering, using a novel RL-based approach with Chain-of-Thought supervision to improve reasoning and reliability.
Contribution
It introduces a reinforcement learning framework combined with Chain-of-Thought supervision to enable LLMs to learn abstention, improving temporal reasoning accuracy and reliability.
Findings
RL improves reasoning accuracy over supervised fine-tuning.
RL increases true positive rate on unanswerable questions.
SFT induces overconfidence and reduces reliability.
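The second finding refers to a true positive rate on unanswerable questions. As a rough illustration, the sketch below computes that rate under the assumption that "positive" means the model abstained, so a true positive is an abstention on a question that is in fact unanswerable. The `ABSTAIN` label, the `Example` record, and the `abstention_rates` helper are hypothetical names chosen for this example, not the paper's evaluation code.

```python
from dataclasses import dataclass

# Hypothetical abstention label; the paper's actual output format is not given here.
ABSTAIN = "unanswerable"


@dataclass
class Example:
    prediction: str   # model output after any answer extraction
    answerable: bool  # gold label: can the question be answered from the input?


def abstention_rates(examples):
    """Return abstention true/false positive rates, treating 'abstain' as the positive class."""
    tp = fn = fp = tn = 0
    for ex in examples:
        abstained = ex.prediction.strip().lower() == ABSTAIN
        if not ex.answerable:
            # Unanswerable question: abstaining is the correct behavior.
            if abstained:
                tp += 1
            else:
                fn += 1
        else:
            # Answerable question: abstaining is over-cautious.
            if abstained:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"tpr": tpr, "fpr": fpr}
```

Reporting the false positive rate alongside the true positive rate helps distinguish a model that has learned when to abstain from one that simply abstains more often.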
Abstract
Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in…
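The abstract mentions RL guided by abstention-aware rewards but does not spell out the reward here. The following is a minimal sketch of what such a reward could look like, assuming exact-match scoring and a single abstention label. The function name, the `ABSTAIN` string, and the reward values (`1.0`, `-1.0`, `-0.5`) are illustrative assumptions, not the authors' settings.

```python
# Hypothetical abstention label and reward values, chosen only for illustration.
ABSTAIN = "unanswerable"


def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def abstention_aware_reward(prediction, gold_answers, answerable,
                            correct=1.0, wrong=-1.0, over_abstain=-0.5):
    """Score one model output, rewarding abstention only when it is warranted."""
    pred = normalize(prediction)
    abstained = pred == ABSTAIN
    if not answerable:
        # Unanswerable: abstaining is correct, any concrete answer is wrong.
        return correct if abstained else wrong
    if abstained:
        # Answerable but the model refused: milder penalty than a wrong answer.
        return over_abstain
    gold = {normalize(g) for g in gold_answers}
    return correct if pred in gold else wrong
```

Penalizing unnecessary abstention less than a wrong answer is one possible design choice. It is meant to discourage confident errors without making refusal the safest default on every question.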
Peer Reviews
Decision: ICLR 2026 Poster
- **Relevance and Importance**: The paper addresses an important and timely problem. Current models struggle to abstain and the proposed RL technique is simple and effective.
- **Strong Results**: The results, although surprising, are strong. RL significantly beats SFT, even for SFT models of larger sizes.
- **Poor Structure and Presentation**: The paper’s organization lacks coherence. Sections jump between unrelated topics (e.g., implicit reasoning, KG extraction, RL training) without clear motivation or integration into the main story. Figures and experiments are presented out of logical order, reducing readability.
- **Weak Experimental Design**: Some experiments feel arbitrary or poorly motivated. Dataset choices, baselines, and prompt configurations are insufficiently justified, and several
The authors tackle the important open problem of the best approach to teach models the skill of abstention for temporal questions. The authors cover a reasonable set of closed and open models as well as explore various setups for inducing abstention, including various approaches to including context. The authors make reasonable choices in terms of post-training methods (GRPO, SFT) and adapt them for abstention. I commend the authors on the perspective that abstention is a learnable skill and the
The authors offer some nice findings comparing post-training approaches for abstention, including the lack of success of some approaches (SFT). One aspect that could be improved is more intuition about why some setups work better than others. There is a growing body of literature along the lines of https://arxiv.org/abs/2501.17161 which explains memorization and generalization learning dynamics of post-training approaches that can be used to better contextualize this work's finding
1. Timely focus on abstention + temporal reasoning.
2. Comparison across input settings (question only, full context, time-filtered sub-context, KGs), model scales, SFT vs RL, and prompt variants sheds insights into the domain.
3. The experimental setup is detailed nicely for reproduction.
4. Some interesting analysis is performed, including:
   4.1. SFT increases overconfidence
   4.2. Increasing unanswerable questions in training can collapse the model
   4.3. Impact of KG on abstention
1. **Lack of Benchmarks** - Results are confined to TimeQA. Other temporal reasoning sets (e.g., [1,2]) would better validate generality. The OOD experiments (Table 4, p. 9) focus on non‑temporal datasets and show very poor transfer after RL (e.g., TP -> 0 on RL+c), which underscores brittleness.
2. **Heavy reliance on GPT-o1 for CoT** - CoT collection uses GPT‑o1. This raises questions about measuring knowledge distillation from larger models, rather than assessing the impact of suggested train
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Multimodal Machine Learning Applications
