Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
Jin-Seop Lee, SungJoon Lee, SeongJun Jung, Boyang Li, Jee-Hyong Lee

TL;DR
This paper introduces a refusal-aware reinforcement fine-tuning approach for video temporal grounding that effectively handles hard-irrelevant queries, improving relevance discrimination and semantic reasoning in VTG models.
Contribution
It proposes RA-RFT, a novel reinforcement learning method with multiple reward objectives, and constructs a Hard-Irrelevant VTG dataset to enhance relevance-aware video grounding.
Findings
RA-RFT improves rejection of hard-irrelevant queries in VTG.
The method outperforms existing models in relevance discrimination.
It is scalable across various VTG architectures.
Abstract
Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant. To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG. Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives-format, refuse-IoU, explain, and query correction-to improve both relevance discrimination and fine-grained semantic…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition
