TL;DR
This paper introduces a framework and dataset for training Video Large Language Models to assess question relevance and refuse to answer when questions are outside the video's scope, improving real-world applicability.
Contribution
It proposes an alignment-for-answerability framework and a new dataset that enable Video-LLMs to evaluate question relevance and refuse unanswerable queries.
Findings
Video-LLMs often fail to refuse irrelevant questions.
The proposed framework improves models' ability to reject unfit questions.
A new dataset supports training and evaluating answerability in Video-LLMs.
Abstract
In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime exemplification is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding capabilities of these models, they are predominantly trained on questions generated directly from video content. However, in real-world scenarios, users often pose questions that extend beyond the informational scope of the video, highlighting the need for Video-LLMs to assess the relevance of the question. We demonstrate that even the best-performing Video-LLMs fail to reject unfit questions, not necessarily due to a lack of video understanding, but because they have not been trained to identify…
Peer Reviews
Decision·ICLR 2025 Poster
The paper is well-written and clearly introduces the concepts, definitions and metrics. The contribution is original as it formally recognizes a new problem, i.e. multimodal LLMs answering clearly unanswerable questions, and proposes solutions to it. - The proposed problem framing and metric make intuitive sense. The definition takes into account the reasoning of why a question is unanswerable. The metric takes into account excessive refusal to answer which is a big problem with LLMs. - A new da
The main weakness is that the proposed QAs in the dataset seem to be mostly geared towards detection capabilities without requiring much reasoning. This in turn makes the corresponding dataset construction, and improving a model's capabilities on this axis, rather easy. Some qualitative examples presented are "What breed is the cat in the video?", "What color laptop the presenter is holding?", "How many times does a person in gray shirt appear in the video?", etc. Recently, many reasoning based vid
- This paper executes a valid pipeline for improving VLLMs on a task: identifying the problem, curating a dataset to fix the problem, and finetuning the model to show improvements. The paper shows good practice in introducing refusal to answer for VLLMs. - The alignment score defined in Section 4 makes sense to me. - The authors conducted experiments on a variety of VLLMs in Table 1.
- My main impression from reading this paper is that the problem of refusing to answer is a bit small and artificial. While the failure case is shown in Figure 1, I feel it is also important to show whether the problem can be mitigated by explicit prompting, e.g., adding to the prompt "Say 'can't answer' if you are not sure". Even though I believe the authors' proposed method of finetuning on the curated dataset may still be better, I feel the problem is not as big as the authors claim. - Adding to the
1. The work is overall well-developed and shows the authors’ insight into the problem. 2. The constructed benchmark could be valuable, with dedicated evaluation metrics and annotations (reason for unanswerable). 3. The paper implements both SFT and DPO for improving existing Video-LLMs with the training data and shares helpful insights.
1. The paper neglects to discuss many existing works that study the unanswerability of (V)QA models (see my attached references). 2. It would be better to conduct a more in-depth analysis to help understand what kinds of unanswerable QA are more challenging to resolve in videos: static objects/attributes/relations vs. dynamic actions. [1] Whitehead S, Petryk S, Shakib V, et al. Reliable visual question answering: Abstain rather than answer incorrectly[C]//European Conference on Computer Vision
