Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
Huabin Liu, Filip Ilievski, Cees G. M. Snoek

TL;DR
This paper introduces a novel video-grounded entailment tree reasoning approach for commonsense video question answering, addressing biases in current models and enhancing reasoning capabilities across various benchmarks.
Contribution
It presents the first explicit entailment tree reasoning framework for VQA, improving generalizability and fairness in evaluating visual-language models.
Findings
Effective in reducing bias in VQA benchmarks
Enhances reasoning accuracy across multiple VLMs
Demonstrates improved performance on de-biased benchmarks
Abstract
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling
