Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

Huabin Liu; Filip Ilievski; Cees G. M. Snoek

arXiv:2501.05069 · cs.CV · March 26, 2025

Open Access

TL;DR

This paper introduces a novel video-grounded entailment tree reasoning approach for commonsense video question answering, addressing biases in current models and enhancing reasoning capabilities across various benchmarks.

Contribution

It presents the first explicit entailment tree reasoning framework for VQA, improving generalizability and fairness in evaluating visual-language models.

Findings

01. Effective in reducing bias in VQA benchmarks

02. Enhances reasoning accuracy across multiple VLMs

03. Demonstrates improved performance on de-biased benchmarks

Abstract

This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks…
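The four steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Node` class, the `reason` function, the `verify` and `expand` callbacks, and the rule that a parent holds only when all of its children hold are assumptions made for illustration. In a real system `verify` would query a VLM on a video fragment and `expand` would ask a language model for sub-statements; here both are stubbed with a dictionary and a lambda.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One statement in a (hypothetical) entailment tree."""
    statement: str
    children: List["Node"] = field(default_factory=list)
    verdict: Optional[bool] = None


def reason(
    node: Node,
    verify: Callable[[str], Optional[bool]],
    expand: Callable[[str], List[Node]],
    depth: int = 0,
    max_depth: int = 2,
) -> bool:
    """Decide whether a statement is supported, bottom-up over the tree."""
    if node.children:
        # Tree reasoning: assume a parent holds only if every child holds.
        results = [reason(c, verify, expand, depth + 1, max_depth)
                   for c in node.children]
        node.verdict = all(results)
        return node.verdict

    # Entailment verification on a leaf: True, False, or None (undecided).
    result = verify(node.statement)
    if result is None and depth < max_depth:
        # Dynamic expansion: split an undecided leaf into sub-statements.
        node.children = expand(node.statement)
        if node.children:
            return reason(node, verify, expand, depth, max_depth)

    # Undecided leaves that cannot be expanded count as unsupported.
    node.verdict = bool(result)
    return node.verdict


# Stub verifier: a lookup table standing in for a VLM judging video fragments.
facts = {
    "a person opens the fridge": True,
    "the person pours a drink": True,
    "the person drinks": True,
}
verify = facts.get  # unknown statements return None (undecided)

# Stub expander: only one statement has known sub-statements.
expand = lambda s: (
    [Node("the person pours a drink"), Node("the person drinks")]
    if s == "the person is thirsty" else []
)

# Tree construction: a candidate answer decomposed into two sub-statements.
root = Node(
    "the person opened the fridge because they were thirsty",
    [Node("a person opens the fridge"), Node("the person is thirsty")],
)
answer = reason(root, verify, expand)
```

In this toy run the second leaf is undecided at first, gets expanded into two verifiable sub-statements, and the root is accepted because every leaf is supported. The recorded `verdict` fields give an explicit trace of which statements led to the answer.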

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling