Video Question Answering with Phrases via Semantic Roles
Arka Sadhu, Kan Chen, Ram Nevatia

TL;DR
This paper introduces VidQAP, which poses Video Question Answering as a fill-in-the-phrase task. Semantic roles derived from video descriptions determine which phrase is masked, so answers can be full phrases rather than single words or selections from a fixed set.
Contribution
The work proposes a fill-in-the-phrase VidQA task built on semantic roles, constructs two datasets for it, and extends three vision-language models for benchmarking and analysis.
Findings
VidQAP allows phrase-level answers, where earlier VidQA evaluation was limited to single words or a fixed set of phrases.
Two new datasets were constructed: ActivityNet-SRL-QA and Charades-SRL-QA.
Three vision-language models were extended and benchmarked on these datasets.
Abstract
Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or selecting a phrase from a fixed set of phrases. These metrics limit the VidQA models' application scenario. In this work, we leverage semantic roles derived from video descriptions to mask out certain phrases, to introduce VidQAP which poses VidQA as a fill-in-the-phrase task. To enable evaluation of answer phrases, we compute the relative improvement of the predicted answer compared to an empty string. To reduce the influence of language bias in VidQA datasets, we retrieve a video having a different answer for the same question. To facilitate research, we construct ActivityNet-SRL-QA and Charades-SRL-QA and benchmark them by extending three vision-language models. We further perform extensive analysis and ablative studies to guide future work.
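The two mechanisms named in the abstract, masking a phrase by its semantic role and scoring a prediction relative to an empty string, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `<Q-ROLE>` placeholder format, the helper names `mask_role` and `relative_score`, the token-level F1 stand-in metric, the choice to score the full filled-in sentence, and the normalization by the reference's own score are all hypothetical choices made here for illustration.

```python
from collections import Counter


def mask_role(roles, target):
    """Build a fill-in-the-phrase query from semantic-role annotations.

    `roles` is an ordered list of (role_label, phrase) pairs for one
    description. The first phrase whose label equals `target` is replaced
    by a placeholder (hypothetical format) and returned as the answer.
    """
    parts, answer = [], None
    for label, phrase in roles:
        if label == target and answer is None:
            parts.append(f"<Q-{label}>")
            answer = phrase
        else:
            parts.append(phrase)
    if answer is None:
        raise ValueError(f"role {target!r} not found")
    return " ".join(parts), answer


def token_f1(reference, hypothesis):
    """Token-overlap F1, used only as a stand-in for a real text metric."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def relative_score(query, placeholder, reference, prediction, metric=token_f1):
    """Improvement of `prediction` over an empty answer, normalized (assumption)
    so that an empty answer scores 0 and the reference answer scores 1."""
    def fill(answer):
        # Insert the answer and collapse any doubled whitespace.
        return " ".join(query.replace(placeholder, answer).split())

    full_reference = fill(reference)
    baseline = metric(full_reference, fill(""))
    ceiling = metric(full_reference, full_reference)
    if ceiling == baseline:
        return 0.0  # the masked phrase contributes nothing measurable
    return (metric(full_reference, fill(prediction)) - baseline) / (ceiling - baseline)


# Example with made-up annotations for one video description.
roles = [("ARG0", "a man"), ("V", "throws"), ("ARG1", "a red ball")]
query, answer = mask_role(roles, "ARG1")
exact = relative_score(query, "<Q-ARG1>", answer, "a red ball")
partial = relative_score(query, "<Q-ARG1>", answer, "a ball")
empty = relative_score(query, "<Q-ARG1>", answer, "")
```

Scoring the filled-in sentence rather than the bare answer keeps the empty-string baseline above zero, so the relative score reflects only what the predicted phrase adds. A partially correct phrase such as "a ball" lands strictly between the empty answer and the exact reference.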
