LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Jingjing Jiang, Ziyi Liu, and Nanning Zheng

arXiv:2111.14547·cs.CV·December 1, 2021

TL;DR

LiVLR is a lightweight, flexible visual-linguistic reasoning framework for VideoQA that integrates multi-modal content at different semantic levels and outperforms existing methods on the MSRVTT-QA and KnowIT VQA benchmarks.

Contribution

The paper introduces LiVLR, a novel lightweight VideoQA framework with a diversity-aware reasoning module for flexible multi-modal content integration.

Findings

01. Outperforms existing methods on the MSRVTT-QA and KnowIT VQA datasets.

02. Graph-based Visual and Linguistic Encoders yield effective multi-grained visual and linguistic representations.

03. Key components are validated through extensive ablation studies.

Abstract

Video Question Answering (VideoQA), aiming to correctly answer the given question based on understanding multi-modal video content, is challenging due to the rich video content. From the perspective of video understanding, a good VideoQA framework needs to understand the video content at different semantic levels and flexibly integrate the diverse video content to distill question-related content. To this end, we propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR. Specifically, LiVLR first utilizes the graph-based Visual and Linguistic Encoders to obtain multi-grained visual and linguistic representations. Subsequently, the obtained representations are integrated with the devised Diversity-aware Visual-Linguistic Reasoning module (DaVL). The DaVL considers the difference between the different types of representations and can flexibly adjust the importance of…
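
The abstract is cut off before it says how DaVL adjusts the importance of each representation type, so the following is only a minimal sketch of the general idea of question-conditioned weighting, not the authors' DaVL module. It assumes each representation type has already been pooled into one vector, and it scores relevance with a scaled dot product against a question vector; the function `fuse_representations` and the keys `visual_coarse`, `visual_fine`, and `linguistic` are hypothetical names chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_representations(question, representations):
    """Weight each representation by its relevance to the question, then sum.

    Assumption: relevance is a scaled dot product. The paper's DaVL module
    may compute importance differently; this is an illustrative stand-in.
    """
    names = list(representations)
    scale = math.sqrt(len(question))
    scores = [dot(question, representations[n]) / scale for n in names]
    weights = softmax(scores)

    # Weighted sum of the representation vectors.
    fused = [0.0] * len(question)
    for weight, name in zip(weights, names):
        for i, value in enumerate(representations[name]):
            fused[i] += weight * value
    return dict(zip(names, weights)), fused


# Hypothetical pooled vectors for three representation types.
question = [1.0, 0.0, 0.0, 0.0]
representations = {
    "visual_coarse": [0.0, 1.0, 0.0, 0.0],
    "visual_fine": [1.0, 0.0, 0.0, 0.0],
    "linguistic": [0.5, 0.5, 0.0, 0.0],
}
weights, fused = fuse_representations(question, representations)
print(weights)
print(fused)
```

In this toy input the `visual_fine` vector is most aligned with the question, so it receives the largest weight; the weights always sum to one, which keeps the fused vector on the same scale as its inputs.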

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition