VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks

Hung Le; Nancy F. Chen; Steven C.H. Hoi

arXiv:2104.07921 · cs.CV · June 14, 2022

TL;DR

This paper introduces VGNMN, a neural module network designed for video-grounded language tasks, effectively handling temporal variance and cross-turn dependencies in dialogues, and demonstrating promising results on benchmark datasets.

Contribution

The paper presents a novel neural module network architecture tailored for video-grounded language tasks, extending NMN applications from images to videos with explicit language component decomposition.

Findings

01. Achieves promising performance on video-grounded dialogue benchmarks.

02. Effectively models temporal variance and cross-turn dependencies.

03. Demonstrates the applicability of NMN in complex video-language tasks.

Abstract

Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, very little work has studied NMN in video-grounded dialogue tasks. These tasks add to the complexity of traditional visual tasks through visual temporal variance and cross-turn language dependencies. Motivated by recent NMN approaches to image-grounded tasks, we introduce the Video-grounded Neural Module Network (VGNMN), which models the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components in dialogues to explicitly resolve any entity references and detect corresponding action-based inputs from the question. The detected entities and actions are used as parameters to instantiate neural module networks and extract visual cues from the video.…
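
To make the pipeline idea concrete, here is a minimal symbolic sketch in Python. It is not the authors' implementation: VGNMN uses learned neural modules, whereas this toy replaces them with hand-written set lookups. The names `resolve_entity`, `find`, `filter_action`, and `run_program`, the pronoun list, and the toy frame representation are all hypothetical and chosen only for illustration. The sketch shows the three steps the abstract describes: resolve an entity reference against earlier turns, use the resolved entity and the detected action as module parameters, and run the modules over the video.

```python
from typing import List, Set, Tuple

# Toy stand-in for a video: each frame is a set of (entity, action) pairs.
# A real system would operate on learned visual features instead.
Frame = Set[Tuple[str, str]]

# Hypothetical pronoun list used by the toy reference resolver.
PRONOUNS = {"it", "he", "she", "they"}


def resolve_entity(mention: str, history: List[str]) -> str:
    """Replace a pronoun with the most recent entity from earlier turns."""
    if mention in PRONOUNS and history:
        return history[-1]
    return mention


def find(video: List[Frame], entity: str) -> List[int]:
    """Module parameterized by an entity: frames in which it appears."""
    return [i for i, frame in enumerate(video)
            if any(e == entity for e, _ in frame)]


def filter_action(video: List[Frame], frames: List[int],
                  entity: str, action: str) -> List[int]:
    """Module parameterized by an action: keep frames where the entity does it."""
    return [i for i in frames if (entity, action) in video[i]]


def run_program(video: List[Frame], mention: str, action: str,
                history: List[str]) -> Tuple[str, List[int]]:
    """Chain the modules: resolve reference, locate entity, filter by action."""
    entity = resolve_entity(mention, history)
    frames = find(video, entity)
    return entity, filter_action(video, frames, entity, action)


# Toy dialogue: an earlier turn mentioned "man"; the current question
# asks about "he" and the action "sit".
video = [
    {("man", "walk")},
    {("man", "sit"), ("dog", "run")},
    {("dog", "run")},
]
entity, frames = run_program(video, "he", "sit", history=["man"])
```

In this sketch the resolved entity and the action play the role of module parameters, and the returned frame indices stand in for the visual cues a later answer decoder would consume.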

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning