VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks
Hung Le, Nancy F. Chen, Steven C.H. Hoi

TL;DR
This paper introduces VGNMN, a neural module network for video-grounded language tasks. It targets two sources of difficulty, temporal variance in the video and cross-turn dependencies in the dialogue, and reports promising results on benchmark datasets.
Contribution
The paper presents a neural module network architecture for video-grounded language tasks. It extends NMNs from images to videos and explicitly decomposes the language components of a dialogue (entity references and action-based inputs) before instantiating the modules that read the video.
Findings
Achieves promising performance on video-grounded dialogue benchmarks.
Effectively models temporal variance and cross-turn dependencies.
Demonstrates the applicability of NMN in complex video-language tasks.
Abstract
Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, very limited work on NMN has been studied in the video-grounded dialogue tasks. These tasks extend the complexity of traditional visual tasks with the additional visual temporal variance and language cross-turn dependencies. Motivated by recent NMN approaches on image-grounded tasks, we introduce Video-grounded Neural Module Network (VGNMN) to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components in dialogues to explicitly resolve any entity references and detect corresponding action-based inputs from the question. The detected entities and actions are used as parameters to instantiate neural module networks and extract visual cues from the video.…
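The pipeline described in the abstract (resolve entity references, detect actions, use both to instantiate modules, then extract visual cues) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the `Frame` structure, the rule-based pronoun replacement, and the module names `find` and `when` are hypothetical stand-ins, not the paper's learned components or its actual module inventory.

```python
from dataclasses import dataclass

# Hypothetical toy video: each frame carries symbolic labels instead of visual features.
@dataclass(frozen=True)
class Frame:
    t: int
    entities: frozenset
    actions: frozenset

PRONOUNS = {"he", "she", "it", "they"}

def resolve_references(question_tokens, history_entities):
    """Replace pronouns with the most recently mentioned entity.

    A toy rule standing in for learned reference resolution.
    """
    return [
        history_entities[-1] if tok in PRONOUNS and history_entities else tok
        for tok in question_tokens
    ]

def build_program(tokens, entity_vocab, action_vocab):
    """Turn detected entities and actions into a list of (module, parameter) steps."""
    entity_steps = [("find", tok) for tok in tokens if tok in entity_vocab]
    action_steps = [("when", tok) for tok in tokens if tok in action_vocab]
    return entity_steps + action_steps

def run_program(program, frames):
    """Apply each module in order, narrowing the set of relevant frames."""
    selected = list(frames)
    for module, param in program:
        if module == "find":    # keep frames where the entity is visible
            selected = [f for f in selected if param in f.entities]
        elif module == "when":  # keep frames where the action occurs
            selected = [f for f in selected if param in f.actions]
    return [f.t for f in selected]

frames = [
    Frame(0, frozenset({"man"}), frozenset({"walking"})),
    Frame(1, frozenset({"man", "laptop"}), frozenset({"sitting"})),
    Frame(2, frozenset({"laptop"}), frozenset()),
]
history_entities = ["man"]  # entities mentioned in earlier dialogue turns
question = "what is he doing while sitting".split()

tokens = resolve_references(question, history_entities)
program = build_program(tokens, {"man", "laptop"}, {"walking", "sitting"})
hits = run_program(program, frames)
print(tokens, program, hits)
```

In this sketch the pronoun "he" is resolved against the dialogue history before any module is instantiated, which mirrors the ordering the abstract describes: language decomposition first, video reasoning second. A real system would replace the symbolic filters with learned attention over video features.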
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
