Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation
Xueliang Zhao, Yuxuan Wang, Chongyang Tao, Chenshuo Wang, Dongyan Zhao

TL;DR
This paper introduces a novel multi-modal reasoning framework for video-grounded dialogue generation, leveraging reasoning paths and multi-agent reinforcement learning to better integrate video and dialogue data with pre-trained language models, leading to significant performance improvements.
Contribution
It proposes a method to extract reasoning paths from videos and employs multi-agent reinforcement learning for collaborative multi-modal reasoning, enhancing integration with pre-trained language models.
Findings
Significant performance improvements over state-of-the-art models.
Effective integration of video and dialogue modalities through reasoning paths.
Superior results on automatic and human evaluations.
Abstract
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video. The primary challenges of this task lie in (1) the difficulty of integrating video data into pre-trained language models (PLMs), which presents obstacles to exploiting the power of large-scale pre-training; and (2) the necessity of taking into account the complementarity of various modalities throughout the reasoning process. Although existing methods have made remarkable progress in video-grounded dialogue generation, they still fall short when it comes to integrating with PLMs in a way that allows information from different modalities to complement each other. To alleviate these issues, we first propose extracting pertinent information from videos and turning it into reasoning paths that are acceptable to PLMs. Additionally, we propose a multi-agent…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
