Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion
Yutao Jin, Bin Liu, Jing Wang

TL;DR
This paper introduces a novel video captioning model that employs dual graphs and gated fusion to better capture spatio-temporal relations and generate more accurate and comprehensive video descriptions, achieving state-of-the-art results.
Contribution
The paper proposes a dual-graphs and gated fusion framework for video captioning, enhancing feature representation of appearance and motion through graph reasoning and multi-level information aggregation.
Findings
Achieves state-of-the-art performance on MSVD and MSR-VTT datasets.
Effectively models appearance and motion features using dual graphs.
Improves semantic understanding of video content through gated fusion.
Abstract
Video captioning models aim to translate the content of videos into accurate natural language. Because of the complex nature of object interactions in a video, comprehensively understanding the spatio-temporal relations of objects remains a challenging task. Existing methods often fail to generate sufficient feature representations of video content. In this paper, we propose a video captioning model based on dual graphs and gated fusion: we adapt two types of graphs to generate feature representations of video content and utilize gated fusion to further understand these different levels of information. Using a dual-graphs model to generate appearance features and motion features separately makes it possible to exploit the content correlation in frames and to generate features from multiple perspectives. Among them, dual-graphs reasoning can enhance the content…
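The abstract names gated fusion as the step that combines appearance and motion features, but this excerpt does not give its exact form. As a rough illustration only, the sketch below shows one common formulation of a gated fusion: a sigmoid gate computed from both inputs decides, per dimension, how much of each feature to keep. The function names, the weight matrices `W_a` and `W_m`, the bias, and the feature size are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    """Logistic function, maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(appearance, motion, W_a, W_m, bias):
    """Fuse two feature vectors with a learned-style gate (hypothetical form).

    gate  = sigmoid(W_a @ appearance + W_m @ motion + bias)
    fused = gate * appearance + (1 - gate) * motion
    """
    proj_a = matvec(W_a, appearance)
    proj_m = matvec(W_m, motion)
    gate = [sigmoid(a + m + b) for a, m, b in zip(proj_a, proj_m, bias)]
    fused = [g * a + (1.0 - g) * m for g, a, m in zip(gate, appearance, motion)]
    return fused, gate


# Toy inputs: random vectors stand in for the appearance and motion
# features that the two graphs would produce in the real model.
rng = random.Random(0)
dim = 4
appearance = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
motion = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
W_a = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
W_m = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
bias = [0.0] * dim

fused, gate = gated_fusion(appearance, motion, W_a, W_m, bias)
```

Because each gate value lies strictly between 0 and 1, every fused dimension is a convex combination of the corresponding appearance and motion values. In a trained model the weights would be learned jointly with the caption decoder; here they are random placeholders.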
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
