Grounded Objects and Interactions for Video Captioning
Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf

TL;DR
This paper introduces SINet-Caption, a video captioning model that grounds language generation in higher-order object interactions for fine-grained video understanding, achieving state-of-the-art results on the ActivityNet Captions dataset.
Contribution
The paper presents a new approach that models higher-order object interactions for video captioning, enhancing fine-grained understanding and grounding of generated descriptions.
Findings
Achieves state-of-the-art results on the ActivityNet Captions dataset.
Demonstrates the effectiveness of grounding language in object interactions.
Highlights benefits of fine-grained video understanding.
Abstract
We address the problem of video captioning by grounding language generation on object interactions in the video. Existing work mostly focuses on overall scene understanding, often with limited or no emphasis on object interactions, to address the problem of video understanding. In this paper, we propose SINet-Caption, which learns to generate captions grounded over higher-order interactions between arbitrary groups of objects for fine-grained video understanding. We discuss the challenges and benefits of such an approach. We further demonstrate state-of-the-art results on the ActivityNet Captions dataset using our model, SINet-Caption, which is based on this approach.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
