Grounded Objects and Interactions for Video Captioning

Chih-Yao Ma; Asim Kadav; Iain Melvin; Zsolt Kira; Ghassan AlRegib; Hans Peter Graf

arXiv:1711.06354 · cs.CV · November 20, 2017 · 5 cites

Open Access

TL;DR

This paper introduces SINet-Caption, a video captioning model that grounds generated language in higher-order object interactions for fine-grained video understanding, and reports state-of-the-art results on the ActivityNet Captions dataset.

Contribution

The paper presents a new approach that models higher-order object interactions for video captioning, enhancing fine-grained understanding and grounding of generated descriptions.

Findings

1. Achieves state-of-the-art results on the ActivityNet Captions dataset.

2. Demonstrates the effectiveness of grounding language in object interactions.

3. Highlights the benefits of fine-grained video understanding.

Abstract

We address the problem of video captioning by grounding language generation on object interactions in the video. Existing work on video understanding mostly focuses on overall scene understanding, often with limited or no emphasis on object interactions. In this paper, we propose SINet-Caption, which learns to generate captions grounded over higher-order interactions between arbitrary groups of objects for fine-grained video understanding. We discuss the challenges and benefits of such an approach. We further demonstrate state-of-the-art results on the ActivityNet Captions dataset using SINet-Caption.

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning