Semantic Grouping Network for Video Captioning
Hobin Ryu, Sunghun Kang, Haeyong Kang, and Chang D. Yoo

TL;DR
The Semantic Grouping Network (SGN) is a video captioning model that dynamically aligns groups of video frames with phrases from the partially decoded caption. This alignment, guided by contrastive attention, reduces redundancy across frames and improves caption accuracy, yielding state-of-the-art results.
Contribution
The paper introduces the SGN, which groups video frames according to the discriminating word phrases of the partially decoded caption. Feedback from the decoded words lets the model update its video representation dynamically, which improves captioning accuracy.
Findings
Outperforms previous methods by 2.1% and 2.4% in CIDEr-D score on the MSVD and MSR-VTT datasets, respectively.
Effectively reduces redundancy by clustering semantically related frames.
Demonstrates high interpretability and effectiveness through extensive experiments.
Abstract
This paper considers a video caption generating network referred to as Semantic Grouping Network (SGN) that attempts (1) to group video frames with discriminating word phrases of partially decoded caption and then (2) to decode those semantically aligned groups in predicting the next word. As consecutive frames are not likely to provide unique information, prior methods have focused on discarding or merging repetitive information based only on the input video. The SGN learns an algorithm to capture the most discriminating word phrases of the partially decoded caption and a mapping that associates each phrase to the relevant video frames - establishing this mapping allows semantically related frames to be clustered, which reduces redundancy. In contrast to the prior methods, the continuous feedback from decoded words enables the SGN to dynamically update the video representation that…
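The phrase-to-frame mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in plain scaled dot-product attention for whatever learned mapping the SGN uses, and the function `group_frames`, the toy 2-dimensional vectors, and all other names are hypothetical. Each phrase vector attends over the frame vectors, and the attention-weighted average of the frames stands in for one "semantic group".

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def group_frames(phrases, frames):
    """Hypothetical helper: for each phrase vector, attend over all frame
    vectors and return (attention weights, attention-weighted frame average).
    Assumes phrases and frames share the same embedding dimension."""
    dim = len(frames[0])
    all_weights, groups = [], []
    for p in phrases:
        # Scaled dot-product similarity between the phrase and every frame.
        scores = [dot(p, f) / math.sqrt(dim) for f in frames]
        w = softmax(scores)
        # One "semantic group": frames pooled by their relevance to the phrase.
        group = [sum(w[t] * frames[t][d] for t in range(len(frames)))
                 for d in range(dim)]
        all_weights.append(w)
        groups.append(group)
    return all_weights, groups

# Toy data (assumed): frames 0-1 are near-duplicates, as are frames 2-3.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Two phrase vectors, each pointing toward one cluster of frames.
phrases = [[4.0, 0.0], [0.0, 4.0]]

weights, groups = group_frames(phrases, frames)
for i, w in enumerate(weights):
    print(f"phrase {i}: weights = {[round(x, 3) for x in w]}")
```

In this toy setting the four frames collapse into two group vectors, one per phrase, which is the sense in which grouping by phrases reduces redundancy among near-duplicate frames. In the SGN the phrases change as decoding proceeds, so the grouping would be recomputed at each step; the sketch shows a single step only.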
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
