Video Captioning in Compressed Video
Mingjian Zhu, Chenrui Duan, Changbin Yu

TL;DR
This paper introduces a novel video captioning method that directly utilizes compressed video data, leveraging residuals for saliency detection and a temporal gate to improve caption accuracy, demonstrating effectiveness on benchmark datasets.
Contribution
The paper presents a residuals-assisted encoder and a temporal gate module for improved video captioning directly from compressed videos, reducing reliance on uncompressed data.
Findings
Effective captioning performance on benchmark datasets
Improved focus on salient regions via residuals-based attention
Robustness against noisy signals in compressed videos
Abstract
Existing approaches in video captioning concentrate on exploring global frame features in uncompressed videos, while the free-of-charge and critical saliency information already encoded in compressed videos is generally neglected. We propose a video captioning method which operates directly on the stored compressed videos. To learn a discriminative visual representation for video captioning, we design a residuals-assisted encoder (RAE), which spots regions of interest in I-frames with the assistance of the residual frames. First, we obtain the spatial attention weights by extracting features of residuals as the saliency value of each location in the I-frame and design a spatial attention module to refine the attention weights. We further propose a temporal gate module to determine how much the attended features contribute to the caption generation, which enables the model to…
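The two ideas in the abstract (residual-derived spatial attention over I-frame features, followed by a gate on the attended features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the channel-wise average/max pooling, the way the two pooled maps are summed, and the gate that mixes attended features with a global I-frame feature are all assumptions made for illustration. The paper's learned convolution layers are replaced here by fixed operations and a random weight vector.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def residual_spatial_attention(iframe_feat, residual_feat):
    """Weight I-frame features by a saliency map derived from residual features.

    Both inputs have shape (C, H, W). Assumption: saliency is the sigmoid of
    the sum of channel-wise average and max pooling of the residual features;
    a trained model would refine this with learned layers.
    """
    avg_map = residual_feat.mean(axis=0)      # (H, W)
    max_map = residual_feat.max(axis=0)       # (H, W)
    attn = sigmoid(avg_map + max_map)         # values in (0, 1)
    attended = iframe_feat * attn[None, :, :]  # broadcast over channels
    return attended, attn


def temporal_gate(attended_vec, global_vec, w, b=0.0):
    """Scalar gate deciding how much the attended features contribute.

    Assumption: the gate is a sigmoid over a linear projection of both
    vectors, and the output is a convex mix of the two.
    """
    g = sigmoid(np.dot(w, np.concatenate([attended_vec, global_vec])) + b)
    fused = g * attended_vec + (1.0 - g) * global_vec
    return fused, g


# Toy usage with random features (hypothetical sizes: 8 channels, 4x4 grid).
rng = np.random.default_rng(0)
iframe = rng.standard_normal((8, 4, 4))
residual = rng.standard_normal((8, 4, 4))

attended, attn = residual_spatial_attention(iframe, residual)
attended_vec = attended.mean(axis=(1, 2))  # spatially pooled attended feature
global_vec = iframe.mean(axis=(1, 2))      # spatially pooled global feature
w = 0.1 * rng.standard_normal(16)          # stand-in for learned gate weights
fused, gate = temporal_gate(attended_vec, global_vec, w)
```

A gate value near 0 lets the fused feature fall back on the global I-frame feature, which is one plausible way a model could down-weight attended features when the residual signal is noisy.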
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging
Methods: Average Pooling · Convolution · Max Pooling · Sigmoid Activation
