Video Captioning in Compressed Video

Mingjian Zhu; Chenrui Duan; Changbin Yu

arXiv:2101.00359·cs.CV·January 5, 2021

Open Access

TL;DR

This paper proposes a video captioning method that operates directly on compressed video. It uses residual frames as a saliency signal for locating regions of interest in I-frames, adds a temporal gate that controls how much the attended features contribute to caption generation, and reports effective results on benchmark datasets.

Contribution

The paper presents a residuals-assisted encoder (RAE) and a temporal gate module for captioning videos directly from their compressed form, rather than relying on uncompressed frames.

Findings

01. Effective captioning performance on benchmark datasets

02. Improved focus on salient regions via residuals-based attention

03. Robustness against noisy signals in compressed videos

Abstract

Existing approaches in video captioning concentrate on exploring global frame features in the uncompressed videos, while the free of charge and critical saliency information already encoded in the compressed videos is generally neglected. We propose a video captioning method which operates directly on the stored compressed videos. To learn a discriminative visual representation for video captioning, we design a residuals-assisted encoder (RAE), which spots regions of interest in I-frames under the assistance of the residuals frames. First, we obtain the spatial attention weights by extracting features of residuals as the saliency value of each location in I-frame and design a spatial attention module to refine the attention weights. We further propose a temporal gate module to determine how much the attended features contribute to the caption generation, which enables the model to…
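The two mechanisms named in the abstract can be illustrated with a minimal sketch. It assumes that residual features are pooled per location and squashed by a sigmoid to form saliency weights, and that the gate is a sigmoid over a decoder state. The function names, the pooling-then-sigmoid wiring, and the gate parameters `gate_w` and `gate_b` are hypothetical choices for illustration; the abstract does not specify the actual architecture, so this should not be read as the authors' implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention(residual_feats):
    """Map per-location residual feature vectors to saliency weights in (0, 1).

    Assumption: average- and max-pooling over channels followed by a sigmoid
    stands in for the learned spatial attention module.
    """
    weights = []
    for channels in residual_feats:
        avg = sum(channels) / len(channels)  # average pooling over channels
        mx = max(channels)                   # max pooling over channels
        weights.append(sigmoid(avg + mx))
    return weights

def attend(iframe_feats, weights):
    """Weighted average of I-frame location features using the saliency weights."""
    total = sum(weights)
    dim = len(iframe_feats[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, iframe_feats)) / total
        for d in range(dim)
    ]

def temporal_gate(attended, decoder_state, gate_w, gate_b):
    """Scale the attended feature by a scalar gate computed from the decoder state.

    gate_w and gate_b are hypothetical learned parameters.
    """
    g = sigmoid(sum(w * h for w, h in zip(gate_w, decoder_state)) + gate_b)
    return g, [g * a for a in attended]

# Toy input: two spatial locations, two channels each.
residual = [[0.0, 0.0], [2.0, 4.0]]  # second location has large residuals
iframe = [[1.0, 0.0], [0.0, 1.0]]    # I-frame features at the same locations
weights = spatial_attention(residual)
context = attend(iframe, weights)
gate, gated = temporal_gate(context, [0.5, -0.5], [1.0, 1.0], 0.0)
print(weights, context, gate, gated)
```

In this toy input the location with larger residuals receives the larger weight, so the attended feature leans toward that location's I-frame feature. The gate then scales the attended feature before it would be passed to a caption decoder.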

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging

Methods: Average Pooling · Convolution · Max Pooling · Sigmoid Activation