Adaptive Feature Abstraction for Translating Video to Text
Yunchen Pu, Martin Renqiang Min, Zhe Gan, Lawrence Carin

TL;DR
This paper introduces an attention mechanism for video captioning that adaptively selects features across multiple CNN layers and local spatiotemporal regions, instead of relying on the output of a single fixed layer, with the aim of producing more semantically rich descriptions.
Contribution
It proposes a new adaptive attention mechanism that sequentially selects features from multiple CNN layers, and from local spatiotemporal regions of each layer's feature maps, for video-to-text translation.
Findings
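Quantitative evaluation on three benchmark datasets (YouTube2Text, M-VAD and MSR-VTT) supports the effectiveness of the adaptive approach.
Visualizations illustrate the results and how the model selects features.
The adaptive representation is reported to outperform features taken from a single fixed CNN layer.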
Abstract
Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features. However, the variable context-dependent semantics in the video may make it more appropriate to adaptively select features from multiple CNN layers. We propose a new approach for generating adaptive spatiotemporal representations of videos for the captioning task. A novel attention mechanism is developed that adaptively and sequentially focuses on different layers of CNN features (levels of feature "abstraction"), as well as local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizing the results and how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal…
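The idea of attending over both CNN layers and local regions can be illustrated with a small NumPy sketch. This is a generic two-level soft attention, not the paper's exact formulation: it assumes the per-layer feature maps have already been brought to a common shape (L layers, R regions, D channels), and the function names, parameter names (W_a, W_h, w, U_a, U_h, u) and scoring function are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(D, H, K, seed=0):
    # Hypothetical parameter set: D = feature channels, H = decoder
    # hidden size, K = attention projection size.
    rng = np.random.default_rng(seed)
    return {
        "W_a": rng.normal(scale=0.1, size=(D, K)),
        "W_h": rng.normal(scale=0.1, size=(H, K)),
        "w": rng.normal(scale=0.1, size=(K,)),
        "U_a": rng.normal(scale=0.1, size=(D, K)),
        "U_h": rng.normal(scale=0.1, size=(H, K)),
        "u": rng.normal(scale=0.1, size=(K,)),
    }


def two_level_attention(features, hidden, params):
    """Soft attention over regions within each layer, then over layers.

    features: array of shape (L, R, D), assumed already aligned across layers.
    hidden:   decoder hidden state of shape (H,).
    Returns (context, alpha, beta) with shapes (D,), (L, R), (L,).
    """
    # Region-level attention: score every region of every layer
    # against the current decoder state.
    proj = np.tanh(features @ params["W_a"] + hidden @ params["W_h"])  # (L, R, K)
    alpha = softmax(proj @ params["w"], axis=1)                        # (L, R)
    layer_ctx = (alpha[..., None] * features).sum(axis=1)              # (L, D)

    # Layer-level attention: weight the per-layer summaries.
    layer_proj = np.tanh(layer_ctx @ params["U_a"] + hidden @ params["U_h"])  # (L, K)
    beta = softmax(layer_proj @ params["u"], axis=0)                           # (L,)
    context = beta @ layer_ctx                                                 # (D,)
    return context, alpha, beta
```

In a captioning decoder, a function like this would be called once per generated word with the current hidden state, so the weights over layers (beta) and regions (alpha) can change from word to word; those weights are also what one would plot to visualize which layer and region the model attends to.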
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
