EVC-MF: End-to-end Video Captioning Network with Multi-scale Features

Tian-Zi Niu; Zhen-Duo Chen; Xin Luo; Xin-Shun Xu

arXiv:2410.16624·cs.CV·October 23, 2024

Open Access

TL;DR

This paper introduces EVC-MF, an end-to-end video captioning network that directly learns multi-scale visual features and effectively fuses them, improving adaptability and reducing redundancy compared to traditional offline-feature-based methods.

Contribution

The proposed EVC-MF model is described as the first end-to-end framework that learns multi-scale features directly from video frames for captioning, which removes the reliance on fixed offline extractors.

Findings

01. Achieves competitive results on benchmark datasets.

02. Reduces feature redundancy and improves feature utilization.

03. Demonstrates adaptability by updating the feature extractor's parameters during training.

Abstract

Conventional approaches to video captioning rely on a variety of offline-extracted features to generate captions. Although many offline feature extractors are available and provide diverse information from different perspectives, their fixed parameters impose several limitations. Concretely, these extractors are pre-trained solely on image/video comprehension tasks, making them less adaptable to video captioning datasets. Additionally, most of these extractors only capture features prior to the classifier of the pre-training task, ignoring a significant amount of valuable shallow information. Furthermore, employing multiple offline features may introduce redundant information. To address these issues, we propose an end-to-end encoder-decoder-based network (EVC-MF) for video captioning, which efficiently utilizes multi-scale visual and textual features to generate video…
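The abstract contrasts extractors that keep only the features just before the classifier with a network that also uses shallower, multi-scale features. The toy sketch below illustrates that contrast only; it is not the EVC-MF architecture. The backbone is a stand-in made of plain 2x2 average pooling with no learnable parameters, fusion is assumed to be simple concatenation, and every name (`FeatureMap`, `stage_outputs`, `last_stage_only`, `fuse_multi_scale`) is hypothetical.

```python
from typing import List

FeatureMap = List[List[float]]  # toy single-channel H x W map (hypothetical type)


def avg_pool_2x2(x: FeatureMap) -> FeatureMap:
    """Halve the spatial resolution by averaging non-overlapping 2x2 windows."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [
        [
            (x[2 * i][2 * j] + x[2 * i][2 * j + 1]
             + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def stage_outputs(frame: FeatureMap, num_stages: int = 3) -> List[FeatureMap]:
    """Run the stand-in backbone and keep the output of every stage,
    ordered from shallow (high resolution) to deep (low resolution)."""
    outputs = []
    x = frame
    for _ in range(num_stages):
        x = avg_pool_2x2(x)
        outputs.append(x)
    return outputs


def flatten(x: FeatureMap) -> List[float]:
    """Row-major flattening of a 2D map into a vector."""
    return [v for row in x for v in row]


def last_stage_only(frame: FeatureMap) -> List[float]:
    """Offline-extractor style: keep only the deepest feature."""
    return flatten(stage_outputs(frame)[-1])


def fuse_multi_scale(frame: FeatureMap) -> List[float]:
    """Multi-scale style: concatenate features from every stage
    (concatenation is an assumption, not the paper's fusion method)."""
    fused: List[float] = []
    for fmap in stage_outputs(frame):
        fused.extend(flatten(fmap))
    return fused


# An 8x8 toy frame with values 0..63.
frame = [[float(8 * r + c) for c in range(8)] for r in range(8)]
print(len(last_stage_only(frame)))   # deepest stage only: 1 value
print(len(fuse_multi_scale(frame)))  # 4x4 + 2x2 + 1x1 = 21 values
```

In this sketch the multi-scale vector retains the 4x4 and 2x2 maps that the last-stage-only path discards. End-to-end training, in which the extractor's parameters are updated together with the caption decoder, is not shown because the stand-in backbone has nothing to learn.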

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Vision and Imaging · Video Analysis and Summarization