MAMS: Model-Agnostic Module Selection Framework for Video Captioning

Sangho Lee; Il Yong Chun; Hogun Park

arXiv:2501.18269 · cs.CV · October 9, 2025

TL;DR

This paper introduces a model-agnostic framework for adaptive frame selection and token subset construction in video captioning, improving caption quality by focusing on important visual information.

Contribution

It proposes the first adaptive, model-agnostic module selection framework and an attention masking scheme to enhance video captioning performance.

Findings

01. Significant performance improvements on benchmark datasets.

02. Effective selection of relevant visual tokens.

03. Enhanced attention on important visual features.

Abstract

Multi-modal transformers are rapidly gaining attention in video captioning. Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges. When too few frames are extracted, frames carrying information essential for caption generation may be missed. Conversely, when too many frames are extracted, many of them are consecutive, which can make the visual tokens extracted from them redundant. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework for video captioning, which has two main functions: (1) selecting a caption generation module with an appropriate size based on visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module.…
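The two functions named in the abstract can be illustrated with a small sketch. The abstract does not say how the framework scores visual tokens or chooses a module, so everything below is an assumption made for illustration only: the redundancy measure (mean cosine similarity between consecutive frame features), the candidate sizes, the thresholds, and the uniform subsampling are hypothetical and are not the authors' method.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def redundancy(frame_tokens: List[Vector]) -> float:
    """Hypothetical redundancy score: mean similarity of consecutive frames."""
    if len(frame_tokens) < 2:
        return 1.0
    sims = [cosine(a, b) for a, b in zip(frame_tokens, frame_tokens[1:])]
    return sum(sims) / len(sims)


def select_module_size(
    frame_tokens: List[Vector],
    sizes: Tuple[int, int, int] = (4, 8, 16),       # assumed candidate sizes
    thresholds: Tuple[float, float] = (0.9, 0.6),   # assumed cut-offs
) -> int:
    """Function (1): pick a caption-module size from the visual tokens.

    Highly redundant videos get the smallest module; varied videos get
    the largest. The rule itself is an illustrative assumption.
    """
    score = redundancy(frame_tokens)
    if score >= thresholds[0]:
        return sizes[0]
    if score >= thresholds[1]:
        return sizes[1]
    return sizes[2]


def build_subset(frame_tokens: List[Vector], size: int) -> List[Vector]:
    """Function (2): build a token subset for the selected module.

    Uses uniform temporal subsampling as a stand-in for whatever
    subset construction the paper actually uses.
    """
    n = len(frame_tokens)
    if n <= size:
        return list(frame_tokens)
    step = n / size
    return [frame_tokens[int(i * step)] for i in range(size)]


# Usage: a nearly static clip is routed to the smallest module.
static_clip = [[1.0, 0.0] for _ in range(16)]
chosen = select_module_size(static_clip)
subset = build_subset(static_clip, chosen)
```

The point of the sketch is only the control flow: the size of the caption generation module is decided per video from its visual tokens, and the token subset is then built to match that module, rather than fixing the number of frames in advance.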

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Softmax · Attention Is All You Need