Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset

Yuchen Yang; Yingxuan Duan

arXiv:2406.13809·cs.MM·June 21, 2024

TL;DR

This paper proposes an automatic, multifaceted approach to enhance video-language datasets with detailed, context-aware descriptions, improving the quality of language-video representations for retrieval tasks.

Contribution

It introduces a novel method combining multifaceted captioning and language model-based description generation to improve dataset quality for better video understanding.

Findings

01 Enhanced dataset improves retrieval performance

02 Multifaceted captions capture richer video information

03 Language model-generated descriptions are high-quality and scalable

Abstract

A more robust and holistic language-video representation is key to advancing video understanding. Despite improvements in training strategies, the quality of language-video datasets has received less attention. Current text descriptions are plain and simple, and language-video tasks focus on visual content alone, which limits capacity in real-world natural language video retrieval, where queries are much more complex. This paper introduces a method to automatically enhance video-language datasets, making them more modality- and context-aware for more sophisticated representation learning needs, and hence helping all downstream tasks. Our multifaceted video captioning method captures entities, actions, speech transcripts, aesthetics, and emotional cues, providing detailed, correlated information from the text side to the video side for training. We also develop an…
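As a rough illustration of the idea in the abstract, the five facets it names (entities, actions, speech transcripts, aesthetics, emotional cues) could be collected per video and merged into a single prompt for a language model. The sketch below is only an assumption about how such a step might look: the `FacetCaptions` container, the `build_prompt` function, the field names, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FacetCaptions:
    """Hypothetical per-video container for facet outputs (not the paper's schema)."""
    entities: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    transcript: str = ""
    aesthetics: str = ""
    emotion: str = ""


def build_prompt(facets: FacetCaptions) -> str:
    """Merge the non-empty facets into one text prompt for a language model.

    The instruction line and labels are illustrative assumptions only.
    """
    lines = ["Write one detailed description of the video from these cues."]
    if facets.entities:
        lines.append("Entities: " + ", ".join(facets.entities))
    if facets.actions:
        lines.append("Actions: " + ", ".join(facets.actions))
    if facets.transcript:
        lines.append("Transcript: " + facets.transcript)
    if facets.aesthetics:
        lines.append("Aesthetics: " + facets.aesthetics)
    if facets.emotion:
        lines.append("Emotion: " + facets.emotion)
    return "\n".join(lines)


if __name__ == "__main__":
    example = FacetCaptions(
        entities=["man", "guitar"],
        actions=["playing"],
        emotion="calm",
    )
    print(build_prompt(example))
```

Empty facets are skipped so that a video without speech, for example, does not contribute a blank transcript line to the prompt. The resulting string would then be passed to whichever language model generates the final description; that call is deliberately left out here.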

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling

Methods: Softmax · Attention Is All You Need · Focus