LGDN: Language-Guided Denoising Network for Video-Language Modeling

Haoyu Lu; Mingyu Ding; Nanyi Fei; Yuqi Huo; Zhiwu Lu

arXiv:2209.11388 · cs.CV · December 6, 2022 · 5 cites

TL;DR

LGDN is a novel video-language model that dynamically filters noisy frames guided by language, focusing on salient content to improve cross-modal alignment and outperform existing methods on multiple datasets.

Contribution

The paper introduces LGDN, a new model that effectively filters irrelevant video frames using language guidance, addressing noise issues in video-language modeling.
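The general idea of language-guided frame filtering can be illustrated with a minimal sketch. This is not the paper's actual mechanism, which is not detailed on this page. It assumes that per-frame and text embeddings are already available in a shared space, that relevance is scored by cosine similarity, and that a fixed number `k` of frames is kept. The names `cosine` and `select_salient_frames` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_salient_frames(frame_embs, text_emb, k):
    """Keep the k frames whose embeddings are most similar to the text embedding.

    Assumption for illustration only: relevance is plain cosine similarity
    and k is fixed. Returns (kept_indices_in_temporal_order, all_scores).
    """
    scores = [cosine(f, text_emb) for f in frame_embs]
    # Rank frame indices from most to least relevant to the text.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then restore temporal order.
    kept = sorted(ranked[:max(k, 0)])
    return kept, scores


# Toy example: four 2-D "frame embeddings" and one "text embedding".
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text = [1.0, 0.0]
kept, scores = select_salient_frames(frames, text, k=2)
print(kept)  # [0, 2]
```

In this toy input, frames 1 and 3 stand in for irrelevant content and are dropped, while the two frames closest to the text are passed on in their original order.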

Findings

1. LGDN outperforms state-of-the-art methods on five datasets.
2. Filtering noisy frames improves cross-modal alignment.
3. Ablation studies highlight the importance of noise reduction.

Abstract

Video-language modeling has attracted much attention with the rapid growth of web videos. Most existing methods assume that the video frames and the text description are semantically correlated, and focus on video-language modeling at the video level. However, this assumption often fails for two reasons: (1) given the rich semantics of video content, it is difficult to cover all frames with a single video-level description; (2) a raw video typically contains noisy or meaningless information (e.g., scenery shots, transitions, or teasers). Although a number of recent works deploy attention mechanisms to alleviate this problem, the irrelevant or noisy information remains very difficult to handle. To overcome this challenge, we propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN), for video-language modeling. Different from most existing methods that utilize all…

Videos

LGDN: Language-Guided Denoising Network for Video-Language Modeling · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition