LGDN: Language-Guided Denoising Network for Video-Language Modeling
Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, Zhiwu Lu

TL;DR
LGDN is a video-language model that uses language guidance to dynamically filter out noisy frames, so that cross-modal alignment focuses on salient content. It outperforms existing methods on five datasets.
Contribution
The paper introduces LGDN, a new model that effectively filters irrelevant video frames using language guidance, addressing noise issues in video-language modeling.
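To make the idea of language-guided frame filtering concrete, here is a minimal sketch. It is not the authors' implementation: it assumes each frame and the text description have already been embedded in a shared vector space, scores each frame by cosine similarity to the text, and keeps the top-k frames. The names `cosine`, `select_salient_frames`, and the parameter `k` are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_salient_frames(frame_embs, text_emb, k):
    """Keep the k frames most similar to the text (illustrative assumption,
    not the scoring rule from the paper). Returns kept indices in temporal
    order together with all relevance scores."""
    scores = [cosine(f, text_emb) for f in frame_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore the original frame order
    return keep, scores

# Toy 2-D embeddings: frames 0 and 2 point roughly the same way as the text.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text = [1.0, 0.0]
keep, scores = select_salient_frames(frames, text, k=2)
print(keep)
```

In this toy input the unrelated frame (index 1) and the opposed frame (index 3) are dropped, so only the retained frames would be passed on to the cross-modal alignment step.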
Findings
LGDN outperforms state-of-the-art methods on five datasets.
Filtering noisy frames improves cross-modal alignment.
Ablation studies highlight the importance of noise reduction.
Abstract
Video-language modeling has attracted much attention with the rapid growth of web videos. Most existing methods assume that the video frames and text description are semantically correlated, and focus on video-language modeling at the video level. However, this hypothesis often fails for two reasons: (1) with the rich semantics of video content, it is difficult to cover all frames with a single video-level description; (2) a raw video typically contains noisy or meaningless information (e.g., scenery shots, transitions, or teasers). Although a number of recent works deploy attention mechanisms to alleviate this problem, the irrelevant or noisy information still makes it very difficult to address. To overcome this challenge, we propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN), for video-language modeling. Different from most existing methods that utilize all…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
