CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
Yolo Yunlong Tang, Gen Zhan, Li Yang, Yiting Liao, Chenliang Xu

TL;DR
CaRDiff introduces a novel multimodal framework combining language reasoning and diffusion models to improve video saliency prediction, emphasizing the role of language in ranking and interpreting visual attention.
Contribution
The paper presents CaRDiff, which integrates a multimodal large language model, a grounding module, and a diffusion model with a new prompting method, VSOR-CoT, to improve video saliency prediction.
Findings
Outperforms state-of-the-art methods on the MVS dataset
Demonstrates effective zero-shot generalization on the DHF1K dataset
Validates the benefit of language-guided reasoning in saliency prediction
Abstract
Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, where ranking cues are crucial outcomes of this process and practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates the process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model, to enhance video saliency prediction. Specifically, we introduce a novel prompting method VSOR-CoT (Video…
Taxonomy
Topics: Visual Attention and Saliency Detection
Methods: Softmax · Attention Is All You Need · Diffusion · Focus
