CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

Yolo Yunlong Tang; Gen Zhan; Li Yang; Yiting Liao; Chenliang Xu

arXiv:2408.12009·cs.CV·October 9, 2025

TL;DR

CaRDiff is a multimodal framework that combines language reasoning with a diffusion model to improve video saliency prediction, using language to rank salient objects and interpret visual attention.

Contribution

The paper presents CaRDiff, which integrates a multimodal large language model, a grounding module, and a diffusion model, together with a new prompting method (VSOR-CoT), for video saliency prediction.

Findings

01. Outperforms state-of-the-art methods on the MVS dataset

02. Demonstrates effective zero-shot generalization on the DHF1k dataset

03. Validates the benefit of language-guided reasoning in saliency prediction

Abstract

Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, where ranking cues are crucial outcomes of this process and practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates the process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model, to enhance video saliency prediction. Specifically, we introduce a novel prompting method VSOR-CoT (Video…
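
The abstract describes ranking salient objects and grounding them before a diffusion model generates the saliency map. As a rough illustration of how a ranked list of grounded boxes could be collapsed into a single conditioning map, here is a minimal sketch. The function name `ranking_map`, the `(x0, y0, x1, y1)` box format, and the linear rank-to-value scheme are all assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, end-exclusive.
Box = Tuple[int, int, int, int]


def ranking_map(height: int, width: int, ranked_boxes: List[Box]) -> List[List[float]]:
    """Paint ranked boxes into a 2D map; more salient ranks get larger values.

    ranked_boxes[0] is assumed to be the most salient object.
    """
    n = len(ranked_boxes)
    grid = [[0.0] * width for _ in range(height)]
    # Paint the lowest rank first so higher-ranked objects win in overlaps.
    for rank in range(n - 1, -1, -1):
        x0, y0, x1, y1 = ranked_boxes[rank]
        value = (n - rank) / n  # rank 0 -> 1.0, last rank -> 1/n (assumed scheme)
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = value
    return grid


# Two overlapping objects on a 4x4 frame; the first is ranked most salient.
demo = ranking_map(4, 4, [(0, 0, 2, 2), (1, 1, 4, 4)])
```

In this sketch, background pixels stay at zero and overlapping regions take the value of the higher-ranked object. Such a map could then serve as a conditioning input alongside the video frames, though how CaRDiff actually conditions its diffusion model is not specified in the text above.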

Taxonomy

Topics: Visual Attention and Saliency Detection

Methods: Softmax · Attention Is All You Need · Diffusion · Focus