TL;DR
SteerSeg improves language-driven video object segmentation by steering an LVLM's attention maps with learnable soft prompts and reasoning-guided Chain-of-Thought (CoT) prompts, sharpening spatial grounding without retraining the large model.
Contribution
SteerSeg introduces an attention-steering method that conditions the model at the input level, combining learnable soft prompts with Chain-of-Thought (CoT) prompting to improve spatial localization in video reasoning segmentation.
Findings
Significantly improves grounding accuracy on multiple benchmarks.
Maintains pretrained reasoning capabilities while enhancing spatial localization.
Generalizes well across diverse video segmentation datasets.
Abstract
Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to…
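To make the idea of using attention maps as spatial priors concrete, here is a minimal sketch, assuming the attention from one text token to the visual patch tokens has already been extracted as an array of shape (num_heads, num_patches). The function name `attention_to_prior`, the head averaging, the top-k thresholding, and the entropy-based diffuseness score are illustrative assumptions, not SteerSeg's actual pipeline, which the abstract does not specify at this level of detail.

```python
import numpy as np

def attention_to_prior(attn, grid_hw, keep_ratio=0.1):
    """Turn text-to-patch attention into a coarse spatial prior.

    attn:       (num_heads, num_patches) attention weights (hypothetical input).
    grid_hw:    (height, width) of the visual patch grid.
    keep_ratio: fraction of patches kept in the binary prior (assumed heuristic).
    """
    h, w = grid_hw
    mean_attn = attn.mean(axis=0)                      # average over heads
    heat = mean_attn.reshape(h, w)
    # Min-max normalize to [0, 1]; epsilon guards against a flat map.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

    # Entropy of the attention distribution: higher means more diffuse.
    p = mean_attn / mean_attn.sum()
    entropy = float(-(p * np.log(p + 1e-12)).sum())

    # Keep the top-k most attended patches as a binary prior
    # (ties at the threshold are all kept).
    k = max(1, int(round(keep_ratio * h * w)))
    thresh = np.sort(heat.ravel())[-k]
    mask = heat >= thresh

    # Peak location, usable as a point prompt for a segmenter.
    row, col = np.unravel_index(int(np.argmax(heat)), heat.shape)
    return heat, mask, (int(row), int(col)), entropy

# Toy comparison: a peaked map versus a perfectly diffuse one on a 4x4 grid.
peaked = np.full((2, 16), 0.01)
peaked[:, 5] = 0.85
diffuse = np.full((2, 16), 1.0 / 16)
_, _, peak_point, peaked_entropy = attention_to_prior(peaked, (4, 4))
_, _, _, diffuse_entropy = attention_to_prior(diffuse, (4, 4))
print(peak_point, peaked_entropy < diffuse_entropy)
```

In this toy setup the diffuse map has higher entropy and gives no usable peak, which is the kind of ambiguous grounding signal the abstract attributes to attention optimized for text generation.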
