TL;DR
This paper introduces DecAF, a training-free method that refines attention maps in multimodal large language models to produce accurate video segmentation masks without retraining.
Contribution
DecAF (Decomposed Attention Fusion) refines MLLM attention maps through contrastive object-background fusion and complementary video-frame fusion, enabling video segmentation without additional training.
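To make the two fusion steps concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the summary does not give the fusion formulas, so the subtraction-and-clip form, the convex combination, the `alpha` weight, and all function names below are hypothetical.

```python
import numpy as np


def normalize(m):
    """Min-max scale a map to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=np.float64)
    span = m.max() - m.min()
    if span == 0:
        return np.zeros_like(m)
    return (m - m.min()) / span


def contrastive_fusion(obj_map, bg_map):
    """Assumed form: subtract the background-query map from the
    object-query map and keep only the positive evidence."""
    diff = normalize(obj_map) - normalize(bg_map)
    return normalize(np.clip(diff, 0.0, None))


def video_frame_fusion(video_map, frame_map, alpha=0.5):
    """Assumed form: convex combination of a map computed from the whole
    video and a map computed from a single frame (alpha is hypothetical)."""
    mixed = alpha * normalize(video_map) + (1.0 - alpha) * normalize(frame_map)
    return normalize(mixed)


# Toy 2x2 attention maps: the object query fires on the top-left cell and,
# more weakly, on two cells that the background query also fires on.
obj = np.array([[1.0, 0.8], [0.8, 0.0]])
bg = np.array([[0.0, 0.8], [0.8, 0.0]])
fused = contrastive_fusion(obj, bg)
```

In this toy case the cells shared with the background query are zeroed out and only the top-left cell survives, which is the kind of suppression of irrelevant activations the summary describes.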
Findings
DecAF outperforms existing training-free methods.
DecAF achieves results comparable to those of training-based methods.
The method effectively suppresses irrelevant activations and highlights objects.
Abstract
Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via a rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates…
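The abstract's pipeline (rollout, conversion to a coarse mask, point prompting) can be sketched in a few lines of NumPy. This is a hedged illustration only: it uses the generic attention rollout recipe (averaging each layer's attention with the identity to account for residual connections, then multiplying across layers), which may differ from the variant used in the paper. The threshold value, the peak-based point prompt, and every function name here are assumptions, and no SAM2 API is called.

```python
import numpy as np


def attention_rollout(layer_attns):
    """Generic attention rollout over a list of (tokens, tokens)
    head-averaged attention matrices, ordered from first to last layer."""
    n = layer_attns[0].shape[0]
    rollout = np.eye(n)
    for attn in layer_attns:
        a = 0.5 * attn + 0.5 * np.eye(n)       # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)  # keep each row a distribution
        rollout = a @ rollout
    return rollout


def to_coarse_mask(attn_map, threshold=0.5):
    """Min-max scale a 2D attention map and threshold it into a boolean
    mask. The 0.5 threshold is a hypothetical default."""
    lo, hi = attn_map.min(), attn_map.max()
    if hi == lo:
        return np.zeros(attn_map.shape, dtype=bool)
    return (attn_map - lo) / (hi - lo) >= threshold


def peak_point(attn_map):
    """Location of the strongest activation as (x, y), one simple way to
    derive a point prompt for a promptable segmenter."""
    row, col = np.unravel_index(np.argmax(attn_map), attn_map.shape)
    return int(col), int(row)


# Toy example: three layers of random row-stochastic attention over 4 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 4))
attns = [np.exp(l) / np.exp(l).sum(axis=-1, keepdims=True) for l in logits]
rolled = attention_rollout(attns)

# Toy 2x2 spatial map standing in for a refined attention map.
toy_map = np.array([[0.1, 0.9], [0.2, 0.3]])
mask = to_coarse_mask(toy_map)
point = peak_point(toy_map)
```

In a real MLLM, the row of the rollout matrix belonging to a query or answer token would be restricted to the visual-token columns and reshaped to the frame's patch grid before thresholding. That step depends on the model's token layout and is omitted here.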
