Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

Su Ho Han; Jeongseok Hyun; Pilhyeon Lee; Minho Shim; Dongyoon Wee; Seon Joo Kim

arXiv:2510.19592·cs.CV·April 27, 2026

TL;DR

This paper introduces DecAF, a training-free method that refines attention maps in multimodal large language models to produce accurate video segmentation masks without retraining.

Contribution

DecAF refines MLLM attention maps through two fusion steps, contrastive object-background fusion and complementary video-frame fusion, so that the maps can be converted into video segmentation masks without additional training.

Findings

01. DecAF outperforms existing training-free methods.

02. DecAF achieves results comparable to those of training-based methods.

03. The method suppresses irrelevant activations and highlights object regions.

Abstract

Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via an attention rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates…
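The first steps of the pipeline described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The rollout function follows the generic attention rollout recipe (mix each layer with the identity, row-normalize, multiply the layers together). The contrastive step is an assumed form (subtract a background map from an object map, clip at zero, rescale), because the abstract does not give the exact formula. All function names, the 0.5 threshold, and the toy values are hypothetical, and the video-frame fusion and SAM2 prompting stages are not covered.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def attention_rollout(layers):
    """Generic attention rollout over a list of square attention matrices.

    Each layer is averaged with the identity (to account for the residual
    path), row-normalized, and multiplied onto the running product.
    """
    n = len(layers[0])
    rollout = [[float(i == j) for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [[0.5 * attn[i][j] + 0.5 * (i == j) for j in range(n)]
                 for i in range(n)]
        mixed = [[v / sum(row) for v in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout


def min_max(values):
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def contrastive_fusion(object_map, background_map):
    """ASSUMPTION: object-minus-background difference, clipped and rescaled.

    The paper's actual fusion rule may differ; this only illustrates the idea
    of suppressing activations that also respond to a background query.
    """
    diff = [max(o - b, 0.0) for o, b in zip(object_map, background_map)]
    return min_max(diff)


def to_coarse_mask(score_map, threshold=0.5):
    """Binarize a fused score map (threshold value is hypothetical)."""
    return [1 if s >= threshold else 0 for s in score_map]


# Toy example: attention over four visual tokens for an object query
# and a background query (values are made up for illustration).
object_map = [0.1, 0.6, 0.7, 0.2]
background_map = [0.3, 0.1, 0.1, 0.4]
fused = contrastive_fusion(object_map, background_map)
print(to_coarse_mask(fused))  # -> [0, 1, 1, 0]
```

In this toy case the two tokens that respond more strongly to the background query than to the object query are zeroed out by the clipping step, which is the kind of suppression the abstract attributes to the contrastive fusion.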

Code & Models

Repositories

hyunjs/DecAF (GitHub)

Videos

Decomposed Attention Fusion in MLLMs for Training-free Video Reasoning Segmentation (SlidesLive)