Real-Time Visual Attribution Streaming in Thinking Model

Seil Kang; Woojung Han; Junhyeok Kim; Jinyeong Kim; Youngeun Kim; Seong Jae Hwang

arXiv:2604.16587·cs.CV·April 21, 2026

TL;DR

This paper introduces a lightweight method for streaming visual attribution in real time from multimodal thinking models, so users can see which visual evidence grounds the reasoning while the model is still generating it.

Contribution

The paper proposes an amortized approach that learns to estimate the causal effects of semantic regions from attention features, giving faithful, real-time visual attribution without the repeated backward passes or perturbations that causal methods require.

Findings

01. Achieves faithfulness comparable to exhaustive causal methods

02. Enables real-time visual attribution streaming in diverse models

03. Works across five benchmarks and four thinking models

Abstract

We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps offer instant access but lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our…
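The amortization idea in the abstract can be illustrated with a toy sketch. This is not the authors' method: it assumes a plain linear estimator, synthetic feature vectors standing in for per-region attention features, and a made-up `expensive_causal_effect` function standing in for perturbation-based measurement. All names, sizes, and hyperparameters are hypothetical. The point is only the split between a costly offline phase that produces training targets and a cheap online phase that scores regions at every reasoning step.

```python
import random

random.seed(0)
DIM = 4  # hypothetical size of a per-region attention feature vector

# Hidden weights standing in for the true feature-to-effect relationship.
# In a real setting the targets would come from costly perturbation runs.
HIDDEN_W = [0.8, -0.3, 0.5, 0.1]


def expensive_causal_effect(feats):
    """Toy stand-in for an exhaustive perturbation-based measurement."""
    signal = sum(f * w for f, w in zip(feats, HIDDEN_W))
    return signal + random.gauss(0.0, 0.01)  # small measurement noise


def fit_linear(features, targets, lr=0.1, epochs=300):
    """Fit weights and bias by full-batch gradient descent on squared error."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, feats):
    """Cheap estimate of a region's causal effect from its features."""
    return sum(wi * xi for wi, xi in zip(w, feats)) + b


def stream_attributions(w, b, steps):
    """Yield (step index, top-scoring region index) for each reasoning step."""
    for t, regions in enumerate(steps):
        scores = [predict(w, b, r) for r in regions]
        yield t, max(range(len(scores)), key=scores.__getitem__)


# Offline: pay the expensive cost once to build training pairs.
train_x = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
train_y = [expensive_causal_effect(x) for x in train_x]
w, b = fit_linear(train_x, train_y)

# Online: three reasoning steps, five candidate regions each, no perturbation.
steps = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
         for _ in range(3)]
stream = list(stream_attributions(w, b, steps))

# Compare the cheap estimates with the noise-free toy effect.
max_err = max(
    abs(predict(w, b, r) - sum(f * hw for f, hw in zip(r, HIDDEN_W)))
    for regions in steps for r in regions
)
```

In this toy setup, `stream` holds one attribution per reasoning step, computed with a handful of multiplications instead of a fresh perturbation run. The abstract does not specify the estimator's architecture or training targets, so the linear model here is only a placeholder for whatever learned mapping the paper uses.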
