TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

Alexandros Stergiou

arXiv:2511.18359·cs.CV·April 8, 2026

TL;DR

This paper introduces TRANSPORTER, a novel method that uses optimal transport to generate videos reflecting VLM predictions, enhancing interpretability of complex video understanding models.

Contribution

The paper presents a new logits-to-video task and a model-independent approach, TRANSPORTER, for visualizing and understanding VLMs' internal reasoning processes.

Findings

01. TRANSPORTER effectively generates videos that mirror caption attribute changes.

02. Quantitative evaluations show improved interpretability of VLM predictions.

03. Qualitative results demonstrate high-fidelity, semantically meaningful video generation.

Abstract

How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high visual fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLMs' high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context.…
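
To make the idea of an optimal transport coupling between two embedding spaces concrete, the following is a minimal sketch rather than the paper's implementation. It assumes both embedding sets have already been projected to a common dimension, uses entropic regularisation (Sinkhorn iterations) with uniform marginals, and turns logits into a conditioning vector with a plain softmax-weighted average. The function names `sinkhorn_coupling`, `barycentric_map`, and `logit_conditioning`, the parameters `eps` and `n_iters`, and the random stand-in embeddings are all hypothetical choices made for illustration.

```python
import numpy as np

def sinkhorn_coupling(X, Y, eps=0.1, n_iters=200):
    """Entropic OT coupling between rows of X (n, d) and Y (m, d).

    Assumption: uniform marginals and a squared Euclidean cost; the
    paper may use a different cost, regulariser, or solver.
    """
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # (n, m) cost matrix
    C = C / C.max()                                     # rescale for stability
    K = np.exp(-C / eps)                                # Gibbs kernel
    a = np.full(X.shape[0], 1.0 / X.shape[0])           # source marginal
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])           # target marginal
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):                            # alternate marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                  # coupling matrix P

def barycentric_map(P, Y):
    """Send each source point to the coupling-weighted mean of the targets."""
    return (P @ Y) / P.sum(axis=1, keepdims=True)

def logit_conditioning(logits, mapped):
    """Hypothetical step: softmax the logits and average the mapped embeddings."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return w @ mapped

# Random stand-ins for VLM-side and T2V-side embeddings (illustrative only).
rng = np.random.default_rng(0)
vlm_emb = rng.normal(size=(5, 8))    # 5 candidate captions, dimension 8
t2v_emb = rng.normal(size=(7, 8))    # 7 T2V conditioning vectors, dimension 8
P = sinkhorn_coupling(vlm_emb, t2v_emb)
mapped = barycentric_map(P, t2v_emb)
cond = logit_conditioning(np.array([0.1, 2.0, -1.0, 0.5, 0.0]), mapped)
```

In this sketch `cond` stands in for the vector that would condition a T2V generator. How TRANSPORTER actually derives embedding directions from logit scores is not specified in the abstract, so that last step in particular should be read as a placeholder.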
