TL;DR
G$^2$TR is a generation-guided visual token reduction method that improves the efficiency of separate-encoder UMMs without sacrificing reasoning or editing capabilities.
Contribution
It introduces a training-free, generation-guided token reduction framework that preserves model capabilities while significantly reducing inference costs.
Findings
Reduces visual tokens and computation by 1.94x
Maintains reasoning accuracy and editing quality
Outperforms baseline methods on benchmarks
Abstract
The development of separate-encoder unified multimodal models (UMMs) comes with rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction to improve the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on signals such as attention scores or text-image similarity, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's image-editing capabilities. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant…
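As a rough illustration of what score-based visual token reduction looks like in general, the sketch below keeps the highest-scoring tokens for a given reduction factor. It is not the paper's method: the per-token `scores` are a hypothetical stand-in (the generation-guided signal is not specified in this excerpt), and the function name `keep_top_tokens` is an assumption made for the example.

```python
from typing import List, Sequence


def keep_top_tokens(scores: Sequence[float], reduction: float) -> List[int]:
    """Return indices of the tokens to keep, in their original order.

    scores: one hypothetical importance score per visual token.
    reduction: target reduction factor (1.94 keeps about 1/1.94 of the tokens).
    """
    if reduction < 1.0:
        raise ValueError("reduction factor must be >= 1")
    n = len(scores)
    if n == 0:
        return []
    # Number of tokens kept; always keep at least one.
    k = max(1, round(n / reduction))
    # Rank token indices by score, highest first (ties go to the lower index).
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    # Restore positional order so the kept sequence preserves token layout.
    return sorted(ranked[:k])


# Placeholder scores for eight visual tokens (illustrative values only).
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6]
kept = keep_top_tokens(scores, reduction=2.0)
print(kept)
```

With a reduction factor of 1.94, a sequence of 1000 visual tokens would be cut to roughly 515 under this scheme. Which tokens survive depends entirely on the scoring signal, which is where the abstract says attention-based and generation-guided approaches differ.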
