G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

Junxian Li; Kai Liu; Zizhong Ding; Zhixin Wang; Zhikai Chen; Renjing Pei; Yulun Zhang

arXiv:2605.12309·cs.CV·May 18, 2026

TL;DR

G$^2$TR is a generation-guided visual token reduction method that improves the efficiency of separate-encoder unified multimodal models (UMMs) without sacrificing reasoning or editing capabilities.

Contribution

It introduces a training-free, generation-guided token reduction framework that preserves model capabilities while significantly reducing inference costs.

Findings

01. Reduces visual tokens and computation by 1.94x
02. Maintains reasoning accuracy and editing quality
03. Outperforms baseline methods on benchmarks

Abstract

The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on signals such as attention scores or text-image similarity, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant…
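To make "visual token reduction" concrete, the sketch below shows generic score-based top-k selection, the pattern the abstract attributes to existing attention- or similarity-based methods. It is not the G$^2$TR algorithm: the abstract is truncated before the scoring rule is described, so the function name `reduce_tokens`, the `scores` input, and the `keep_ratio` parameter are all hypothetical. The last line assumes the reported 1.94x factor applies directly to the visual token count, which the listing does not confirm.

```python
def reduce_tokens(scores, keep_ratio):
    """Generic top-k token pruning (illustrative; not the paper's method).

    scores     -- one importance value per visual token (hypothetical input;
                  could come from attention, similarity, or another signal)
    keep_ratio -- fraction of tokens to keep, in (0, 1]
    Returns the indices of the kept tokens in their original order.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the survivors so the original token order is preserved.
    return sorted(ranked[:k])


# Toy example: keep half of six tokens.
kept = reduce_tokens([0.1, 0.9, 0.4, 0.7, 0.2, 0.8], keep_ratio=0.5)

# Assumption: a 1.94x reduction in token count means keeping 1/1.94 of them.
kept_of_1000 = len(reduce_tokens([float(i) for i in range(1000)], 1 / 1.94))
```

Under these assumptions, what distinguishes one reduction method from another is where `scores` comes from. The selection step itself stays the same.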

Code & Models

Repositories

lijunxian111/G2TR (GitHub)
