CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
Jingyu Lei, Gaoang Wang, Der-Horng Lee

TL;DR
CORE introduces object-centric visual token compression using semantic segmentation and centroid-guided sorting, significantly reducing computational costs while maintaining high performance in large vision-language models.
Contribution
The paper proposes an object-centric token merging paradigm that pairs a segmentation decoder with centroid-guided sorting, improving efficiency and semantic preservation in LVLMs.
Findings
State-of-the-art on six benchmarks
Retains 97.4% of performance with only 2.2% of the visual tokens
Dramatic efficiency gains in adaptive-rate settings
Abstract
Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack a high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new…
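The two steps the abstract describes, mask-guided merging followed by centroid-guided sorting, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names (merge_tokens_by_mask, sort_by_centroid, compress) are hypothetical, merging is done by plain mean pooling, the object masks are supplied as a ready-made grid of integer ids instead of coming from a segmentation decoder, and "spatial order" is taken to mean raster order of the mask centroids.

```python
from collections import defaultdict


def merge_tokens_by_mask(tokens, mask):
    """Mean-pool patch tokens that share an object id.

    tokens: H x W grid of feature vectors (lists of floats).
    mask:   H x W grid of integer object ids, one per patch.
    Returns a list of (centroid, merged_vector) pairs, one per object id,
    where centroid is the (mean_row, mean_col) of the object's patches.
    Assumption: mean pooling stands in for whatever merge the paper uses.
    """
    sums = {}
    counts = defaultdict(int)
    row_sums = defaultdict(float)
    col_sums = defaultdict(float)
    for r, (token_row, mask_row) in enumerate(zip(tokens, mask)):
        for c, (vec, obj) in enumerate(zip(token_row, mask_row)):
            if obj not in sums:
                sums[obj] = [0.0] * len(vec)
            sums[obj] = [s + v for s, v in zip(sums[obj], vec)]
            counts[obj] += 1
            row_sums[obj] += r
            col_sums[obj] += c
    merged = []
    for obj, total in sums.items():
        n = counts[obj]
        centroid = (row_sums[obj] / n, col_sums[obj] / n)
        merged.append((centroid, [s / n for s in total]))
    return merged


def sort_by_centroid(merged):
    # Assumption: raster order, i.e. top-to-bottom, then left-to-right,
    # obtained by comparing (mean_row, mean_col) tuples.
    return sorted(merged, key=lambda item: item[0])


def compress(tokens, mask):
    """Return one merged token per object, ordered by mask centroid."""
    return [vec for _, vec in sort_by_centroid(merge_tokens_by_mask(tokens, mask))]


# Toy example: a 2 x 2 patch grid with 2-dimensional features.
tokens = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
mask = [
    [7, 7],  # object 7 covers the top row
    [3, 3],  # object 3 covers the bottom row
]
print(compress(tokens, mask))  # [[2.0, 0.0], [0.0, 3.0]]
```

In this toy case four patch tokens collapse into two object tokens, and the ordering depends on where each object sits in the image rather than on its id or on the order in which its patches were first encountered.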
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
