ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension

Tianren Ma; Lingxi Xie; Yunjie Tian; Boyu Yang; Qixiang Ye

arXiv:2406.11327 · cs.CV · January 24, 2025

TL;DR

ClawMachine aligns vision and language in multimodal models by notating each entity with a token collective (a group of visual tokens) and by using a hybrid perception mechanism over discrete and continuous spaces. It handles scene understanding and referential tasks such as referring and grounding without extra syntax.

Contribution

It proposes a new methodology that explicitly notates entities with token collectives and unifies visual referential tasks using a joint vocabulary, enhancing efficiency and scalability.

Findings

01 Achieves superior performance on scene-level tasks

02 Demonstrates effective integration of multi-source information

03 Shows potential for complex visual reasoning

Abstract

Aligning vision and language concepts at a finer level remains an essential topic of multimodal large language models (MLLMs), particularly for tasks such as referring and grounding. Existing methods, such as proxy encoding and geometry encoding, incorporate additional syntax to encode spatial information, imposing extra burdens when communicating between language and vision modules. In this study, we propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives, groups of visual tokens that collaboratively represent higher-level semantics. A hybrid perception mechanism is also explored to perceive and understand scenes from both discrete and continuous spaces. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. By leveraging a joint vision-language vocabulary, ClawMachine further integrates…
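To make the idea of a joint vision-language vocabulary concrete, the following is a minimal sketch, not the authors' implementation. It assumes that discrete visual codes are appended to the text vocabulary by a fixed offset, so that an entity can be written into a prompt or answer as a run of visual token ids with no coordinate or placeholder syntax. The vocabulary sizes and the function names (`visual_to_joint`, `build_referring_prompt`, `extract_collectives`) are hypothetical and chosen only for illustration.

```python
from typing import List

# Hypothetical sizes; the paper's actual vocabulary and codebook may differ.
TEXT_VOCAB_SIZE = 32000
VISUAL_CODEBOOK_SIZE = 8192


def visual_to_joint(code: int) -> int:
    """Map a visual codebook index to an id in the joint vocabulary."""
    if not 0 <= code < VISUAL_CODEBOOK_SIZE:
        raise ValueError(f"visual code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def is_visual(token_id: int) -> bool:
    """Return True if a joint-vocabulary id falls in the visual range."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE


def build_referring_prompt(
    text_before: List[int], entity_codes: List[int], text_after: List[int]
) -> List[int]:
    """Splice an entity's token collective directly into a text sequence."""
    collective = [visual_to_joint(c) for c in entity_codes]
    return text_before + collective + text_after


def extract_collectives(sequence: List[int]) -> List[List[int]]:
    """Group consecutive visual ids back into lists of codebook indices."""
    collectives: List[List[int]] = []
    current: List[int] = []
    for token_id in sequence:
        if is_visual(token_id):
            current.append(token_id - TEXT_VOCAB_SIZE)
        elif current:
            collectives.append(current)
            current = []
    if current:
        collectives.append(current)
    return collectives


# Referring: the entity appears in the prompt as visual tokens.
prompt = build_referring_prompt([10, 11], [5, 7, 7], [12])

# Grounding: visual tokens found in a generated answer are read back out.
answer = [20, 32001, 32002, 21, 32003]
entities = extract_collectives(answer)
```

Under these assumptions, referring and grounding share one sequence format: the same ids serve as input when the user points at an entity and as output when the model points back.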

Code & Models

Repositories

martian422/clawmachine (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems