Perception Tokens Enhance Visual Reasoning in Multimodal Language Models

Mahtab Bigverdi; Zelun Luo; Cheng-Yu Hsieh; Ethan Shen; Dongping Chen; Linda G. Shapiro; Ranjay Krishna

arXiv:2412.03548·cs.CV·December 10, 2024

TL;DR

This paper introduces Perception Tokens and a training method called AURORA to enhance multimodal language models' ability to perform visual reasoning tasks by incorporating intrinsic image representations.

Contribution

The paper proposes Perception Tokens and AURORA, a novel training approach that improves visual reasoning in MLMs without extensive finetuning or reliance on external vision tools.

Findings

01 +10.8% on BLINK counting benchmark

02 +11.3% on CVBench

03 +6% on relative depth tasks

Abstract

Multimodal language models (MLMs) still face challenges in fundamental visual perception tasks where specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and reasoning about 2D object instances benefits from object detection. Yet MLMs cannot produce intermediate depth maps or bounding boxes to reason over. Finetuning MLMs on relevant data doesn't generalize well, and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it…
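To make the idea of "generating a depth map as tokens" concrete, here is a minimal Python sketch. It is an illustration only, not the paper's method: the excerpt above does not describe how AURORA tokenizes depth, so the uniform binning, the function names `depth_to_tokens` and `build_reasoning_prefix`, and the token strings `<DEPTH_k>`, `<DEPTH_START>`, and `<DEPTH_END>` are all assumptions introduced for this example.

```python
from typing import List


def depth_to_tokens(depth: List[List[float]], num_bins: int = 8) -> List[str]:
    """Uniformly quantize a 2D depth grid into hypothetical token strings.

    Assumption: uniform binning stands in for whatever tokenizer the
    paper actually uses; it is NOT taken from the source.
    """
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    tokens = []
    for d in flat:
        # Map each depth value to an integer bin in [0, num_bins - 1].
        idx = min(int((d - lo) / span * num_bins), num_bins - 1)
        tokens.append(f"<DEPTH_{idx}>")
    return tokens


def build_reasoning_prefix(depth: List[List[float]], num_bins: int = 8) -> str:
    """Wrap the depth tokens in hypothetical start/end markers.

    The result plays the role of an intermediate, chain-of-thought-like
    segment that would precede the model's final answer.
    """
    body = "".join(depth_to_tokens(depth, num_bins))
    return "<DEPTH_START>" + body + "<DEPTH_END>"


if __name__ == "__main__":
    toy_depth = [[0.0, 1.0], [2.0, 4.0]]
    print(build_reasoning_prefix(toy_depth, num_bins=4))
```

In this toy setup the token sequence is produced from a known depth grid. In the setting the abstract describes, the model itself would emit such tokens as an intermediate step before answering.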

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems