VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Zejun Li; Ruipu Luo; Jiwen Zhang; Minghui Qiu; Xuanjing Huang; Zhongyu Wei

arXiv:2405.16919·cs.CV·March 11, 2025

TL;DR

VoCoT introduces a multi-step, object-centric reasoning framework for large multi-modal models, significantly enhancing their ability to perform complex visual reasoning tasks by bridging modality gaps through grounded object representations.

Contribution

This work presents VoCoT, a novel multi-step reasoning framework that improves multi-modal models' complex reasoning by integrating object-centric, visually grounded representations.

Findings

01

VolCano, a 7B-parameter model, outperforms SOTA models on the CLEVR and EmbSpatial benchmarks.

02

VoCoT effectively bridges modality gaps in multi-modal reasoning tasks.

03

The approach demonstrates strong performance with limited input resolution.

Abstract

While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness in handling complex tasks has been limited by the prevailing single-step reasoning paradigm. To this end, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. To adapt LMMs in reasoning with VoCoT, we further construct an instruction-tuning dataset. By combining VoCoT with the prevalent open-source LMM architectures, we develop a VoCoT-based model,…
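
To make the idea of an object-centric, visually grounded reasoning path more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `GroundedObject` class, the `render_step` and `render_path` helpers, the normalized box format, and the bracketed text rendering are all hypothetical choices made for illustration. The abstract only states that object concepts are represented in an interleaved, grounded manner.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class GroundedObject:
    """Hypothetical object mention tied to an image region."""
    name: str
    # Assumed box format: (x1, y1, x2, y2), normalized to [0, 1].
    box: Tuple[float, float, float, float]


# A reasoning step interleaves plain text with grounded object mentions.
Segment = Union[str, GroundedObject]


def render_step(segments: List[Segment]) -> str:
    """Render one step, writing each object as 'name [x1, y1, x2, y2]'."""
    parts = []
    for seg in segments:
        if isinstance(seg, GroundedObject):
            coords = ", ".join(f"{v:.2f}" for v in seg.box)
            parts.append(f"{seg.name} [{coords}]")
        else:
            parts.append(seg)
    return "".join(parts)


def render_path(steps: List[List[Segment]]) -> str:
    """Render a multi-step reasoning path, one numbered step per line."""
    return "\n".join(
        f"Step {i}: {render_step(step)}" for i, step in enumerate(steps, start=1)
    )


# Illustrative objects and steps (made up, not taken from the paper's data).
cube = GroundedObject("the red cube", (0.10, 0.40, 0.30, 0.70))
sphere = GroundedObject("the blue sphere", (0.55, 0.35, 0.80, 0.65))
path = [
    ["Locate ", cube, " and ", sphere, "."],
    [cube, " lies to the left of ", sphere, "."],
]
print(render_path(path))
```

Each step keeps the object name and its image region together, so later steps can refer back to the same grounded object instead of relying on text alone.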

Code & Models

Repositories

rupertluo/vocot
PyTorch · Official

Videos

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models · underline

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling