VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, Zhongyu Wei

TL;DR
VoCoT introduces a multi-step, object-centric reasoning framework for large multi-modal models, significantly enhancing their ability to perform complex visual reasoning tasks by bridging modality gaps through grounded object representations.
Contribution
This work presents VoCoT, a novel multi-step reasoning framework that improves multi-modal models' complex reasoning by integrating object-centric, visually grounded representations.
Findings
VolCano, a 7B-parameter model, outperforms SOTA models on the CLEVR and EmbSpatial benchmarks.
VoCoT effectively bridges modality gaps in multi-modal reasoning tasks.
The approach demonstrates strong performance with limited input resolution.
Abstract
While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness in handling complex tasks has been limited by the prevailing single-step reasoning paradigm. To this end, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. To adapt LMMs in reasoning with VoCoT, we further construct an instruction-tuning dataset. By combining VoCoT with the prevalent open-source LMM architectures, we develop a VoCoT-based model,…
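To make the idea of an object-centric, visually grounded reasoning step more concrete, the following is a minimal sketch of how text and grounded object mentions could be interleaved in one step. It is an illustration under stated assumptions, not the paper's actual format: the names GroundedObject and render_step, the normalized (x1, y1, x2, y2) box convention, and the bracketed text rendering are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class GroundedObject:
    """An object mention tied to an image region (hypothetical structure)."""
    phrase: str
    # Assumed convention: (x1, y1, x2, y2), normalized to [0, 1].
    box: Tuple[float, float, float, float]

    def __post_init__(self) -> None:
        x1, y1, x2, y2 = self.box
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            raise ValueError(f"invalid normalized box: {self.box}")


# A reasoning step is a sequence of plain text and grounded object mentions.
Segment = Union[str, GroundedObject]


def render_step(segments: List[Segment]) -> str:
    """Serialize one step, writing each object's box right after its phrase."""
    parts = []
    for seg in segments:
        if isinstance(seg, GroundedObject):
            coords = ", ".join(f"{v:.2f}" for v in seg.box)
            parts.append(f"{seg.phrase} [{coords}]")
        else:
            parts.append(seg)
    return " ".join(parts)


# Illustrative data only; the objects and coordinates are made up.
step = [
    "The",
    GroundedObject("red cube", (0.10, 0.20, 0.30, 0.45)),
    "is left of the",
    GroundedObject("blue sphere", (0.55, 0.25, 0.75, 0.50)),
]
print(render_step(step))
```

Keeping the box attached to each mention means every later step can refer back to the same image region instead of relying on the text description alone, which is the intuition behind grounding each object concept.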
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling
