Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning

Jingjing Jiang; Chao Ma; Xurui Song; Hanwang Zhang; Jun Luo

arXiv:2507.07424 · cs.CV · July 11, 2025

Open Access

TL;DR

Corvid is a multimodal large language model (MLLM) with enhanced chain-of-thought (CoT) reasoning. It combines a hybrid vision encoder, a purpose-built CoT instruction dataset, and an inference-time self-verification strategy to improve performance on complex reasoning tasks such as mathematical and science problem-solving.

Contribution

The paper introduces Corvid, an MLLM with improved CoT reasoning, built on a hybrid vision encoder, a new instruction dataset (MCoT-Instruct-287K), two-stage fine-tuning, and a self-verification strategy at inference.

Findings

01 Outperforms existing MLLMs in reasoning tasks

02 Excels in mathematical and science problem-solving

03 Effective self-verification during inference

Abstract

Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid's CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling