LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan

TL;DR
This paper introduces a comprehensive framework for evaluating and improving step-by-step visual reasoning in large language models, including a new benchmark, a novel metric, and a multimodal model called LlamaV-o1.
Contribution
It presents a new visual reasoning benchmark, a step-level reasoning quality metric, and a multimodal model trained with curriculum learning for enhanced multi-step reasoning.
Findings
LlamaV-o1 outperforms existing open-source models.
LlamaV-o1 achieves an average score of 67.3, an absolute gain of 3.8% over recent models.
LlamaV-o1 is 5 times faster during inference.
Abstract
Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large language models (LLMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges with eight different categories ranging from complex visual perception to scientific reasoning with over 4k reasoning steps in total, enabling robust evaluation of LLMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a…
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Biomedical Text Mining and Ontologies
Methods: Sparse Evolutionary Training
