From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought

Wentao Tan; Qiong Cao; Yibing Zhan; Chao Xue; Changxing Ding

arXiv:2507.02984 · cs.CL · July 29, 2025

Open Access

TL;DR

The paper introduces SMART, a self-aligning framework for multimodal reasoning that automatically generates high-quality rationales, both positive and negative, to improve model robustness and reasoning ability beyond what manual annotation methods achieve.

Contribution

SMART employs answer-oriented chain-of-thought prompts to automatically generate positive and negative rationales, enhancing reasoning and generalization in multimodal large language models.

Findings

01

Models trained on AoT-generated data outperform models trained on manually annotated data.

02

SMART improves reasoning across various model architectures.

03

Negative rationales boost model robustness.

Abstract

Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methods primarily focus on synthesizing positive rationales, typically relying on manual annotations or complex systems. Moreover, they often overlook negative reasoning, which limits the model's generalization ability and robustness in multimodal inference. To address this gap, we propose a novel framework: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought (SMART). SMART employs an answer-oriented chain-of-thought (AoT) prompt to automatically construct high-quality data. Drawing inspiration from human proof-based strategies, AoT leverages both correct and incorrect answers to extract key visual information that links questions and answers. When provided with correct answers, the model produces strong…
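
As a rough illustration of the answer-oriented idea described in the abstract, the sketch below conditions a prompt on each candidate answer and labels the resulting rationale request as positive (correct answer) or negative (incorrect answer). The `Sample` schema, the `AOT_TEMPLATE` wording, and the `build_aot_prompts` helper are all hypothetical names introduced for this example; the paper's actual prompt and pipeline are not shown on this page, so this is only a minimal sketch under those assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One multiple-choice visual question (hypothetical schema, not from the paper)."""
    question: str
    options: list[str]
    answer_index: int  # index of the correct option


# Hypothetical prompt wording; the paper's real AoT prompt may differ.
AOT_TEMPLATE = (
    "Question: {question}\n"
    "Given answer: {answer}\n"
    "Describe the visual evidence in the image that links the question "
    "to the given answer, then explain the reasoning step by step."
)


def build_aot_prompts(sample: Sample) -> list[dict]:
    """Return one answer-conditioned prompt per option, tagged by polarity."""
    prompts = []
    for i, option in enumerate(sample.options):
        prompts.append({
            "prompt": AOT_TEMPLATE.format(question=sample.question, answer=option),
            # Correct answer -> positive rationale; any other option -> negative.
            "polarity": "positive" if i == sample.answer_index else "negative",
        })
    return prompts


sample = Sample(
    question="What color is the traffic light?",
    options=["red", "green"],
    answer_index=0,
)
for item in build_aot_prompts(sample):
    # Print the polarity and the answer line of each prompt.
    print(item["polarity"], "->", item["prompt"].splitlines()[1])
```

In a full pipeline, each prompt would be sent to the multimodal model together with the image, and the positive and negative rationales it returns would be collected as training data; that step is omitted here because it depends on details not given on this page.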

Taxonomy

Topics: Semantic Web and Ontologies