Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?

Jie Zhu; Yiyang Su; Xiaoming Liu

arXiv:2601.06993·cs.CV·April 14, 2026

TL;DR

This paper investigates how textual reasoning affects multi-modal large language models (MLLMs) on fine-grained visual classification (FGVC). It finds that longer reasoning lowers accuracy and proposes methods to mitigate the drop.

Contribution

It identifies the 'Cost of Thinking' phenomenon and introduces the MRN normalization method and the ReFine-RFT framework to improve FGVC performance.

Findings

01

Longer textual reasoning reduces classification accuracy.

02

The proposed ReFine-RFT framework achieves state-of-the-art results on FGVC benchmarks.

03

MRN effectively balances heterogeneous reward signals.
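
To illustrate why heterogeneous reward signals need balancing at all, here is a minimal sketch of one generic approach: standardize each reward across a group of sampled responses before summing. This is not the paper's definition of MRN, which this page does not give. The reward names (`accuracy`, `length_penalty`), the scores, and the helper functions are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def standardize(values, eps=1e-8):
    """Z-score a list of rewards; a constant list maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combine_rewards(reward_table):
    """Standardize each reward across the group, then sum per response.

    reward_table maps a reward name to one score per sampled response.
    """
    normalized = [standardize(scores) for scores in reward_table.values()]
    return [sum(column) for column in zip(*normalized)]


# Hypothetical scores for four sampled responses to the same image.
rewards = {
    "accuracy": [1.0, 0.0, 0.0, 1.0],                  # binary correctness
    "length_penalty": [-120.0, -40.0, -300.0, -60.0],  # negative token count
}

# Naive sum: the large-magnitude length term dominates the 0/1 accuracy term.
raw_sum = [sum(column) for column in zip(*rewards.values())]

# Standardized sum: both signals contribute on a comparable scale.
balanced = combine_rewards(rewards)

print("best by raw sum:     ", raw_sum.index(max(raw_sum)))
print("best by balanced sum:", balanced.index(max(balanced)))
```

With these made-up numbers, the raw sum ranks the shortest response highest even though it is incorrect, while the standardized sum prefers a response that is both correct and short. How the paper's MRN actually normalizes its rewards may differ from this sketch.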

Abstract

Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by the reasoning length,…
