TL;DR
This paper investigates how textual reasoning affects multi-modal large language models (MLLMs) in fine-grained visual classification (FGVC), showing that longer reasoning lowers accuracy and proposing methods to mitigate the effect.
Contribution
It identifies the 'Cost of Thinking' phenomenon and introduces MRN normalization and the ReFine-RFT framework to improve FGVC performance.
Findings
Longer textual reasoning reduces classification accuracy.
The proposed ReFine-RFT framework achieves state-of-the-art results on FGVC benchmarks.
MRN effectively balances heterogeneous reward signals.
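The summary does not say how MRN is defined, so the following is only a generic sketch of what balancing heterogeneous reward signals can look like, not the paper's method. It assumes several responses are sampled per prompt and that each reward is z-scored across those responses before averaging, so a reward on a large scale (here a hypothetical length penalty) cannot drown out one on a small scale (a hypothetical 0/1 accuracy reward). The function name, reward names, and values are all illustrative assumptions.

```python
from statistics import mean, pstdev


def normalize_rewards(reward_groups, eps=1e-8):
    """Z-score each reward signal across sampled responses, then average.

    reward_groups maps a reward name to a list of raw values, one per
    sampled response. This is an assumed, generic scheme, not the MRN
    definition from the paper.
    """
    normalized = {}
    for name, values in reward_groups.items():
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation
        # eps avoids division by zero when every response gets the same reward
        normalized[name] = [(v - mu) / (sigma + eps) for v in values]

    n = len(next(iter(normalized.values())))
    # Equal-weight average of the normalized signals for each response
    return [mean(normalized[name][i] for name in normalized) for i in range(n)]


# Hypothetical raw rewards for four sampled responses to one prompt
rewards = {
    "accuracy": [0.0, 1.0, 1.0, 0.0],                # 0/1 correctness
    "length_penalty": [-120.0, -40.0, -80.0, -200.0],  # negative token count
}
combined = normalize_rewards(rewards)
best = max(range(len(combined)), key=lambda i: combined[i])
```

With these made-up values, the response that is both correct and shortest receives the highest combined score, even though the raw length penalty is two orders of magnitude larger than the accuracy reward.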
Abstract
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by the reasoning length,…
