DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition
Raja Kumar, Arka Sadhu, Ram Nevatia

TL;DR
DiVE-k introduces a differential reasoning framework using top-k predictions and reinforcement learning to enhance fine-grained image recognition, outperforming existing methods across multiple datasets.
Contribution
It presents a novel training approach leveraging top-k model outputs for differential reasoning, improving generalization and reducing memorization in fine-grained recognition.
Findings
Significant performance improvements on five fine-grained datasets.
Outperforms existing methods by over 10% in base-to-novel generalization.
Effective in mixed-domain and few-shot scenarios.
Abstract
Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes. To address this, we propose DiVE-k, Differential Visual Reasoning using top-k generations, a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform…
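The multiple-choice construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `build_mcq` and `mcq_reward`, the `num_options` parameter, the case-insensitive deduplication, and the choice to always place the ground truth among the options are assumptions made for illustration only.

```python
import random


def build_mcq(top_k_predictions, ground_truth, num_options=4, seed=0):
    """Build a multiple-choice question from a model's top-k predictions.

    Hypothetical helper (not from the paper): wrong top-k predictions serve
    as distractors, and the ground truth is always included as one option.
    """
    rng = random.Random(seed)
    target = ground_truth.strip().lower()

    # Keep rank order, drop duplicates and any prediction equal to the label.
    seen = set()
    distractors = []
    for prediction in top_k_predictions:
        key = prediction.strip().lower()
        if key != target and key not in seen:
            seen.add(key)
            distractors.append(prediction.strip())

    # Assumption: the ground truth is added even if the model never predicted it.
    options = distractors[: num_options - 1] + [ground_truth]
    rng.shuffle(options)
    return options, options.index(ground_truth)


def mcq_reward(predicted_index, answer_index):
    """Binary reward: 1.0 when the selected option index is correct, else 0.0."""
    return 1.0 if predicted_index == answer_index else 0.0


if __name__ == "__main__":
    predictions = ["Blue Jay", "Steller's Jay", "Florida Scrub-Jay", "Indigo Bunting"]
    options, answer = build_mcq(predictions, "Blue Jay")
    for i, option in enumerate(options):
        print(f"({chr(ord('A') + i)}) {option}")
    print("reward for the correct choice:", mcq_reward(answer, answer))
```

Because the distractors come from the model's own predictions, they tend to be visually or semantically close to the true class. The reward only checks the selected index, so it does not depend on exact string matching of a free-form answer.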
Peer Reviews
Decision: ICLR 2026 Poster
1. It converts the open-world classification task into a closed-world classification task, which is a promising way to address the brittleness of exact string-match rewards. 2. The evaluation metric is reasonable for the fine-grained classification task. Previous work typically uses string matching to evaluate accuracy, while this paper uses an LLM to determine whether the prediction and the ground truth belong to the same fine-grained category.
1. A direct way to construct the hypothesis set is to select the top-k most similar categories by CLIP text features, and the advantage of the proposed offline option mining lacks experimental support. 2. Since the framework uses a two-step pipeline with chain-of-thought, it incurs additional computational cost because it requires two forward passes. 3. The final accuracy depends heavily on the recall of the first inference step, which is not presented in the experimental results.
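The first-step recall mentioned in point 3 could be measured along the following lines. This is a sketch under stated assumptions: `recall_at_k` is a hypothetical helper, and case-insensitive string equality stands in for whatever matching criterion the paper actually uses (the review above notes that the paper relies on an LLM judge for evaluation).

```python
def recall_at_k(top_k_lists, ground_truths, k):
    """Fraction of images whose ground-truth label appears in the top-k predictions.

    Assumption: labels are compared by case-insensitive exact match, which is
    a simplification chosen only to keep the example self-contained.
    """
    if not ground_truths:
        return 0.0
    hits = 0
    for predictions, label in zip(top_k_lists, ground_truths):
        candidates = {p.strip().lower() for p in predictions[:k]}
        if label.strip().lower() in candidates:
            hits += 1
    return hits / len(ground_truths)


if __name__ == "__main__":
    first_step_outputs = [
        ["Steller's Jay", "Blue Jay", "Indigo Bunting"],
        ["House Sparrow", "Song Sparrow", "Chipping Sparrow"],
    ]
    labels = ["Blue Jay", "Field Sparrow"]
    print(recall_at_k(first_step_outputs, labels, k=3))
```

In a two-step pipeline, the second step can only select the correct class if the first step proposed it, so this recall bounds the final accuracy from above.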
- **Formulation**: DiVE-k formulates the task using the base model's top-k predictions as a source of hard-negative examples to construct the MCQ, and leverages the model's reasoning ability with a differentiable reward system, making it a highly effective training method. - **Differentiable Reasoning**: The MCQ format inherently encourages the model to focus on attribute-level discriminative reasoning, which is beneficial for semantically similar concepts. - **Reward**: Simple reward based on MCQ index se
- **Sensitivity to rollouts and MCQ size**: The performance of DiVE-k relies heavily on K (number of rollouts) and m (size of the final MCQ). Although the paper mentions some processing to maintain consistency, there is no ablation on how variations in K and m affect the quality of the negative set and final performance. - **Failure cases**: There's no systematic analysis/discussion of when this differentiable reasoning could fail, because the reasoning in the next step relies on base model capac
The primary strength of this paper is its strong empirical performance. The proposed DiVE-k method achieves significant improvements in base-to-novel generalization, mixed-domain, and few-shot settings, as shown in Tables 1, 2, and 3. The qualitative examples in Figures 4 and 9 are also compelling, illustrating that the model learns to perform more detailed differential reasoning when forced to choose from a set of plausible, similar options, which is a key challenge in FGVR.
Despite the strong results, I have major concerns about the methodological choices and novelty of this work. * Limited Novelty: The proposed method's novelty is marginal. At its core, it is a two-step process: 1) an offline data curation step that converts an open-ended generation task into a multiple-choice classification task, and 2) a standard RL training step (GRPO) on this new task. The "differential reasoning" appears to be a direct consequence of this prompt reformatting (from open-ended
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
