TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs
Zhehan Kan, Yanlin Liu, Kun Yin, Xinghua Jiang, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Qingmin Liao, Wenming Yang

TL;DR
TACO is a reinforcement learning algorithm that enhances visual reasoning in large vision-language models by ensuring answer consistency, stabilizing long-chain reasoning, and improving data efficiency through adaptive strategies.
Contribution
It introduces Think-Answer Consistency, Rollback Resample Strategy, and adaptive learning schedules to improve reasoning stability and data efficiency in LVLMs.
Findings
Significant performance improvements on REC and VQA benchmarks.
Enhanced stability and reasoning accuracy in long-chain exploration.
Improved data efficiency through adaptive sampling strategies.
Abstract
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). While recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings, they face limitations, including inconsistencies between reasoning and final answers, model instability and crashes during long-chain exploration, and low data learning efficiency. To address these challenges, we propose TACO, a novel reinforcement learning algorithm for visual reasoning. Building on Group Relative Policy Optimization (GRPO), TACO introduces Think-Answer Consistency, which tightly couples reasoning with answer consistency to ensure answers are grounded in thoughtful reasoning. We also introduce the Rollback Resample Strategy, which adaptively removes problematic samples and reintroduces them to the sampler, enabling stable long-chain exploration and future learning…
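Since the abstract builds on GRPO, a minimal sketch of GRPO's group-relative advantage may help: each sampled response is scored against the mean and standard deviation of the rewards in its own sampling group. This is not TACO's code. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumed constant) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every rollout earns the same reward yields zero advantages.
flat = group_relative_advantages([1.0, 1.0])
```

In this formulation a group whose rollouts all receive the same reward contributes no gradient signal, which is one reason the choice of which samples to draw can affect data efficiency.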
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The idea of enhancing RL training stability and robustness through MKS and ADS is practically useful. 2. This paper supports its claims regarding MKS and ADS with empirical evidence, including experimental logs during training in Fig. 3 and performance on various REC (Tab. 1) and VQA (Tab. 3 and 5) datasets.
[Major Weakness] 1. Although the paper claims to target ineffective reasoning, the TAC design feels less novel for VQA (since it only directly uses an LVLM as a reward model), and the TAC design for REC, especially how it captures reasoning consistency, remains vague (elaborated further in point 3), weakening the conceptual appeal of the method. 2. The proposed MKS and ADS components are general RL stabilization strategies, not strictly tied to GRPO, and it remains unclear whether thei
1. The paper is well-organized, and the writing is clear. 2. The experiments include a variety of datasets, such as Video VQA, VQA, and REC, which cover a wide range of multimodal scenarios. 3. The main point of this paper is intriguing; it introduces the concept of consistency supervision between thinking and answering, which I find appealing. It would be even better to provide dense rewards at the semantic level or within a continuous space, rather than relying solely on rule-based answer re
1. The paper claims that three failure modes arise from a systemic breakdown in consistency across semantic, optimization, and learning levels; however, unless I am missing something, the paper lacks the motivation and empirical evidence to support this perspective. 2. The paper focuses solely on comparing against one RL method, GRPO, and proposes improvements to this approach. However, many other RL methods exist, including classic techniques like DPO and PPO, as well as GRPO variants such as DAP
1. The paper is well-organized. 2. The paper demonstrates that the TACO method is effective.
**Writing-wise:** 1. The paper does not provide sufficient justification early on for the assumption that accuracy equates to stability. 2. The paper devotes considerable space to demonstrating that TACO improves accuracy, but offers insufficient discussion on stability, the motivation highlighted in the introduction. **Method-wise:** 1. The paper proposes three techniques based on three forms of consistency. However, as shown in Figure 3, the TAC mechanism actually increases training instab
Well-motivated problem formulation: The paper clearly articulates three distinct failure modes in LVLM training and frames them as consistency failures across semantic, optimization, and learning levels. This unified perspective is valuable. Comprehensive experimental validation: The evaluation spans diverse tasks (REC, VQA, Video VQA) and includes both in-domain and out-of-domain benchmarks, demonstrating broad applicability. Thorough ablation studies: Table 7 and Table 8 provide useful ablatio
Limited technical novelty: While the combination is novel, the individual components lack significant innovation: TAC for REC is simply the IoU of three bounding boxes, a straightforward geometric constraint; MKS resembles standard experience replay with adaptive thresholding; ADS is curriculum learning with fixed percentile thresholds. The paper would benefit from clearer articulation of what is technically novel beyond the combination. Circular dependency on external supervisor: For VQA tasks
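The last review above characterizes TAC for REC as an IoU over three bounding boxes. The excerpt does not say which three boxes are compared or how the pairwise scores are combined, so the sketch below shows only the standard pairwise intersection-over-union that such a reward would rest on. The function name `box_iou` and the `(x1, y1, x2, y2)` corner format are assumptions for illustration.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes.

    Each box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2
    (an assumed format; the page does not specify one).
    """
    # Corners of the overlap rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate boxes with zero total area get an IoU of 0.
    return inter / union if union > 0 else 0.0


# Two hypothetical 2x2 boxes that overlap in a 1x1 square.
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Here the two boxes share an area of 1 out of a union of 7, so `overlap` is 1/7. Identical boxes score 1.0 and disjoint boxes score 0.0.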
Taxonomy
Topics: Logic, Reasoning, and Knowledge · Multi-Agent Systems and Negotiation · Semantic Web and Ontologies
