Video-KTR: Reinforcing Video Reasoning via Key Token Attribution
Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, Xudong Jiang

TL;DR
Video-KTR introduces a fine-grained, token-level reinforcement learning framework that enhances video reasoning accuracy and interpretability by focusing on key visual, temporal, and uncertain tokens, achieving state-of-the-art results.
Contribution
It presents a novel modality-aware policy shaping method that selectively reinforces informative tokens based on multiple attribution signals for improved video reasoning.
Findings
Achieves 42.7% on the Video-Holmes benchmark, surpassing GPT-4o.
Demonstrates consistent improvements across multiple reasoning and understanding tasks.
Ablation studies confirm the effectiveness of combined attribution signals.
Abstract
Reinforcement learning (RL) has shown strong potential for enhancing reasoning in multimodal large language models, yet existing video reasoning methods often rely on coarse sequence-level rewards or single-factor token selection, neglecting fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, limiting both accuracy and interpretability. We propose Video-KTR, a modality-aware policy shaping framework that performs selective, token-level RL by combining three attribution signals: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. By reinforcing only these key tokens, Video-KTR focuses learning on semantically informative, modality-sensitive content while filtering…
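The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes that per-token log-probabilities are already available for three forward passes (original video, visually masked video, frame-shuffled video) along with per-token entropies. The function names, the top-fraction rule, the default `fraction=0.2`, and the union of the three sets are all hypothetical choices made for illustration; the source only states that the three signals are combined and that only key tokens are reinforced.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_fraction(scores, fraction):
    """Indices of the highest-scoring tokens; at least one token is kept."""
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def select_key_tokens(logp_full, logp_masked, logp_shuffled, entropies,
                      fraction=0.2):
    """Return a 0/1 mask over response tokens (hypothetical selection rule).

    logp_full:     per-token log-probs given the original video
    logp_masked:   per-token log-probs with visual input masked (counterfactual)
    logp_shuffled: per-token log-probs with frames shuffled
    entropies:     per-token predictive entropy under the original video
    """
    # Visual dependence: how much a token's log-prob drops without the visuals.
    visual = [f - m for f, m in zip(logp_full, logp_masked)]
    # Temporal sensitivity: how much it drops when frame order is destroyed.
    temporal = [f - s for f, s in zip(logp_full, logp_shuffled)]
    # Assumption: the three signals are combined by taking the union.
    keep = (top_fraction(visual, fraction)
            | top_fraction(temporal, fraction)
            | top_fraction(entropies, fraction))
    return [1 if i in keep else 0 for i in range(len(logp_full))]

def masked_policy_loss(logp_full, advantage, mask):
    """Policy-gradient surrogate averaged over the selected tokens only."""
    selected = sum(mask)
    total = sum(m * advantage * lp for m, lp in zip(mask, logp_full))
    return -total / selected
```

In this sketch, tokens outside the mask contribute nothing to the loss, so the sequence-level advantage only shapes the probabilities of the selected tokens. A real GRPO-style objective would also include ratio clipping and possibly a KL term, which are omitted here.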
Peer Reviews
Decision·ICLR 2026 Poster
1. The general idea is simple yet insightful and effective: the model reinforces three types of tokens that are crucial to video reasoning, yielding performance gains on 5 video reasoning benchmarks. 2. A clear ablation study over all combinations of the three token types shows which type matters most and how each type contributes to the performance gain. 3. The 7 research questions are insightful. For example, by using the same dataset and training recipe, they made
1. The performance gain over vanilla GRPO from the three token types is small. With all three types enabled, the average gain across three benchmarks is just 2.4%, while other ablations show even smaller differences. The results therefore suggest that the effectiveness of Video-KTR is limited.
1. The paper explores which tokens are most important to optimize in order to enhance the reasoning process. This is a cheap and useful technique in general. 2. The paper is well-structured and clearly written, presenting the background, proposed methods, and experimental setup clearly. 3. The results seem promising, as the token selection method yields better performance on reasoning-centric tasks such as Video-Holmes. Ablation studies and various plots support the findings.
1. I am a bit concerned regarding the theoretical foundation of this work. What is the actual contribution of the tokens that have been masked? For example, would these tokens introduce unwanted noise to the gradient w.r.t. the logits? A theoretical derivation should be provided that examines the actual influence of the mask on the loss when taking the gradient w.r.t. the logits, or even deeper in the network. Moreover, tokens that are visual-aware or temporal-aware, does
Clear explanations and good experiments. The extent of the empirical investigation is commendable, though I am not an expert and so cannot comment on the strength of the experiments.
1) The comparison to closed-source models uses models from almost two generations ago; including the latest ones would give a better understanding of where things stand. 2) While the heuristics work well for the task at hand, given the history of DL and my experience, such heuristics tend to stop helping in general-purpose cases and as we scale up. It is a strong result for now in a narrow domain, but a broader investigation is what will establish the proposed techniques in modern RL pipelines. I
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
