Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

Kaixin Zhang; Xiaohe Li; Jiahao Li; Haohua Wu; Xinyu Zhao; Zide Fan; Lei Wang

arXiv:2603.15008·cs.CV·March 17, 2026

TL;DR

This paper introduces ClueNet, a novel framework that enhances video reasoning in multi-modal large language models by explicitly extracting and utilizing visual clues, leading to improved accuracy, interpretability, and efficiency in VideoQA tasks.

Contribution

ClueNet is a two-stage supervised fine-tuning framework that explicitly aligns visual clue extraction with reasoning, addressing hallucinations and interpretability issues in video question answering.

Findings

1. Outperforms state-of-the-art methods by at least 1.1% on multiple datasets.

2. Reduces hallucinations and improves interpretability in VideoQA.

3. Improves inference efficiency and generalizes across models.

Abstract

Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning