Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning
Kaixin Zhang, Xiaohe Li, Jiahao Li, Haohua Wu, Xinyu Zhao, Zide Fan, Lei Wang

TL;DR
This paper introduces ClueNet, a novel framework that enhances video reasoning in multi-modal large language models by explicitly extracting and utilizing visual clues, leading to improved accuracy, interpretability, and efficiency in VideoQA tasks.
Contribution
ClueNet is a two-stage supervised fine-tuning framework that explicitly aligns visual clue extraction with reasoning, addressing hallucinations and interpretability issues in video question answering.
Findings
Outperforms state-of-the-art methods by ≥ 1.1% on multiple datasets.
Reduces hallucinations and improves interpretability in VideoQA.
Enhances inference efficiency and generalization across models.
Abstract
Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter…
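To make the idea of utility-aware clue filtering concrete, here is a minimal Python sketch of one way such a step could look. It is not the ClueNet implementation: the `Clue` class, the `utility` score, the threshold rule, and the prompt layout are all hypothetical, chosen only to show candidate clues being pruned before they are passed to an answering model.

```python
from dataclasses import dataclass


@dataclass
class Clue:
    """A candidate visual clue (hypothetical structure, not from the paper)."""
    text: str
    utility: float  # assumed score in [0, 1]; how it is computed is left open


def filter_clues(clues, min_utility=0.5, max_clues=3):
    """Keep the highest-utility clues above a threshold (illustrative rule only)."""
    kept = [c for c in clues if c.utility >= min_utility]
    kept.sort(key=lambda c: c.utility, reverse=True)
    return kept[:max_clues]


def build_prompt(question, clues):
    """Assemble a clue-grounded prompt for a downstream answering model."""
    lines = [f"Clue {i}: {c.text}" for i, c in enumerate(clues, start=1)]
    lines.append(f"Question: {question}")
    lines.append("Answer using only the clues above.")
    return "\n".join(lines)


# Toy candidates with made-up utility scores.
candidates = [
    Clue("a man picks up a red cup", 0.9),
    Clue("the wall is white", 0.2),
    Clue("the man drinks from the cup", 0.7),
]
selected = filter_clues(candidates)
prompt = build_prompt("What does the man do after picking up the cup?", selected)
print(prompt)
```

In this toy setup the low-utility clue about the wall is dropped, so the prompt contains only evidence relevant to the question. A learned filter would replace the fixed threshold used here.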
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
