Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos
Henghui Du, Chunjie Zhang, Xi Chen, Chang Zhou, Di Hu

TL;DR
VideoDetective introduces a question-aware memory mechanism that iteratively compresses and aggregates critical clues from long videos, enabling efficient long video question-answering with limited context and reduced computational resources.
Contribution
The paper proposes a novel recurrent memory approach for long video question-answering that improves efficiency and accuracy by focusing on crucial information. It also introduces a new dataset, GLVC, for evaluation.
Findings
Enables processing of 100K tokens with only 32K context length
Keeps inference cost low (reported as 2 minutes and 37GB of GPU memory)
Outperforms existing methods on multiple long video benchmarks
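To see why segment-wise compression could let roughly 100K visual tokens fit in a 32K context, the following back-of-the-envelope sketch bounds the largest context seen in any single step. The segment size, number of memory tokens per segment, and question length are hypothetical values chosen for illustration; the summary above does not state them, and `context_upper_bound` is not a function from the paper.

```python
import math


def context_upper_bound(total_visual_tokens: int,
                        segment_tokens: int,
                        memory_tokens_per_segment: int,
                        question_tokens: int) -> int:
    """Upper bound on the context length of any single compression step.

    Assumes (hypothetically) that each step sees: all memory tokens kept
    from earlier segments, one sub-segment, the question, and the new
    memory tokens being written for that sub-segment.
    """
    num_segments = math.ceil(total_visual_tokens / segment_tokens)
    # The last step carries the most history: one memory block per earlier segment.
    history = (num_segments - 1) * memory_tokens_per_segment
    return history + segment_tokens + memory_tokens_per_segment + question_tokens


# Hypothetical configuration: 100K visual tokens, 16K-token sub-segments,
# 64 memory tokens per sub-segment, 64 question tokens.
bound = context_upper_bound(100_000, 16_000, 64, 64)
print(bound, bound <= 32_768)  # prints "16512 True"
```

Under these assumed numbers the peak context is dominated by the sub-segment size rather than the video length, which is the intuition behind the 32K-context finding.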
Abstract
Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which could also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, they may miss useful information or take considerable computation. In fact, when answering given questions, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposeful compression. This allows…
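The recurrent, question-aware loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: in VideoDetective the compression is performed by learned memory tokens inside the MLLM, whereas the sketch below swaps in a simple question-weighted softmax pooling as a stand-in. All names (`compress_segment`, `build_memory`) and the toy feature vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_segment(segment: List[Vector], question: Vector,
                     num_memory_tokens: int) -> List[Vector]:
    """Compress one sub-segment into at most `num_memory_tokens` vectors.

    Stand-in for learned memory tokens: each output is a softmax-weighted
    average of a chunk of the segment, weighted by similarity to the question.
    """
    chunk = math.ceil(len(segment) / num_memory_tokens)
    dim = len(question)
    memory_tokens = []
    for start in range(0, len(segment), chunk):
        part = segment[start:start + chunk]
        weights = softmax([dot(tok, question) for tok in part])
        pooled = [sum(w * tok[d] for w, tok in zip(weights, part))
                  for d in range(dim)]
        memory_tokens.append(pooled)
    return memory_tokens


def build_memory(video_tokens: List[Vector], question: Vector,
                 segment_len: int, num_memory_tokens: int) -> List[Vector]:
    """Process the video sub-segment by sub-segment, aggregating memory.

    Note: this toy pooling ignores the history memory; in the real model the
    history would be part of the context when each sub-segment is compressed.
    """
    memory: List[Vector] = []
    for start in range(0, len(video_tokens), segment_len):
        segment = video_tokens[start:start + segment_len]
        memory.extend(compress_segment(segment, question, num_memory_tokens))
    return memory


video = [[1.0, 0.0], [0.0, 1.0]] * 6   # 12 toy "visual tokens"
question = [10.0, 0.0]                 # toy question embedding
memory = build_memory(video, question, segment_len=4, num_memory_tokens=1)
print(len(memory))  # prints "3": one memory vector per sub-segment
```

Because the pooling weights depend on the question, each memory vector leans toward the tokens most similar to it, and the final answer would be generated from the short aggregated memory instead of the full token sequence.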
Peer Reviews
Decision · Submitted to ICLR 2026
1. The explored problem is meaningful and the motivation for using recurrent memory compression is clear.
2. The recurrent sub-segment processing limits peak memory usage and is friendly to deployment on real-time video interaction devices.
3. The GLVC benchmark is a contribution to the community for comprehensive evaluation of grounded video understanding.
1. The authors claim a dynamic compression ratio for different sub-segments to avoid over- or under-compression, but the technical details are not presented. The experiments only show results with different fixed compression ratios on different benchmarks.
2. The recurrent memory compression, with history memory continuously added to the context, is quite similar to [1]; the only difference is whether it is question-aware. Due to the lack of dynamic compression ratio, the advantage of question-aware co
+ The core idea is driven by a strong insight: "only a small amount of crucial information is required" to answer the question, making the question-aware filtering strategy sound.
+ A dataset, GLVC, was curated to validate the effectiveness of the method.
+ Good/competitive performance was achieved.
1) The memory design is query-dependent. When the question changes, the memory must be built again, reducing the proactiveness of the method.
2) Lack of efficiency metrics: A major claim of the paper is efficiency. However, there are no quantitative results comparing the proposed method's efficiency against baselines. This is critical for an LVQA paper. We need to see metrics like inference time against competitive methods (e.g., sparse attention models or other compression techniqu
**Originality**
- Dataset Contribution: GLVC provides temporal grounding annotations that could benefit the community for more rigorous evaluation of long video understanding capabilities.

**Quality**
- Well-Motivated Approach: The method addresses a real limitation of current MLLMs in handling long video contexts due to memory constraints.
- Comprehensive Evaluation: The paper evaluates on multiple established benchmarks (VideoMME, MLVU, LongVideoBench, etc.) covering both short and lo
**Fundamental Training Process Contradictions**
The paper contains a critical technical inconsistency in Section 3.3:
- Claims "memory tokens do not participate in loss calculation" yet the loss function $L = \sum_{j=1}^l -\log p(x_j|V_1, M_1, Q, \cdots, V_S, M_S, Q, x_0, \ldots, x_{j-1})$ explicitly depends on memory tokens $M_i$
- If memory tokens don't participate in loss calculation, how do they receive gradients for optimization?
- This creates a fundamental contradiction that questio
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
