SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
Yuan Sheng, Yanbin Hao, Chenxu Li, Shuo Wang, Xiangnan He

TL;DR
SeViCES introduces a training-free, model-agnostic framework that enhances long video understanding by selecting and fusing semantic and visual evidence through a consensus mechanism, improving accuracy and robustness.
Contribution
It proposes a novel evidence selection framework combining semantic and visual cues, addressing limitations of existing unimodal and non-temporal methods for long video comprehension.
Findings
Outperforms state-of-the-art methods in accuracy.
Demonstrates robustness across benchmarks.
Effectively fuses semantic and visual evidence.
Abstract
Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that…
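As a rough illustration of selecting frames by agreement between a semantic score and a visual score, here is a minimal sketch. It is not the paper's SVCFS procedure, whose details are cut off above. The function names `top_k_indices` and `consensus_select`, the per-frame score lists, and the rule "keep frames both branches rank in their top k, then fill by mean score" are all assumptions made for illustration.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, in temporal order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def consensus_select(
    semantic: Sequence[float], visual: Sequence[float], k: int
) -> List[int]:
    """Pick k frames, preferring those both branches place in their top k.

    Hypothetical rule (an assumption, not taken from the paper):
    1. keep frames that appear in both top-k sets;
    2. fill remaining slots with the highest mean-score frames.
    """
    if len(semantic) != len(visual):
        raise ValueError("both branches must score the same frames")
    agreed = set(top_k_indices(semantic, k)) & set(top_k_indices(visual, k))
    remaining = [i for i in range(len(semantic)) if i not in agreed]
    remaining.sort(key=lambda i: (semantic[i] + visual[i]) / 2, reverse=True)
    chosen = list(agreed) + remaining[: max(0, k - len(agreed))]
    return sorted(chosen)


if __name__ == "__main__":
    # Made-up scores for five frames, for demonstration only.
    sem = [0.9, 0.1, 0.8, 0.2, 0.7]
    vis = [0.8, 0.9, 0.1, 0.2, 0.6]
    print(consensus_select(sem, vis, k=2))
```

Returning indices in temporal order keeps the selected frames usable as an ordered clip for a downstream Video-LLM.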
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Clear and timely problem focus (long video + Video-LLMs). The paper tackles an important and under-served problem: long video QA where naive uniform sampling or brute-force token input is infeasible. The motivation (computational cost, diluted attention, loss of reasoning consistency) is well-argued and backed by recent prior work. 2. Conceptually neat “consensus-driven evidence selection” idea. Combining semantic (caption + LLM reasoning) and visual (embedding + clustering + MI) signals for
1. My major concern is the computational cost and scalability of the proposed LLM-based scoring method. TAS-FS requires two LLM calls per frame (independent + temporal context), plus captioning via BLIP-2 for all frames. Thus the computational cost is significantly higher than that of other methods with simple LLM usage. 2. The computational cost is neither measured nor discussed in this work. It is recommended to compare the runtime cost with that of some benchmark methods.
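The call count in this review can be made concrete with a back-of-the-envelope sketch. It assumes only what the reviewer states (two LLM calls and one captioning call per candidate frame). The function name `estimate_model_calls` and the example frame count are hypothetical, and no latency figures are implied.

```python
def estimate_model_calls(
    num_frames: int,
    llm_calls_per_frame: int = 2,      # independent + temporal-context scoring, per the review
    caption_calls_per_frame: int = 1,  # one captioning pass per frame, per the review
) -> dict:
    """Count model invocations needed to score every candidate frame."""
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    llm_calls = num_frames * llm_calls_per_frame
    caption_calls = num_frames * caption_calls_per_frame
    return {
        "llm_calls": llm_calls,
        "caption_calls": caption_calls,
        "total_calls": llm_calls + caption_calls,
    }


if __name__ == "__main__":
    # Hypothetical example: 1,000 candidate frames sampled from one long video.
    print(estimate_model_calls(1000))
```

The count grows linearly with the number of candidate frames, which is why the reviewer asks for a measured runtime comparison.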
- The paper is easy to follow and well-written. - The proposed method is training-free and model-agnostic, which is easy to extend.
- The paper is motivated by overcoming the "prohibitive computational costs" of processing long videos. However, the proposed SeViCES framework introduces a new, significant, and entirely unmeasured computational bottleneck: inference latency. So, it would be better to provide a complexity analysis of the proposed method. - A primary strength of a "training-free" method should be its universal applicability to any MLLM. However, the paper's experiments are limited to three open-source models.
1. **Training-free and model-agnostic design.** The proposed method can be plugged into multiple VideoLLMs without training. 2. **Consistent performance gains.** SeViCES shows consistent performance gains across multiple video benchmarks, including long video understanding tasks, showing its effectiveness.
1. **Limited novelty.** The proposed SeViCES framework shows limited novelty. The core ideas, (1) LLM-based frame caption scoring and (2) visual feature-based frame clustering for frame selection, have already been explored in VideoTree [1]. As such, both the objective and methodology of SeViCES closely resemble those of VideoTree. A more detailed performance comparison and discussion of their differences and advantages are needed to justify the novelty of this work. [1] Wang et al., Vi
- A significant strength is the proposed dual-branch (semantic and visual) frame selection module, which explicitly addresses the limitation of unimodal approaches by leveraging the complementary strengths of LLM-based reasoning on captions and cluster-guided visual alignment to capture more complete, query-relevant context. - The paper is well-written and easy to follow.
1. Regarding the "Semantic-Visual Consensus Frame Selection" method, the authors argue that traditional CLIP-based methods for measuring text-frame relevance are difficult to apply directly to videos containing temporal information. They instead propose converting frames into captions and using an LLM for assessment. I have the following concerns: - Could the frame-to-caption conversion process itself lead to inaccurate descriptions due to the loss of temporal information? - Since the
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
