HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin

TL;DR
HiMu is a training-free framework for long video question answering. It selects relevant frames by decomposing the query into a hierarchical tree of atomic predicates and combining visual and audio signals, improving the efficiency-accuracy trade-off.
Contribution
HiMu introduces a hierarchical, logic-based frame selection method that bridges the gap between similarity-based and agent-based approaches without additional training.
Findings
Outperforms existing selectors at 16 frames with Qwen3-VL 8B.
Surpasses agentic systems at 32-512 frames with significantly fewer FLOPs.
Enhances the efficiency-accuracy trade-off in long video QA.
Abstract
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up…
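The pipeline described in the abstract (per-predicate expert scores that are normalized, temporally smoothed, and composed bottom-up over a logic tree) can be illustrated with a small sketch. This is not the authors' implementation. The abstract is truncated before it states the composition rule, so the operators used here (AND as minimum, OR as maximum), the min-max normalization, the moving-average smoothing, the top-k selection, and all names (`Leaf`, `Node`, `evaluate`, `select_frames`) are assumptions chosen for illustration. The per-frame expert scores are made-up numbers standing in for outputs of experts such as CLIP or ASR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Scores = List[float]  # one relevance score per frame


@dataclass
class Leaf:
    predicate: str  # atomic predicate; key into the expert score table


@dataclass
class Node:
    op: str  # "AND" or "OR" (hypothetical operator set)
    children: List["Tree"] = field(default_factory=list)


Tree = Union[Leaf, Node]


def normalize(scores: Scores) -> Scores:
    """Min-max normalize to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def smooth(scores: Scores, radius: int = 1) -> Scores:
    """Moving average over a window of up to 2*radius + 1 frames (assumed)."""
    out = []
    for t in range(len(scores)):
        window = scores[max(0, t - radius): t + radius + 1]
        out.append(sum(window) / len(window))
    return out


def evaluate(tree: Tree, leaf_scores: Dict[str, Scores]) -> Scores:
    """Compose per-frame scores bottom-up over the logic tree."""
    if isinstance(tree, Leaf):
        return smooth(normalize(leaf_scores[tree.predicate]))
    child_scores = [evaluate(c, leaf_scores) for c in tree.children]
    if tree.op == "AND":
        combine = min  # assumption: fuzzy conjunction
    elif tree.op == "OR":
        combine = max  # assumption: fuzzy disjunction
    else:
        raise ValueError(f"unknown operator: {tree.op}")
    return [combine(vals) for vals in zip(*child_scores)]


def select_frames(scores: Scores, k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames in temporal order."""
    top = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]
    return sorted(top)


# Made-up expert outputs for a 6-frame clip (illustrative only).
leaf_scores = {
    "visual: person opens door": [0.1, 0.2, 0.9, 0.8, 0.1, 0.0],
    "audio: doorbell rings": [0.0, 0.7, 0.8, 0.1, 0.0, 0.0],
}
tree = Node("AND", [Leaf("visual: person opens door"),
                    Leaf("audio: doorbell rings")])

frame_scores = evaluate(tree, leaf_scores)
selected = select_frames(frame_scores, k=2)
print(selected)
```

In this toy setup, a frame scores highly under AND only when both the visual and the audio predicate are active nearby in time. The smoothing step is what lets an audio event and a visual event that peak on adjacent frames still reinforce each other.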
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
