HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Dan Ben-Ami; Gabriele Serussi; Kobi Cohen; Chaim Baskin

arXiv:2603.18558·cs.CV·March 20, 2026

TL;DR

HiMu is a training-free framework for long video question answering. It decomposes each query into a hierarchy of atomic predicates, scores them with lightweight vision and audio experts, and combines the resulting signals to select relevant frames, improving the efficiency-accuracy trade-off.

Contribution

HiMu introduces a hierarchical, logic-based frame selection method that bridges the gap between similarity-based and agent-based approaches without additional training.

Findings

01. Outperforms existing selectors at 16 frames with Qwen3-VL 8B.

02. Surpasses agentic systems at 32-512 frames with significantly fewer FLOPs.

03. Enhances the efficiency-accuracy trade-off in long video QA.

Abstract

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up…
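
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract is truncated before it states how signals are composed, so the AND/OR operators and their min/max semantics here are assumptions, as are min-max normalization, moving-average smoothing, top-k selection, and every name (`Leaf`, `Node`, `normalize`, `smooth`, `evaluate`, `select_frames`). The per-frame expert scores are made-up numbers standing in for real CLIP, OCR, and ASR outputs.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple, Union


@dataclass
class Leaf:
    """Atomic predicate routed to one expert (hypothetical names, e.g. 'clip')."""
    expert: str
    text: str


@dataclass
class Node:
    """Internal tree node. The 'and'/'or' operator set is an assumption."""
    op: str
    children: Sequence[Union["Node", Leaf]]


def normalize(scores: List[float]) -> List[float]:
    """Min-max normalize to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def smooth(scores: List[float], radius: int = 1) -> List[float]:
    """Centered moving average, truncated at the sequence boundaries."""
    return [
        sum(window) / len(window)
        for t in range(len(scores))
        for window in [scores[max(0, t - radius): t + radius + 1]]
    ]


def evaluate(tree: Union[Node, Leaf],
             raw: Dict[Tuple[str, str], List[float]],
             radius: int = 1) -> List[float]:
    """Compose per-frame scores bottom-up: min for 'and', max for 'or'."""
    if isinstance(tree, Leaf):
        return smooth(normalize(raw[(tree.expert, tree.text)]), radius)
    if tree.op not in ("and", "or"):
        raise ValueError(f"unsupported operator: {tree.op}")
    children = [evaluate(child, raw, radius) for child in tree.children]
    combine = min if tree.op == "and" else max
    return [combine(frame_vals) for frame_vals in zip(*children)]


def select_frames(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


# Illustrative query: "a man opens a door AND (someone says hello OR an EXIT sign)".
tree = Node("and", [
    Leaf("clip", "a man opens a door"),
    Node("or", [Leaf("asr", "hello"), Leaf("ocr", "EXIT")]),
])

# Made-up raw expert scores for a six-frame clip.
raw = {
    ("clip", "a man opens a door"): [0.1, 0.2, 0.9, 0.8, 0.1, 0.0],
    ("asr", "hello"): [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    ("ocr", "EXIT"): [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
}

scores = evaluate(tree, raw)
picked = select_frames(scores, k=2)
print(picked)
```

Smoothing each leaf before composition lets a spoken cue and a visual cue that fall on neighboring frames still reinforce each other under the AND node, which is one plausible reading of "temporally smoothed to align different modalities".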

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning