OVG-HQ: Online Video Grounding with Hybrid-modal Queries

Runhao Zeng; Jiaqi Mao; Minghao Lai; Minh Hieu Phan; Yanjie Dong; Wei Wang; Qi Chen; Xiping Hu

arXiv:2508.11903·cs.CV·August 19, 2025

TL;DR

This paper introduces OVG-HQ, an online video grounding task in which queries may be text, images, video segments, or combinations of these, and proposes a unified model and a dataset for localizing moments in streaming video under such queries.

Contribution

We propose OVG-HQ-Unify, a novel framework with a Parametric Memory Block and cross-modal distillation to improve hybrid-modal online video grounding.

Findings

01 Our model outperforms existing methods in accuracy and efficiency.

02 The new dataset QVHighlights-Unify enables comprehensive evaluation.

03 Adapted online metrics effectively measure real-time performance.

Abstract

The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles in scenarios such as streaming video or queries that use visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that retains previously learned knowledge to enhance current decisions and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to…
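As a rough illustration of what a cross-modal distillation signal can look like, the sketch below uses the generic temperature-scaled KL divergence from standard knowledge distillation. It is not the loss used in OVG-HQ-Unify, whose details are not given here. The function names, the temperature value, and the idea of treating per-frame relevance scores from a dominant-modality query (for example text) as the teacher and scores from a weaker-modality query (for example an image) as the student are all assumptions made for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw per-frame relevance scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_scores, student_scores, temperature=2.0):
    """KL(teacher || student) over frames (hypothetical helper).

    teacher_scores: scores produced with the dominant-modality query.
    student_scores: scores produced with the non-dominant-modality query.
    A smaller value means the student's frame ranking is closer to the teacher's.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    text_query_scores = [0.1, 2.5, 3.0, 0.2]   # made-up teacher scores
    image_query_scores = [0.5, 1.0, 1.2, 0.9]  # made-up student scores
    print(distillation_loss(text_query_scores, image_query_scores))
```

In a training loop, a term like this would typically be added to the grounding loss for the weaker modality, so that its predictions are pulled toward those of the stronger one; how the paper weights or schedules such a term is not stated in this summary.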

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques