E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching

Jiahao Nie; Wenbin An; Gongjie Zhang; Yicheng Xu; Yap-Peng Tan; Alex C. Kot; Shijian Lu

arXiv:2602.05215·cs.CV·February 6, 2026

Open Access

TL;DR

E.M.Ground is a Video Large Language Model for temporal video grounding. It captures holistic event semantics with a single <event> token, reduces noise in token-to-frame similarities with Savitzky-Golay smoothing, and improves event matching through multi-grained feature aggregation.

Contribution

It introduces an <event> token, Savitzky-Golay smoothing of token-to-frame similarities, and multi-grained feature aggregation to improve event perception and matching in temporal video grounding.

Findings

01 Outperforms state-of-the-art Vid-LLMs on benchmark datasets

02 Achieves significant improvements in event localization accuracy

03 Demonstrates robustness to noise and information loss

Abstract

Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special <event> token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction…
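The Savitzky-Golay smoothing step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 5-point quadratic window, the edge padding, the 0.5 threshold, the helper names `savgol_smooth` and `longest_segment_above`, and the similarity values are all assumptions chosen for illustration. The only fixed fact used is the standard set of quadratic Savitzky-Golay coefficients for a 5-point window, (-3, 12, 17, 12, -3) / 35.

```python
# Standard quadratic Savitzky-Golay coefficients for a 5-point window.
SG_COEFFS_5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_smooth(values, coeffs=SG_COEFFS_5):
    """Smooth a 1-D sequence with fixed Savitzky-Golay coefficients."""
    half = len(coeffs) // 2
    # Assumption: pad both ends by repeating the edge value.
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [
        sum(c * padded[i + k] for k, c in enumerate(coeffs))
        for i in range(len(values))
    ]


def longest_segment_above(scores, threshold):
    """Return (start, end) inclusive indices of the longest run >= threshold."""
    best = None
    start = None
    # The -inf sentinel closes a run that reaches the last frame.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


# Hypothetical <event>-token-to-frame similarities: one event with a noisy dip.
raw = [0.1, 0.1, 0.1, 0.8, 0.9, 0.3, 0.9, 0.8, 0.1, 0.1, 0.1]
smoothed = savgol_smooth(raw)

raw_segment = longest_segment_above(raw, 0.5)
smoothed_segment = longest_segment_above(smoothed, 0.5)
print(raw_segment, smoothed_segment)
```

In this toy input, the dip at frame 5 splits the raw curve into two short runs, while the smoothed curve yields a single segment covering frames 3 to 7. That is the kind of prediction stability the abstract attributes to smoothing, though the paper's actual decoding procedure may differ.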

Taxonomy

Topics: Time Series Analysis and Forecasting · Human Pose and Action Recognition · Multimodal Machine Learning Applications