MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding

Pengcheng Fang; Yuxia Chen; Xiaohao Cai

arXiv:2604.25886 · cs.MM · May 5, 2026

TL;DR

MarkIt is a training-free framework that enhances video language models' ability to precisely localize events in untrimmed videos by transforming videos into query-conditioned marked videos with explicit visual cues.

Contribution

It introduces a training-free, annotation-free method that converts an input video into a query-conditioned marked video, improving temporal grounding accuracy without modifying existing models.

Findings

1. Achieves state-of-the-art results on multiple benchmarks.

2. Improves temporal localization consistency across models.

3. Requires no training or fine-tuning of Vid-LLMs.

Abstract

Video temporal grounding (VTG) aims to localize the start and end timestamps of the event described by a given query within an untrimmed video. Despite the strong open-world video understanding and recognition ability of video large language models (Vid-LLMs), outputting precise temporal grounding information remains challenging, since explicit temporal cues are scarce in untrimmed videos and query-relevant entities are hard to track consistently across the video timeline. In this paper, we present MarkIt, a training-free framework that transforms an input video into a query-conditioned marked video, which enables Vid-LLMs to generate more reliable temporal localization predictions. The core component of MarkIt is an annotation-free query-to-mask grounding bridge (Q2M-Bridge). Given a natural-language query, it automatically derives a compact set of canonical subject tags…
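To make the VTG task format concrete: a prediction is a (start, end) interval in seconds, and one common way to score it against a reference interval is temporal intersection-over-union. The sketch below is illustrative only. The abstract shown here does not state which metrics the paper uses, and the names `Interval` and `temporal_iou` are assumptions introduced for this example, not part of MarkIt.

```python
from typing import Tuple

# Hypothetical type for illustration: (start_sec, end_sec) of an event.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Temporal IoU between a predicted and a ground-truth interval.

    Assumption: both intervals are given in seconds with start <= end.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    if pred_end < pred_start or gt_end < gt_start:
        raise ValueError("each interval must satisfy start <= end")

    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection

    # Two zero-length intervals have an empty union; treat that as no overlap.
    return intersection / union if union > 0 else 0.0
```

For example, a prediction of (10.0, 20.0) against a reference of (15.0, 25.0) overlaps for 5 seconds out of a 15-second union, giving a temporal IoU of one third.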
