How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
Shengji Jin, Yuanhao Zou, Victor Zhu, Zhengping Ji, Chen Chen

TL;DR
This study systematically compares three Video Temporal Grounding (VTG) output paradigms on identical models and datasets, and finds that the continuous distribution paradigm offers the best efficiency-accuracy balance for resource-constrained deployment.
Contribution
The paper provides a controlled empirical analysis of VTG output paradigms, shows how the choice of paradigm affects accuracy and efficiency, and offers guidelines for designing deployment-ready systems.
Findings
The continuous distribution paradigm achieves the best efficiency-accuracy trade-off.
Output formulation significantly impacts both accuracy and computational cost.
The choice of output paradigm influences system performance independently of model scale.
Abstract
While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization…
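To make the three paradigms named in the abstract concrete, the sketch below shows one way each output form could be turned back into a (start, end) span in seconds. It is a minimal illustration, not the paper's implementation: the function names, the `<T15>`-style token format, the bin count `NUM_BINS`, and the normalized-value head for continuous decoding are all assumptions introduced here.

```python
import re

NUM_BINS = 100  # hypothetical size of the temporal-token vocabulary


def decode_text_numerals(text):
    """Text numeral generation: timestamps appear as ordinary digits in the text."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    start, end = (float(x) for x in numbers[:2])
    return start, end


def decode_temporal_tokens(tokens, duration):
    """Temporal token generation: each special token indexes a bin of the video."""
    bins = [int(re.fullmatch(r"<T(\d+)>", t).group(1)) for t in tokens]
    start, end = (b / NUM_BINS * duration for b in bins)
    return start, end


def decode_continuous(normalized, duration):
    """Continuous decoding: real values in [0, 1] are scaled by the video duration."""
    start, end = (min(max(v, 0.0), 1.0) * duration for v in normalized)
    return start, end


duration = 30.0  # seconds, illustrative
print(decode_text_numerals("from 4.5 to 12.0 seconds"))
print(decode_temporal_tokens(["<T15>", "<T40>"], duration))
print(decode_continuous((0.15, 0.40), duration))
```

In this sketch the text form spends several generated tokens per timestamp, the token form spends one token per boundary at the cost of quantization to `duration / NUM_BINS`, and the continuous form needs no autoregressive decoding for the span itself. Whether these cost differences match the paper's measurements is not something this example establishes.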
