SimBase: A Simple Baseline for Temporal Video Grounding
Peijun Bao, Alex C. Kot

TL;DR
SimBase is a lightweight baseline for temporal video grounding that uses only one-dimensional temporal convolutions and an element-wise product for fusion. It achieves state-of-the-art results on two large-scale datasets and aims to simplify the evaluation process.
Contribution
This paper demonstrates that a simplified model with minimal complexity can outperform more complex architectures in temporal video grounding.
Findings
Achieves state-of-the-art performance on two large-scale datasets.
Uses only lightweight 1D temporal convolutions and an element-wise product for multimodal fusion.
Simplifies the model design without sacrificing accuracy.
Abstract
This paper presents SimBase, a simple yet effective baseline for temporal video grounding. While recent advances in temporal grounding have led to impressive performance, they have also driven network architectures toward greater complexity, with a range of methods to (1) capture temporal relationships and (2) achieve effective multimodal fusion. In contrast, this paper explores the question: How effective can a simplified approach be? To investigate, we design SimBase, a network that leverages lightweight, one-dimensional temporal convolutional layers instead of complex temporal structures. For cross-modal interaction, SimBase only employs an element-wise product instead of intricate multimodal fusion. Remarkably, SimBase achieves state-of-the-art results on two large-scale datasets. As a simple yet powerful baseline, we hope SimBase will spark new ideas and streamline future…
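The two building blocks named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`temporal_conv1d`, `fuse`, `clip_scores`), the depthwise smoothing kernel, the feature sizes, and the sum-based scoring are all assumptions made for illustration, and only the Python standard library is used. It shows a 1D convolution along the time axis of clip features, followed by an element-wise product with a sentence-level query vector.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]  # T clips, each a D-dimensional feature vector


def temporal_conv1d(x: Sequence, kernel: List[float]) -> Sequence:
    """Depthwise 1D convolution along time with zero padding (output keeps length T).

    Hypothetical helper: the same kernel is applied to every channel, and, as in
    most deep learning libraries, the kernel is not flipped (cross-correlation).
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    num_clips, dim = len(x), len(x[0])
    out = []
    for t in range(num_clips):
        row = [0.0] * dim
        for j, w in enumerate(kernel):
            src = t + j - half
            if 0 <= src < num_clips:  # positions outside the video count as zeros
                for d in range(dim):
                    row[d] += w * x[src][d]
        out.append(row)
    return out


def fuse(video: Sequence, query: Vector) -> Sequence:
    """Element-wise (Hadamard) product of each clip feature with the query vector."""
    return [[v * q for v, q in zip(clip, query)] for clip in video]


def clip_scores(fused: Sequence) -> Vector:
    """Assumed scoring rule for the demo: sum the fused channels of each clip."""
    return [sum(clip) for clip in fused]


# Toy inputs: 3 clips with 2-dimensional features and a 2-dimensional query.
video = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
query = [1.0, 0.5]

smoothed = temporal_conv1d(video, [0.25, 0.5, 0.25])
fused = fuse(smoothed, query)
scores = clip_scores(fused)
print(scores)  # [1.125, 1.875, 1.375]
```

In a trained model the kernel weights would be learned and the fused features would feed a prediction head; the abstract does not describe those parts, so they are left out here.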
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
