SimBase: A Simple Baseline for Temporal Video Grounding

Peijun Bao; Alex C. Kot

arXiv:2411.07945 · cs.CV · November 13, 2024

TL;DR

SimBase is a lightweight baseline for temporal video grounding built from 1D temporal convolutions and element-wise product fusion. It achieves state-of-the-art results on two large-scale datasets and simplifies the evaluation process.

Contribution

This paper demonstrates that a simplified model with minimal complexity can outperform more complex architectures in temporal video grounding.

Findings

01. Achieves state-of-the-art performance on two large-scale datasets.

02. Uses only lightweight 1D temporal convolutions for temporal modeling and an element-wise product for cross-modal fusion.

03. Simplifies the model design without sacrificing accuracy.

Abstract

This paper presents SimBase, a simple yet effective baseline for temporal video grounding. While recent advances in temporal grounding have led to impressive performance, they have also driven network architectures toward greater complexity, with a range of methods to (1) capture temporal relationships and (2) achieve effective multimodal fusion. In contrast, this paper explores the question: How effective can a simplified approach be? To investigate, we design SimBase, a network that leverages lightweight, one-dimensional temporal convolutional layers instead of complex temporal structures. For cross-modal interaction, SimBase only employs an element-wise product instead of intricate multimodal fusion. Remarkably, SimBase achieves state-of-the-art results on two large-scale datasets. As a simple yet powerful baseline, we hope SimBase will spark new ideas and streamline future…
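
The two operations named in the abstract, element-wise product fusion and a one-dimensional temporal convolution, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fuse_elementwise` and `temporal_conv1d`, the feature shapes, the kernel layout, and the toy values are all assumptions chosen for illustration, and the real model's layer sizes and prediction head are not described in this summary.

```python
from typing import List

Vector = List[float]


def fuse_elementwise(clips: List[Vector], sentence: Vector) -> List[Vector]:
    """Fuse modalities by multiplying each clip feature with the
    sentence feature, channel by channel (hypothetical helper)."""
    return [[c * s for c, s in zip(clip, sentence)] for clip in clips]


def temporal_conv1d(
    seq: List[Vector], kernel: List[List[Vector]], bias: Vector
) -> List[Vector]:
    """1D convolution along the time axis with zero padding, so the
    output has the same number of time steps as the input.

    kernel[k][o][i] is the weight for tap k, output channel o,
    input channel i. An odd number of taps is assumed.
    """
    taps = len(kernel)
    pad = taps // 2
    zero = [0.0] * len(seq[0])
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for t in range(len(seq)):
        row = []
        for o, b in enumerate(bias):
            acc = b
            for k in range(taps):
                frame = padded[t + k]
                acc += sum(w * x for w, x in zip(kernel[k][o], frame))
            row.append(acc)
        out.append(row)
    return out


# Toy input: 3 clips with 2-dimensional features, one sentence vector.
clips = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
sentence = [2.0, 0.5]

fused = fuse_elementwise(clips, sentence)

# Assumed kernel: 3 taps, 1 output channel, 2 input channels, all ones.
kernel = [[[1.0, 1.0]] for _ in range(3)]
scores = temporal_conv1d(fused, kernel, bias=[0.0])
print(scores)
```

Here `scores` holds one value per clip. In a grounding model, per-clip outputs of this kind would feed a head that predicts the start and end of the described moment; that head is omitted because the summary does not specify it.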

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition