No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

Qi Zhang; Sipeng Zheng; Qin Jin

arXiv:2307.10567·cs.CV·July 21, 2023

TL;DR

This paper introduces a lightweight, no-frills temporal video grounding model that uses multi-scale neighboring attention and zoom-in boundary detection to improve accuracy and speed in localizing language queries in videos.

Contribution

The paper presents a novel, simple TVG model with multi-scale neighboring attention and zoom-in boundary detection, achieving competitive results with faster inference and fewer parameters.

Findings

01

Achieves competitive performance on TVG benchmarks.

02

Faster inference speed due to lightweight architecture.

03

Remains effective in low-SNR scenarios, where noise is high.

Abstract

Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video. A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)": performance degrades as SNR decreases. Prior works have addressed this challenge using sophisticated techniques. In this paper, we propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection. The multi-scale neighboring attention restricts each video token to aggregate visual context only from its neighbors, enabling the extraction of the most distinguishing information with multi-scale feature hierarchies from high-ratio noise. The zoom-in boundary detection then focuses on local discrimination among the selected top candidates for fine-grained grounding adjustment. With an end-to-end training strategy, our model…
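
The neighbor restriction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of raw token features as queries, keys, and values, and the idea of approximating "multi-scale" by running several window radii are all assumptions made for illustration. The paper builds multi-scale feature hierarchies, which this sketch does not reproduce.

```python
import math


def neighboring_attention(tokens, radius):
    """Self-attention in which token i attends only to tokens j with |i - j| <= radius.

    Hypothetical helper: queries, keys, and values are the raw token features
    (no learned projections), which is an assumption for this sketch.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for i, query in enumerate(tokens):
        lo = max(0, i - radius)
        hi = min(len(tokens), i + radius + 1)
        neighbors = tokens[lo:hi]
        # Scaled dot-product scores against the local window only.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in neighbors]
        # Numerically stable softmax over the window.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the neighbor features.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, neighbors))
            for d in range(dim)
        ])
    return outputs


def multi_scale_neighboring_attention(tokens, radii=(1, 2, 4)):
    """Run neighboring attention once per radius; returns {radius: outputs}.

    Using several radii is an assumed stand-in for the paper's multi-scale design.
    """
    return {radius: neighboring_attention(tokens, radius) for radius in radii}


if __name__ == "__main__":
    # Four toy video tokens with two feature dimensions each.
    video_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    for radius, out in multi_scale_neighboring_attention(video_tokens, (1, 2)).items():
        print(radius, out[0])
```

Because each token only sees its window, changing a distant token leaves the output for that token unchanged, which is the locality property the abstract relies on to suppress unrelated content.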

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings