Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

Jingjing Hu; Dan Guo; Kun Li; Zhan Si; Xun Yang; Xiaojun Chang; Meng Wang

arXiv:2403.14174 · cs.CV · April 14, 2025 · 1 citation

TL;DR

This paper introduces UniSDNet, a unified network for efficient video grounding that combines static and dynamic modeling inspired by human visual perception, achieving state-of-the-art results and faster inference.

Contribution

The paper proposes a novel unified network architecture that integrates static and dynamic video modeling for improved cross-modal video grounding performance.

Findings

01 Achieves state-of-the-art results on multiple datasets.

02 Runs inference faster than the compared methods.

03 Introduces new datasets for spoken language video grounding.

Abstract

Inspired by the activity-silent and persistent activity mechanisms in human visual perception biology, we design a Unified Static and Dynamic Network (UniSDNet), to learn the semantic association between the video and text/audio queries in a cross-modal environment for efficient video grounding. For static modeling, we devise a novel residual structure (ResMLP) to boost the global comprehensive interaction between the video segments and queries, achieving more effective semantic enhancement/supplement. For dynamic modeling, we effectively exploit three characteristics of the persistent activity mechanism in our network design for a better video context comprehension. Specifically, we construct a diffusely connected video clip graph on the basis of 2D sparse temporal masking to reflect the "short-term effect" relationship. We innovatively consider the temporal distance and relevance as…
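The idea of a sparsely connected clip graph can be illustrated with a minimal sketch. The connectivity rule below is an assumption made for illustration only, not the rule used in UniSDNet: clips within a small window are fully connected, and more distant clips are linked only when their distance is a power of two. The function name `sparse_clip_edges` and the parameters `dense_window` and `max_hops` are hypothetical.

```python
def sparse_clip_edges(num_clips, dense_window=4, max_hops=16):
    """Build undirected edges (i, j) with i < j between video clip indices.

    Hypothetical rule (not taken from the paper): connect every pair of
    clips at distance <= dense_window; beyond that, keep only pairs whose
    distance is a power of two, up to max_hops.
    """
    edges = []
    for i in range(num_clips):
        last = min(num_clips - 1, i + max_hops)
        for j in range(i + 1, last + 1):
            d = j - i
            is_power_of_two = (d & (d - 1)) == 0
            if d <= dense_window or is_power_of_two:
                edges.append((i, j))
    return edges


if __name__ == "__main__":
    edges = sparse_clip_edges(10, dense_window=2, max_hops=8)
    full = 10 * 9 // 2  # number of edges in a fully connected graph
    print(f"{len(edges)} sparse edges versus {full} dense edges")
```

Under this assumed rule, nearby clips interact densely while the number of long-range edges grows only logarithmically with distance, which is one way to keep message passing over a long video cheap.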

Code & Models

Repositories

xian-sh/unisdnet
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Analysis and Summarization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training · Convolution