You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
Xiang Fang, Daizong Liu, Pan Zhou, Guoshun Nan

TL;DR
This paper introduces a novel compressed-domain approach for temporal sentence grounding in videos, utilizing raw bit-stream features to improve efficiency and effectiveness over traditional methods that rely on fully decoded frames.
Contribution
It proposes the Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework that directly exploits compressed video features for temporal grounding, reducing computational complexity and latency.
Findings
TCSF outperforms state-of-the-art methods on three datasets.
The approach achieves higher accuracy with lower computational cost.
TCSF uses I-frame, motion vector, and residual features for grounding.
Abstract
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate the target moment that semantically corresponds to a sentence query. Although previous works have achieved decent success, they focus only on high-level visual features extracted from consecutive decoded frames and fail to handle compressed videos for query modelling, suffering from insufficient representation capability and significant computational complexity during training and testing. In this paper, we pose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input. To handle the raw video bit-stream input, we propose a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework, which extracts and aggregates three kinds of low-level visual features (I-frame, motion vector and residual features) for…
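To make the three compressed-domain signals named in the abstract more concrete, here is a minimal sketch of how a predicted frame is typically rebuilt in block-based video codecs: blocks of a reference I-frame are shifted by motion vectors, and a residual is added to correct what the motion cannot explain. This is an illustration of the general codec idea only, not the TCSF implementation. The function name `reconstruct_p_frame`, the fixed block size, the integer-pixel vectors, and the toy pixel values are all assumptions made for this example; real codecs use variable block sizes and sub-pixel motion.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale frame as rows of pixel values


def reconstruct_p_frame(
    reference: Frame,
    motion_vectors: List[List[Tuple[int, int]]],
    residual: Frame,
    block: int,
) -> Frame:
    """Rebuild a predicted frame: motion-compensated reference plus residual.

    Hypothetical helper for illustration. motion_vectors[by][bx] = (dy, dx)
    means the block at (by, bx) is copied from the reference at a pixel
    offset of (dy, dx).
    """
    h, w = len(reference), len(reference[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = motion_vectors[y // block][x // block]
            # Clamp the source coordinate so vectors pointing outside
            # the frame still read a valid pixel.
            sy = min(max(y + dy, 0), h - 1)
            sx = min(max(x + dx, 0), w - 1)
            out[y][x] = reference[sy][sx] + residual[y][x]
    return out


# Toy 4x4 example with 2x2 blocks (all values are made up).
i_frame = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [30, 30, 40, 40],
    [30, 30, 40, 40],
]
# The top-left block copies from 2 pixels to the right; the others do not move.
mvs = [[(0, 2), (0, 0)], [(0, 0), (0, 0)]]
res = [[0] * 4 for _ in range(4)]
res[3][3] = 5  # a small correction the motion vectors cannot explain

p_frame = reconstruct_p_frame(i_frame, mvs, res, block=2)
```

In this toy setup the motion vectors and residual together are much sparser than a full frame, which is the intuition behind working on the bit-stream signals directly rather than decoding every frame first.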
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
