Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos
Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou

TL;DR
This paper introduces an Adaptive Proposal Generation Network (APGN) for temporal sentence localization in videos. APGN keeps the efficiency of bottom-up approaches while retaining the segment-level interaction that top-down approaches exploit, which improves localization performance.
Contribution
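The paper proposes APGN, which adaptively generates proposals by classifying video frames as foreground or background and then regressing on the foreground frames. This avoids handcrafted, redundant proposals and improves the semantic quality of the proposals, and the resulting model outperforms existing methods.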
Findings
APGN achieves state-of-the-art results on three benchmarks.
The method reduces redundant proposals compared to traditional top-down approaches.
Semantic quality of proposals is significantly improved.
Abstract
We address the problem of temporal sentence localization in videos (TSLV). Traditional methods follow a top-down framework, which localizes the target segment with pre-defined segment proposals. Although they have achieved decent performance, the proposals are handcrafted and redundant. Recently, the bottom-up framework has attracted increasing attention due to its superior efficiency. It directly predicts the probability of each frame being a boundary. However, the performance of the bottom-up model is inferior to that of its top-down counterpart, as it fails to exploit segment-level interaction. In this paper, we propose an Adaptive Proposal Generation Network (APGN) to maintain segment-level interaction while improving efficiency. Specifically, we first perform a foreground-background classification upon the video and regress on the foreground frames to adaptively generate proposals. In this…
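The classify-then-regress step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function name generate_proposals, the inputs fg_scores, start_offsets and end_offsets, and the 0.5 threshold are all hypothetical, and the choice to regress each foreground frame's distance to the segment start and end is an assumption, since the abstract does not state the regression target.

```python
from typing import List, Tuple

Proposal = Tuple[float, float, float]  # (start_frame, end_frame, foreground_score)


def generate_proposals(
    fg_scores: List[float],
    start_offsets: List[float],
    end_offsets: List[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Proposal]:
    """Illustrative sketch: only frames scored as foreground emit a proposal.

    Assumption: each frame t comes with regressed distances to the segment
    start and end, so its proposal is (t - start_offset, t + end_offset).
    """
    num_frames = len(fg_scores)
    proposals: List[Proposal] = []
    for t, score in enumerate(fg_scores):
        if score < threshold:
            continue  # background frame: contributes no proposal
        start = max(0.0, t - start_offsets[t])
        end = min(float(num_frames - 1), t + end_offsets[t])
        if end > start:  # drop degenerate (empty) segments
            proposals.append((start, end, score))
    return proposals


# Toy input: 6 frames, of which frames 2..4 are scored as foreground.
fg_scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
start_offsets = [0.0, 0.0, 0.0, 1.0, 2.0, 0.0]
end_offsets = [0.0, 0.0, 2.0, 1.0, 0.0, 0.0]

proposals = generate_proposals(fg_scores, start_offsets, end_offsets)
for start, end, score in proposals:
    print(f"segment [{start:.1f}, {end:.1f}] score={score:.1f}")
```

In this toy input only the three foreground frames emit a proposal, whereas enumerating every possible segment over 6 frames would give 15 candidates. That gap is the kind of redundancy reduction the summary attributes to adaptive proposal generation.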
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
