CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Zhijian Hou; Wanjun Zhong; Lei Ji; Difei Gao; Kun Yan; Wing-Kwong Chan; Chong-Wah Ngo; Zheng Shou; Nan Duan

arXiv:2209.10918·cs.CV·June 1, 2023

TL;DR

CONE is a framework for long video temporal grounding that combines query-guided window selection with a coarse-to-fine alignment strategy, improving both inference efficiency and grounding accuracy.

Contribution

The paper introduces CONE, a plug-and-play framework that sits on top of existing VTG models. It handles long videos with a query-guided sliding window and uses contrastive learning to strengthen multi-modal alignment.
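As a rough illustration of the contrastive learning idea mentioned above, the sketch below computes a generic InfoNCE-style loss for one query: the similarity score of the matching (positive) video window is pushed up relative to the scores of non-matching windows. This is a standard formulation used here as a stand-in, not the loss confirmed by the paper; the function name `info_nce`, the `temperature` default, and the example scores are all assumptions.

```python
import math


def info_nce(scores, positive_index, temperature=0.1):
    """Generic InfoNCE loss for a single query (illustrative assumption).

    scores: query-to-window similarity scores, one per candidate window.
    positive_index: index of the window that actually matches the query.
    """
    logits = [s / temperature for s in scores]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of the positive window under a softmax.
    return log_denominator - logits[positive_index]


# The loss is lower when the positive window already has the highest score.
well_aligned = info_nce([0.9, 0.1, 0.0], positive_index=0)
poorly_aligned = info_nce([0.1, 0.9, 0.0], positive_index=0)
print(well_aligned < poorly_aligned)
```

Minimizing a loss of this shape over many (query, window) pairs encourages the model to rank the relevant window above irrelevant ones, which is the kind of alignment the coarse stage needs.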

Findings

01 Achieves state-of-the-art results on two long VTG benchmarks, Ego4D-NLQ and MAD.

02 Query-guided window selection speeds up inference by 2x on Ego4D-NLQ and 15x on MAD.

03 Improves performance on MAD from 3.13% to 6.87%.

Abstract

This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are in equally high demand but less explored, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both…
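The sliding window and query-guided window selection described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`slide_windows`, `select_windows`), the use of cosine similarity, and scoring a window by its best-matching frame are all hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def slide_windows(num_frames, window, stride):
    """Return (start, end) frame spans that cover the whole video."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window so the tail of the video is not dropped.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def select_windows(frame_feats, query_feat, window, stride, top_k):
    """Keep the top_k windows most similar to the query (assumed scoring rule)."""
    scored = []
    for start, end in slide_windows(len(frame_feats), window, stride):
        # Assumption: a window is as relevant as its best-matching frame.
        score = max(cosine(f, query_feat) for f in frame_feats[start:end])
        scored.append((score, (start, end)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [span for _, span in scored[:top_k]]


# Toy video: only frames 4 and 5 resemble the query.
frames = [[0.0, 1.0]] * 4 + [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 2
query = [1.0, 0.0]
print(select_windows(frames, query, window=4, stride=2, top_k=2))
# prints [(2, 6), (4, 8)]
```

Only the selected windows would then be passed to the more expensive fine-grained grounding model, which is where a speedup of the kind reported above would come from.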

Code & Models

Repositories

houzhijian/cone (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Learning