VRAG: Region Attention Graphs for Content-Based Video Retrieval
Kennard Ng, Ser-Nam Lim, Gim Hee Lee

TL;DR
VRAG is a region attention graph network for content-based video retrieval. It captures spatio-temporal relations at the region level and improves the state of the art among efficient video-level methods.
Contribution
The paper presents VRAG, a novel region-level graph network that models semantic relations in videos, improving retrieval accuracy over existing video-level methods.
Findings
Sets a new state of the art among video-level retrieval methods
Shot-level VRAG outperforms other video-level methods
Shot-level embeddings narrow the performance gap to frame-level methods while running faster
Abstract
Content-based Video Retrieval (CBVR) is used on media-sharing platforms for applications such as video recommendation and filtering. To manage databases that scale to billions of videos, video-level approaches that use fixed-size embeddings are preferred due to their efficiency. In this paper, we introduce Video Region Attention Graph Networks (VRAG) that improve the state of the art of video-level methods. We represent videos at a finer granularity via region-level features and encode video spatio-temporal dynamics through region-level relations. Our VRAG captures the relationships between regions based on their semantic content via self-attention and the permutation-invariant aggregation of Graph Convolution. In addition, we show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video…
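The combination described in the abstract (self-attention to relate regions, followed by a permutation-invariant graph-convolution aggregation into a fixed-size embedding) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the single-layer design, the ReLU, the mean pooling, the random weights, and all dimensions are assumptions chosen only to show the idea.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def region_attention_graph_layer(regions, w_query, w_key, w_value):
    """One hypothetical attention-graph layer over region features.

    regions: (num_regions, dim) array of region-level features.
    Self-attention scores act as soft edge weights between regions;
    each region then aggregates its neighbours (graph convolution).
    """
    queries = regions @ w_query
    keys = regions @ w_key
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    adjacency = softmax(scores, axis=-1)       # each row sums to 1
    aggregated = adjacency @ regions           # weighted neighbour sum
    return np.maximum(aggregated @ w_value, 0.0)  # linear map + ReLU


def video_embedding(regions, w_query, w_key, w_value):
    """Pool region outputs into one fixed-size, L2-normalised vector."""
    hidden = region_attention_graph_layer(regions, w_query, w_key, w_value)
    pooled = hidden.mean(axis=0)               # order-independent pooling
    return pooled / (np.linalg.norm(pooled) + 1e-12)


# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
num_regions, dim, out_dim = 12, 8, 16
regions = rng.normal(size=(num_regions, dim))
w_query = rng.normal(size=(dim, dim))
w_key = rng.normal(size=(dim, dim))
w_value = rng.normal(size=(dim, out_dim))

embedding = video_embedding(regions, w_query, w_key, w_value)

# Shuffling the regions should leave the embedding unchanged.
order = rng.permutation(num_regions)
shuffled_embedding = video_embedding(regions[order], w_query, w_key, w_value)
```

Because the attention weights are computed from region content rather than region position, and the final pooling is a mean, reordering the input regions does not change the resulting embedding in this sketch. The embedding size depends only on `out_dim`, not on how many regions the video has, which is what makes fixed-size video-level comparison cheap.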
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Convolution
