Video-adverb retrieval with compositional adverb-action embeddings

Thomas Hummel; Otniel-Bogdan Mercea; A. Sophia Koepke; Zeynep Akata

arXiv:2309.15086·cs.CV·September 27, 2023

Open Access · 1 Repo

TL;DR

This paper introduces a novel framework for retrieving adverbs describing actions in videos by aligning video and compositional adverb-action embeddings, achieving state-of-the-art results and generalizing to unseen compositions.

Contribution

The paper presents a new joint embedding approach with residual gating and a specialized training objective for video-adverb retrieval, including benchmarks for unseen adverb-action pairs.

Findings

01

Achieves state-of-the-art performance on five recent video-adverb retrieval benchmarks.

02

Outperforms prior methods in generalizing to unseen adverb-action compositions.

03

Introduces dataset splits, built on subsets of MSR-VTT Adverbs and ActivityNet Adverbs, for evaluating retrieval on unseen adverb-action compositions.

Abstract

Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism, along with a novel training objective consisting of triplet losses and a regression target. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. Furthermore, we introduce dataset splits to benchmark video-adverb retrieval for unseen adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet Adverbs datasets. Our proposed framework outperforms all prior works for the generalisation task of retrieving adverbs from videos for…
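The abstract mentions triplet losses in a joint embedding space but this page does not give their exact form. The following is a minimal sketch of a generic triplet loss and nearest-neighbour adverb retrieval, assuming cosine distance and a margin of 0.5; the function names, the distance, and the margin are illustrative assumptions, not the authors' implementation, and the regression target is not modelled here.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(video, positive_text, negative_text, margin=0.5):
    """Generic triplet loss (assumed form): pull the video embedding towards
    its matching adverb-action text embedding and push it away from a
    mismatched one by at least `margin`."""
    d_pos = cosine_distance(video, positive_text)
    d_neg = cosine_distance(video, negative_text)
    return max(0.0, d_pos - d_neg + margin)

def retrieve_adverb(video, candidates):
    """Return the adverb whose compositional embedding is closest to the video.
    `candidates` maps adverb strings to hypothetical embeddings of the
    (adverb, action) composition for the video's action."""
    return min(candidates, key=lambda adv: cosine_distance(video, candidates[adv]))
```

With this setup, video-to-adverb retrieval reduces to a nearest-neighbour search over the candidate adverb-action embeddings for a given action.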

Code & Models

Repositories

ExplainableML/ReGaDa
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Residual gating mechanism to compose adverb-action representations
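To illustrate what a residual gating composition can look like, here is a minimal sketch of one common form: the action embedding is scaled element-wise by a sigmoid gate and a residual term is added, with both computed from the concatenated action and adverb embeddings. The single linear layers, the weight names `W_gate` and `W_res`, and the exact combination rule are assumptions for illustration; the paper's actual architecture is not specified on this page.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def residual_gating(action, adverb, W_gate, W_res):
    """Compose an adverb-action embedding (assumed form).

    action, adverb: embeddings of dimension d (lists of floats).
    W_gate, W_res:  hypothetical d x 2d weight matrices.
    Returns gate * action + residual, computed element-wise.
    """
    joint = action + adverb  # list concatenation, length 2d
    gate = [sigmoid(z) for z in matvec(W_gate, joint)]
    residual = matvec(W_res, joint)
    return [g * a + r for g, a, r in zip(gate, action, residual)]
```

The gate lets the adverb modulate how much of the original action embedding is kept, while the residual branch adds adverb-specific changes on top of it.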