Video-adverb retrieval with compositional adverb-action embeddings
Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

TL;DR
This paper introduces a framework for retrieving the adverbs that describe actions in videos. It aligns video embeddings with compositional adverb-action text embeddings in a joint embedding space, achieves state-of-the-art results, and generalizes to unseen adverb-action compositions.
Contribution
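The paper presents a joint embedding approach for video-adverb retrieval that composes adverb-action text embeddings with a residual gating mechanism and trains them with an objective combining triplet losses and a regression target. It also introduces benchmarks for unseen adverb-action compositions.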
Findings
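Achieves state-of-the-art performance on five recent video-adverb retrieval benchmarks.
Outperforms prior work at retrieving adverbs from videos for unseen adverb-action compositions.
Introduces dataset splits, built on subsets of MSR-VTT Adverbs and ActivityNet Adverbs, for benchmarking retrieval on unseen adverb-action compositions.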
Abstract
Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism, along with a novel training objective consisting of triplet losses and a regression target. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. Furthermore, we introduce dataset splits to benchmark video-adverb retrieval for unseen adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet Adverbs datasets. Our proposed framework outperforms all prior works for the generalisation task of retrieving adverbs from videos for…
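The alignment idea in the abstract can be illustrated with a small sketch of a triplet loss and nearest-neighbour retrieval in a joint embedding space. This is only an illustration under stated assumptions: the cosine distance, the margin value, and the function names `cosine_distance`, `triplet_loss`, and `retrieve` are hypothetical choices, not details taken from the paper, and the regression target mentioned in the abstract is omitted because the text here does not specify it.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance between two 1-D vectors (0 = same direction)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)


def triplet_loss(video, pos_text, neg_text, margin=0.5):
    """Hinge-style triplet loss (margin is an assumed value).

    Pulls the video embedding towards its matching adverb-action text
    embedding and pushes it away from a non-matching one.
    """
    d_pos = cosine_distance(video, pos_text)
    d_neg = cosine_distance(video, neg_text)
    return max(0.0, d_pos - d_neg + margin)


def retrieve(video, text_embeddings):
    """Return the key of the text embedding closest to the video embedding."""
    return min(text_embeddings,
               key=lambda name: cosine_distance(video, text_embeddings[name]))


if __name__ == "__main__":
    video = np.array([1.0, 0.0])
    candidates = {
        "cut quickly": np.array([0.9, 0.1]),
        "cut slowly": np.array([-1.0, 0.2]),
    }
    print(retrieve(video, candidates))
    print(triplet_loss(video, candidates["cut quickly"], candidates["cut slowly"]))
```

At retrieval time, the same distance ranks candidate adverb-action compositions for a query video; swapping the roles of video and text gives the adverb-to-video direction.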
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Residual gating mechanism to compose adverb-action representations
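
To make the residual gating idea named above more concrete, here is a minimal NumPy sketch of one common way to compose two embeddings with a gate and a residual term. The exact formulation used in the paper is not given in this summary, so everything here is an assumption for illustration: the sigmoid gate on the action embedding, the tanh residual branch, the final L2 normalization, and the names `compose_residual_gating`, `W_gate`, and `W_res` are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def l2_normalize(x, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    return x / (np.linalg.norm(x) + eps)


def compose_residual_gating(action, adverb, W_gate, W_res):
    """Hypothetical residual-gating composition of an action and an adverb.

    action, adverb: embeddings of shape (d,)
    W_gate, W_res:  weight matrices of shape (2d, d) (assumed learnable)

    The gate decides how much of the action embedding to keep; the
    residual branch adds an adverb-dependent modification on top.
    """
    pair = np.concatenate([action, adverb])   # shape (2d,)
    gate = sigmoid(pair @ W_gate)             # values in (0, 1), shape (d,)
    residual = np.tanh(pair @ W_res)          # shape (d,)
    return l2_normalize(gate * action + residual)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    action = rng.normal(size=d)               # e.g. an embedding for "cut"
    adverb = rng.normal(size=d)               # e.g. an embedding for "quickly"
    W_gate = rng.normal(scale=0.1, size=(2 * d, d))
    W_res = rng.normal(scale=0.1, size=(2 * d, d))
    composed = compose_residual_gating(action, adverb, W_gate, W_res)
    print(composed.shape, float(np.linalg.norm(composed)))
```

One property of this design is that when the residual branch outputs zero, the composition reduces to a rescaled action embedding, so the adverb acts as a modification of the action rather than replacing it.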
