TL;DR
This paper introduces a novel weakly supervised learning approach to recognize and understand adverbs in instructional videos by modeling their effects as transformations in an embedding space, improving video-to-adverb retrieval.
Contribution
The paper proposes a method that learns adverb representations from weakly supervised videos, using attention and transformations of an embedding space. According to the authors, no prior work addresses weakly supervised learning from adverbs.
Findings
Achieved 0.719 mAP in video-to-adverb retrieval.
Demonstrated the ability to attend to relevant video parts for adverb recognition.
Outperformed all baseline methods in the task.
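For readers unfamiliar with the reported metric, here is a minimal sketch of mean average precision (mAP) under its standard retrieval definition. This summary does not say how the paper computes the 0.719 figure, so the functions `average_precision` and `mean_average_precision` and the toy rankings are illustrative assumptions, not the authors' evaluation code.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans, best-ranked item first,
    True where the retrieved item is a correct match.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each correct item.
            precision_sum += hits / rank
    # Convention assumed here: a query with no correct items scores 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


# Hypothetical toy data: two queries with their ranked relevance labels.
toy_rankings = [
    [True, False, True],  # correct items at ranks 1 and 3
    [False, True],        # correct item at rank 2
]
toy_map = mean_average_precision(toy_rankings)
```

In a video-to-adverb setting, each query would be a video and the ranked list would order candidate adverbs by their distance to the video in the embedding space.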
Abstract
We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations. Key to our method is the fact that the visual representation of the adverb is highly dependent on the action to which it applies, although the same adverb will modify multiple actions in a similar way. For instance, while 'spread quickly' and 'mix quickly' will look dissimilar, we can learn a common representation that allows us to recognize both, among other actions. We formulate this as an embedding problem, and use scaled dot-product attention to learn from weakly-supervised video narrations. We jointly learn adverbs as invertible transformations operating on the embedding space, so as to add or remove the effect of the adverb. As there is no prior work on weakly supervised learning from adverbs, we gather paired action-adverb annotations from a…
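The two mechanisms named in the abstract can be illustrated with a small sketch: scaled dot-product attention that pools video segments using an action embedding as the query, and an adverb modeled as an invertible linear map on the embedding. Everything concrete here is an assumption made for illustration: the 2-dimensional toy vectors, the helper names, and the fixed matrix `quickly`. The paper learns these quantities, and its exact parameterization is not given in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, keys, values):
    """Pool `values` with weights softmax(query . key / sqrt(d))."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    pooled = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return pooled, weights


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def invert_2x2(matrix):
    """Closed-form inverse of a 2x2 matrix; raises if it is singular."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical toy inputs: one action embedding and three segment features.
action = [1.0, 0.0]
segments = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Attention: segments that align with the action receive more weight.
pooled, weights = scaled_dot_product_attention(action, segments, segments)

# Adverb as an invertible linear map (assumed fixed here, learned in practice).
quickly = [[1.0, 0.0], [0.5, 1.0]]
modified = matvec(quickly, action)                 # add the adverb's effect
recovered = matvec(invert_2x2(quickly), modified)  # remove it again
```

Because the map is invertible, applying its inverse to the modified embedding recovers the original action embedding, which is the "add or remove the effect of the adverb" property the abstract describes.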
Peer Reviews
No public reviews on file for this paper yet.
Videos
Action Modifiers: Learning From Adverbs in Instructional Videos · YouTube
Taxonomy
Methods: Softmax · Attention Is All You Need
