TL;DR
This paper presents a method for zero-shot activity recognition by modeling and inferring action verb attributes from language, enabling the recognition of unseen activities based on their linguistic descriptions.
Contribution
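The study introduces an approach that learns to infer action attributes from dictionary definitions and distributed word representations, in contrast to much prior work that assumes gold-standard attributes for zero-shot classes and focuses primarily on object attributes.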
Findings
Action attributes inferred from dictionary definitions and distributed word representations provide a predictive signal for zero-shot prediction.
These attributes serve as the internal mapping between visual and textual representations, which lets the model reason about previously unseen activities.
Abstract
In this paper, we investigate large-scale zero-shot activity recognition by modeling the visual and linguistic attributes of action verbs. For example, the verb "salute" has several properties, such as being a light movement, a social act, and short in duration. We use these attributes as the internal mapping between visual and textual representations to reason about a previously unseen action. In contrast to much prior work that assumes access to gold standard attributes for zero-shot classes and focuses primarily on object attributes, our model uniquely learns to infer action attributes from dictionary definitions and distributed word representations. Experimental results confirm that action attributes inferred from language can provide a predictive signal for zero-shot prediction of previously unseen activities.
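The attribute-as-bridge idea in the abstract can be illustrated with a minimal sketch. This is not the paper's model: the attribute names, the verb-to-attribute table, and the Bernoulli scoring rule are all assumptions made for illustration. The sketch assumes a visual model has already produced per-attribute probabilities for a clip, and that binary attribute vectors for unseen verbs have already been inferred from language.

```python
import numpy as np

# Hypothetical attribute names, loosely based on the "salute" example.
ATTRIBUTES = ["light_movement", "social_act", "short_duration"]

# Hypothetical attribute vectors for unseen verbs (1.0 = verb has the attribute).
# In the paper these would be inferred from language; here they are hand-written.
UNSEEN_VERBS = {
    "salute": np.array([1.0, 1.0, 1.0]),
    "chat": np.array([1.0, 1.0, 0.0]),
    "dig": np.array([0.0, 0.0, 0.0]),
}


def zero_shot_predict(visual_attr_probs, verb_attrs):
    """Pick the unseen verb whose attributes best match the predicted ones.

    Scores each verb by the log-likelihood of its attribute vector under
    independent Bernoulli distributions given by visual_attr_probs
    (an assumed scoring rule, chosen for simplicity).
    """
    # Clip to avoid log(0) when a probability is exactly 0 or 1.
    p = np.clip(visual_attr_probs, 1e-6, 1 - 1e-6)
    scores = {
        verb: float(np.sum(target * np.log(p) + (1 - target) * np.log(1 - p)))
        for verb, target in verb_attrs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Example: a clip the (hypothetical) visual model judges light, social, and short.
clip_probs = np.array([0.9, 0.8, 0.7])
best_verb, all_scores = zero_shot_predict(clip_probs, UNSEEN_VERBS)
print(best_verb)
```

Because the verb labels never appear in the visual model's training targets in this setup, a new verb can be added by supplying only its attribute vector.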
