Predicting Motivations of Actions by Leveraging Text
Carl Vondrick, Deniz Oktay, Hamed Pirsiavash, Antonio Torralba

TL;DR
This paper explores predicting the motivations behind human actions in images by combining visual data with knowledge mined from language models. It introduces a new dataset and shows that knowledge transferred from text can improve motivation prediction.
Contribution
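The paper introduces a new dataset of people performing actions annotated with likely motivations, and proposes using natural language models to mine knowledge from large amounts of text, giving vision systems access to some of the experience humans rely on when inferring motivation.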
Findings
Language models help improve motivation prediction in images.
Knowledge transfer from text enhances action understanding.
The dataset enables future research in motivation-aware vision tasks.
Abstract
Understanding human actions is a key problem in computer vision. However, recognizing actions is only the first step of understanding what a person is doing. In this paper, we introduce the problem of predicting why a person has performed an action in images. This problem has many applications in human activity understanding, such as anticipating or explaining an action. To study this problem, we introduce a new dataset of people performing actions annotated with likely motivations. However, the information in an image alone may not be sufficient to automatically solve this task. Since humans can rely on their lifetime of experiences to infer motivation, we propose to give computer vision systems access to some of these experiences by using recently developed natural language models to mine knowledge stored in massive amounts of text. While we are still far away from fully understanding…
