Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, Kwonjoon Lee

TL;DR
This paper presents PlausiVL, a large video-language model that anticipates plausible future action sequences by incorporating novel loss functions and logical constraints, improving diversity and realism in predictions.
Contribution
The work introduces a plausibility-aware training framework for action anticipation with large video-language models. It is built on two objectives: a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss.
Findings
Improved accuracy on Ego4D and EPIC-Kitchens-100 datasets.
Enhanced diversity and plausibility in generated action sequences.
Effective differentiation between plausible and implausible actions.
Abstract
We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real world. While significant efforts have been made towards anticipating future actions, prior approaches do not take into account the plausibility of an action sequence. To address this limitation, we explore the generative capability of a large video-language model and further develop its understanding of plausibility in an action sequence by introducing two objective functions: a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We utilize temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences and use them to train the model with the plausible action sequence learning loss. This loss helps the model to differentiate…
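The abstract describes building implausible (counterfactual) sequences from temporal constraints and verb-noun constraints. The following is a minimal sketch of that idea, not the authors' implementation: the rule tables `TEMPORAL_RULES` and `VALID_VERBS`, the helper names, and the cooking actions are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical temporal constraints: (earlier, later) means `earlier`
# must occur before `later` for a sequence to count as plausible.
TEMPORAL_RULES = [
    (("crack", "egg"), ("whisk", "egg")),
    (("whisk", "egg"), ("pour", "egg")),
]

# Hypothetical verb-noun compatibility table: verbs allowed for each noun.
VALID_VERBS = {
    "egg": {"crack", "whisk", "pour"},
    "pan": {"take", "heat"},
}


def violates_temporal(seq):
    """True if any rule-bound pair appears in the wrong order."""
    pos = {action: i for i, action in enumerate(seq)}
    return any(
        a in pos and b in pos and pos[a] > pos[b] for a, b in TEMPORAL_RULES
    )


def violates_verb_noun(seq):
    """True if any (verb, noun) pair is not in the compatibility table."""
    return any(verb not in VALID_VERBS.get(noun, set()) for verb, noun in seq)


def temporal_counterfactual(seq):
    """Swap the first correctly ordered rule-bound pair to break its order."""
    seq = list(seq)
    for a, b in TEMPORAL_RULES:
        if a in seq and b in seq and seq.index(a) < seq.index(b):
            i, j = seq.index(a), seq.index(b)
            seq[i], seq[j] = seq[j], seq[i]
            return seq
    return None  # no applicable rule, so no counterfactual is produced


def verb_noun_counterfactual(seq, rng):
    """Replace one action's verb with a verb that is invalid for its noun."""
    seq = list(seq)
    all_verbs = sorted(set().union(*VALID_VERBS.values()))
    idx = rng.randrange(len(seq))
    _, noun = seq[idx]
    invalid = [v for v in all_verbs if v not in VALID_VERBS.get(noun, set())]
    if not invalid:
        return None
    seq[idx] = (rng.choice(invalid), noun)
    return seq


plausible = [("take", "pan"), ("crack", "egg"), ("whisk", "egg"), ("pour", "egg")]
rng = random.Random(0)
bad_order = temporal_counterfactual(plausible)
bad_pair = verb_noun_counterfactual(plausible, rng)
```

In a training setup along the lines the abstract describes, pairs such as `(plausible, bad_order)` and `(plausible, bad_pair)` would serve as positive and negative examples for a loss that separates plausible from implausible sequences. The exact form of that loss is not given in the text above, so it is not sketched here.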
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation
