Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

Himangi Mittal; Nakul Agarwal; Shao-Yuan Lo; Kwonjoon Lee

arXiv:2405.20305·cs.CV·May 31, 2024

TL;DR

This paper presents PlausiVL, a large video-language model that anticipates plausible future action sequences by incorporating novel loss functions and logical constraints, improving diversity and realism in predictions.

Contribution

The work introduces a new plausibility-aware training framework for action anticipation using large video-language models, with novel loss functions and logical constraints.

Findings

1. Improved accuracy on the Ego4D and EPIC-Kitchens-100 datasets.

2. Enhanced diversity and plausibility in generated action sequences.

3. Effective differentiation between plausible and implausible actions.

Abstract

We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real world. While significant efforts have been made towards anticipating future actions, prior approaches do not take the plausibility of an action sequence into account. To address this limitation, we explore the generative capability of a large video-language model and further develop its understanding of plausibility in an action sequence by introducing two objective functions: a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We use temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences, and use them to train the model with the plausible action sequence learning loss. This loss helps the model to differentiate…
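The abstract mentions two kinds of logical constraints for building implausible/counterfactual action sequences. The following is a minimal illustrative sketch of that idea only, not the paper's implementation: the rule tables `TEMPORAL_RULES` and `VALID_NOUNS`, the helper names, and the omelette example are all hypothetical, and actions are assumed to be simple `(verb, noun)` tuples.

```python
import random

# Hypothetical temporal rules: (earlier, later) means `earlier` must precede `later`.
TEMPORAL_RULES = [
    (("crack", "egg"), ("whisk", "egg")),
    (("whisk", "egg"), ("cook", "omelette")),
]

# Hypothetical verb-noun compatibility table (assumption, not from the paper).
VALID_NOUNS = {"crack": {"egg"}, "whisk": {"egg"}, "cook": {"omelette", "egg"}}


def violates_temporal(seq):
    """Return True if any rule's 'later' action appears before its 'earlier' one."""
    # For repeated actions this keeps the last position, which is enough for a sketch.
    pos = {action: i for i, action in enumerate(seq)}
    return any(a in pos and b in pos and pos[a] > pos[b] for a, b in TEMPORAL_RULES)


def temporal_counterfactual(seq):
    """Swap the first rule-bound pair found in order, breaking the temporal rule."""
    seq = list(seq)
    for a, b in TEMPORAL_RULES:
        if a in seq and b in seq:
            i, j = seq.index(a), seq.index(b)
            if i < j:
                seq[i], seq[j] = seq[j], seq[i]
                return seq
    return None  # no applicable rule, so no counterfactual is produced


def verb_noun_counterfactual(seq, nouns, rng):
    """Replace one action's noun with a noun that is incompatible with its verb."""
    seq = list(seq)
    idx = rng.randrange(len(seq))
    verb, _ = seq[idx]
    bad = sorted(n for n in nouns if n not in VALID_NOUNS.get(verb, set()))
    if not bad:
        return None  # every candidate noun is compatible with this verb
    seq[idx] = (verb, rng.choice(bad))
    return seq


plausible = [("crack", "egg"), ("whisk", "egg"), ("cook", "omelette")]
temporal_neg = temporal_counterfactual(plausible)
verb_noun_neg = verb_noun_counterfactual(
    plausible, {"egg", "omelette", "pan"}, random.Random(0)
)
```

In this sketch, `plausible` would play the role of a positive example while `temporal_neg` and `verb_noun_neg` would serve as negatives. How such pairs enter a training objective is not specified in the truncated abstract, so it is left out here.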

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation