ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities
Terry Yue Zhuo, Yaqing Liao, Yuecheng Lei, Lizhen Qu, Gerard de Melo, Xiaojun Chang, Yazhou Ren, Zenglin Xu

TL;DR
ViLPAct is a new multimodal benchmark dataset designed to evaluate AI agents' ability to reason about and predict human activities from videos and text. It highlights challenges in compositional generalization.
Contribution
The paper introduces ViLPAct, a multimodal benchmark for human activity planning that comprises a video dataset annotated with intents, a multi-choice question test set, and four baseline models.
Findings
Challenges in compositional generalization identified
Multimodal information fusion remains difficult
Baseline models show room for improvement
Abstract
We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents can reason and forecast future actions of humans based on video clips about their initial activities and intents in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the other ones are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Test · Balanced Selection
