ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities

Terry Yue Zhuo; Yaqing Liao; Yuecheng Lei; Lizhen Qu; Gerard de Melo; Xiaojun Chang; Yazhou Ren; Zenglin Xu

arXiv:2210.05556·cs.CV·March 10, 2023

TL;DR

ViLPAct is a new multimodal benchmark for evaluating how well AI agents reason about and predict human activities from videos and text. Its experiments highlight compositional generalization as a key challenge.

Contribution

The paper introduces ViLPAct, a vision-language benchmark for human activity planning. It comprises 2.9k Charades videos annotated with intents, a multi-choice question test set, and four baseline models.

Findings

01 Challenges in compositional generalization identified

02 Multimodal information fusion remains difficult

03 Baseline models show room for improvement

Abstract

We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task in which embodied AI agents reason about and forecast the future actions of humans, based on video clips of their initial activities and textual descriptions of their intents. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the others are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.
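To make the multi-choice setup concrete, here is a minimal sketch of how accuracy on such a test set might be computed. It is an illustration only, not the paper's evaluation code: the `MultiChoiceItem` fields, the `accuracy` helper, and the word-overlap scorer are all hypothetical names and assumptions, and a real model would score candidates from the video as well as the intent text.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class MultiChoiceItem:
    """One hypothetical test item (field names are assumptions)."""
    video_id: str            # identifier of the observed clip
    intent: str              # intent described in text
    choices: Sequence[str]   # candidate future action sequences
    answer: int              # index of the correct choice


def accuracy(items: Sequence[MultiChoiceItem],
             score: Callable[[MultiChoiceItem, str], float]) -> float:
    """Fraction of items whose highest-scoring choice is the correct one."""
    if not items:
        return 0.0
    correct = 0
    for item in items:
        scores = [score(item, choice) for choice in item.choices]
        # Ties are broken in favor of the earliest choice.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == item.answer)
    return correct / len(items)


def overlap_scorer(item: MultiChoiceItem, choice: str) -> float:
    """Toy text-only scorer: counts words shared by intent and choice."""
    return float(len(set(item.intent.split()) & set(choice.split())))


# Made-up items for illustration; they are not drawn from the dataset.
toy_items = [
    MultiChoiceItem("v1", "make coffee",
                    ["pour coffee into cup", "open the window"], 0),
    MultiChoiceItem("v2", "read a book",
                    ["turn on the tv", "open a book"], 1),
]

if __name__ == "__main__":
    print(accuracy(toy_items, overlap_scorer))
```

Passing the scorer as a function keeps the metric independent of any particular model, so a text-only, video-only, or fused scorer could be compared under the same loop.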

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Test · Balanced Selection