Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models
Utsav Panchal, Yuchen Liu, Luigi Palmieri, Ilche Georgievski, Marco Aiello

TL;DR
This paper introduces CAMP-VLM, a vision-language model that predicts multiple human behaviors from a third-person perspective by integrating visual context and scene graphs, demonstrating significant accuracy improvements.
Contribution
The paper presents a novel framework combining visual and spatial features for multi-human behavior prediction from third-person views, with effective fine-tuning on synthetic data.
Findings
CAMP-VLM outperforms baselines by up to 66.9% in accuracy.
Synthetic data fine-tuning enables generalization to real-world scenarios.
Integration of scene graphs improves behavior prediction accuracy.
Abstract
Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of humans-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition
