Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

Utsav Panchal; Yuchen Liu; Luigi Palmieri; Ilche Georgievski; Marco Aiello

arXiv:2512.15957·cs.CV·December 19, 2025

Open Access

TL;DR

This paper introduces CAMP-VLM, a vision-language-model-based framework that predicts the behaviors of multiple humans from a third-person perspective by combining visual context with scene graphs. It reports accuracy gains of up to 66.9% over baselines.

Contribution

The paper presents a novel framework combining visual and spatial features for multi-human behavior prediction from third-person views, with effective fine-tuning on synthetic data.

Findings

01. CAMP-VLM outperforms baselines by up to 66.9% in accuracy.

02. Synthetic data fine-tuning enables generalization to real-world scenarios.

03. Integration of scene graphs improves behavior prediction accuracy.

Abstract

Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of humans-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to…
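The abstract does not say how the scene graph reaches the model. One plausible approach, assumed here purely for illustration and not taken from the paper, is to flatten the graph into subject-predicate-object lines and place them in the text prompt next to the image input. The `Relation` class, the function names, and the entity labels below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One scene-graph edge (hypothetical schema, not from the paper)."""
    subject: str
    predicate: str
    obj: str


def scene_graph_to_text(relations):
    """Flatten scene-graph edges into one 'subject predicate object' line each."""
    return "\n".join(f"{r.subject} {r.predicate} {r.obj}" for r in relations)


def build_prompt(relations, question):
    """Compose the text part of a VLM query; images would be passed separately."""
    return (
        "Scene graph:\n"
        f"{scene_graph_to_text(relations)}\n\n"
        f"Question: {question}"
    )


# Illustrative entities only; not data from the paper.
relations = [
    Relation("human_1", "near", "sofa"),
    Relation("human_2", "holding", "cup"),
]
prompt = build_prompt(relations, "What will each human do next?")
print(prompt)
```

Plain text keeps the sketch independent of any particular VLM API; a real system might instead use structured tokens or a dedicated graph encoder.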

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition