PerspAct: Enhancing LLM Situated Collaboration Skills through Perspective Taking and Active Vision

Sabrina Patania; Luca Annese; Anita Pellegrini; Silvia Serino; Anna Lambiase; Luca Pallonetto; Silvia Rossi; Simone Colombani; Tom Foulsham; Azzurra Ruggeri; Dimitri Ognibene

arXiv:2511.08098 · cs.RO · November 12, 2025

Open Access

TL;DR

This paper introduces PerspAct, a method that combines perspective-taking and active vision to improve large language models' ability to collaborate effectively in multi-agent scenarios, especially in complex environments requiring viewpoint reasoning.

Contribution

The study extends the Director task with active visual exploration and demonstrates that explicit perspective cues and active strategies enhance LLM interpretative accuracy and collaboration.

Findings

01 Active exploration improves interpretative accuracy.

02 Explicit perspective cues enhance collaboration.

03 ReAct-style reasoning benefits multi-agent understanding.
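
To make the interplay of perspective taking, active exploration, and ReAct-style reasoning concrete, here is a minimal toy sketch of a Director-style referent task. It is not the authors' implementation: the `Item` class, the `react_matcher` function, the `inspect` and `select` action names, and the scripted decision rule that stands in for an LLM call are all hypothetical, chosen only for illustration. The matcher must resolve "the smallest ball" from the director's point of view, and it only learns whether the director can see a slot by actively inspecting it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    size: int
    visible_to_director: bool  # unknown to the matcher until it inspects the slot


def react_matcher(target_name, items, max_steps=10):
    """Resolve 'the smallest <target_name>' from the director's point of view.

    A scripted rule replaces the LLM here (an assumption for illustration);
    the loop still follows the thought -> action -> observation pattern.
    """
    director_sees = {}  # slot index -> result of inspecting that slot
    trace = []          # (kind, text) pairs: thought, action, observation
    for _ in range(max_steps):
        unchecked = [i for i, item in enumerate(items)
                     if item.name == target_name and i not in director_sees]
        if unchecked:
            # Active exploration: check one candidate slot per step.
            slot = unchecked[0]
            trace.append(("thought", f"Slot {slot} holds a {target_name}; can the director see it?"))
            trace.append(("action", f"inspect({slot})"))
            director_sees[slot] = items[slot].visible_to_director
            trace.append(("observation", f"director_sees={director_sees[slot]}"))
            continue
        # Perspective taking: keep only candidates the director can see.
        shared = [i for i, seen in director_sees.items() if seen]
        if not shared:
            return None, trace
        choice = min(shared, key=lambda i: items[i].size)
        trace.append(("thought", f"Among mutually visible candidates, slot {choice} is smallest."))
        trace.append(("action", f"select({choice})"))
        return choice, trace
    return None, trace


items = [
    Item("ball", 1, False),  # smallest ball, but hidden from the director
    Item("ball", 2, True),
    Item("cup", 1, True),
    Item("ball", 3, True),
]

# Egocentric baseline: ignore the director's view and take the smallest ball.
egocentric = min((i for i, it in enumerate(items) if it.name == "ball"),
                 key=lambda i: items[i].size)

choice, trace = react_matcher("ball", items)
for kind, text in trace:
    print(f"{kind}: {text}")
```

In this toy layout the egocentric baseline picks slot 0, the ball the director cannot see, while the inspect-then-select loop picks slot 1. The scripted rule is only a placeholder; in an LLM-based agent the thoughts and actions would come from model output rather than fixed code.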

Abstract

Recent advances in Large Language Models (LLMs) and multimodal foundation models have significantly broadened their application in robotics and collaborative systems. However, effective multi-agent interaction necessitates robust perspective-taking capabilities, enabling models to interpret both physical and epistemic viewpoints. Current training paradigms often neglect these interactive contexts, resulting in challenges when models must reason about the subjectivity of individual perspectives or navigate environments with multiple observers. This study evaluates whether explicitly incorporating diverse points of view using the ReAct framework, an approach that integrates reasoning and acting, can enhance an LLM's ability to understand and ground the demands of other agents. We extend the classic Director task by introducing active visual exploration across a suite of seven scenarios of…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human-Automation Interaction and Safety