PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding

Lirong Che; Zhenfeng Gan; Yanbo Chen; Junbo Tan; Xueqian Wang

arXiv:2603.22796·cs.CV·March 25, 2026


Open Access

TL;DR

PhotoAgent is a robotic photographer that combines multimodal reasoning with internal simulation to efficiently capture high-quality images that satisfy aesthetic and spatial goals.

Contribution

PhotoAgent introduces a control paradigm that integrates Large Multimodal Model (LMM) reasoning with internal 3D simulation for creative robotic photography.

Findings

01. PhotoAgent outperforms baseline methods in spatial reasoning.

02. It produces images of higher aesthetic quality.

03. Its internal world model speeds up the photography process by replacing physical trial-and-error with simulation.

Abstract

Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image…
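
The two-stage idea in the abstract (solve for an initial viewpoint from geometric constraints, then refine it against a score computed in simulation) can be illustrated with a small Python sketch. This is not the authors' implementation: the spherical parameterization (`distance`, `azimuth_deg`, `elevation_deg`), the greedy search in `refine`, and the `toy_score` function are all hypothetical stand-ins. In PhotoAgent the constraints would come from LMM reasoning and the score from visual reflection on 3DGS renders, neither of which is modeled here.

```python
import math

def initial_viewpoint(subject, distance, azimuth_deg, elevation_deg):
    """Place the camera on a sphere around the subject (hypothetical constraint form).

    Returns (position, forward), where forward is a unit vector pointing
    from the camera toward the subject.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    offset = (
        distance * math.cos(el) * math.cos(az),
        distance * math.cos(el) * math.sin(az),
        distance * math.sin(el),
    )
    position = tuple(s + o for s, o in zip(subject, offset))
    forward = tuple(-o / distance for o in offset)
    return position, forward

def refine(params, score, step=5.0, iters=20):
    """Greedy coordinate search over azimuth and elevation.

    `score` is any callable mapping a parameter dict to a number (higher is
    better). It stands in for evaluating a rendered view in simulation.
    """
    best = dict(params)
    best_score = score(best)
    for _ in range(iters):
        improved = False
        for key in ("azimuth_deg", "elevation_deg"):
            for delta in (-step, step):
                cand = dict(best)
                cand[key] += delta
                s = score(cand)
                if s > best_score:
                    best, best_score, improved = cand, s, True
        if not improved:
            step /= 2  # shrink the step once no neighbor improves the score
    return best, best_score

def toy_score(p):
    """Placeholder score peaked at azimuth 30 and elevation 15 degrees (assumption)."""
    return -((p["azimuth_deg"] - 30.0) ** 2) - (p["elevation_deg"] - 15.0) ** 2

if __name__ == "__main__":
    start = {"azimuth_deg": 0.0, "elevation_deg": 0.0}
    best, best_score = refine(start, toy_score)
    position, forward = initial_viewpoint((0.0, 0.0, 0.0), 2.0, **best)
    print(best, position, forward)
```

The point of the sketch is the division of labor: a closed-form step gives a reasonable starting pose cheaply, and every refinement step queries a scoring function rather than moving a physical robot.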

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis