PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding
Lirong Che, Zhenfeng Gan, Yanbo Chen, Junbo Tan, Xueqian Wang

TL;DR
PhotoAgent is a robotic photographer that combines multimodal reasoning with internal simulation to efficiently produce high-quality images that meet aesthetic and spatial goals.
Contribution
It introduces a novel control paradigm integrating LMM reasoning with internal 3D simulation for creative robotic photography.
Findings
PhotoAgent outperforms baseline methods in spatial reasoning.
It produces images of higher aesthetic quality.
The internal world model accelerates the photography process.
Abstract
Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven, chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image…
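The two-stage idea in the abstract (solve a geometric constraint for an initial viewpoint, then refine it against a simulated view) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the paper's constraints, solver, or scoring. The sketch assumes a pinhole camera and a single hypothetical constraint ("the subject fills a given fraction of the frame height"), and it substitutes a plain numeric `score_fn` for the LMM's visual reflection over 3DGS renders. The names `initial_camera_distance` and `refine_pose` are hypothetical.

```python
import math


def initial_camera_distance(subject_height_m, frame_fraction, vertical_fov_deg):
    """Closed-form distance at which the subject fills `frame_fraction`
    of the image height, assuming a pinhole camera (illustrative constraint)."""
    half_fov = math.radians(vertical_fov_deg) / 2.0
    # Visible scene height at distance d is 2 * d * tan(half_fov).
    # Solve subject_height / visible_height = frame_fraction for d.
    return subject_height_m / (2.0 * frame_fraction * math.tan(half_fov))


def refine_pose(pose, score_fn, step=0.2, iters=20):
    """Greedy coordinate search standing in for simulated refinement:
    perturb each pose coordinate, keep improvements, halve the step
    when no perturbation helps. `score_fn` is a stand-in for scoring
    a rendered view; it is NOT the paper's scoring procedure."""
    best = tuple(pose)
    best_score = score_fn(best)
    for _ in range(iters):
        improved = False
        for axis in range(len(best)):
            for delta in (-step, step):
                cand = list(best)
                cand[axis] += delta
                cand = tuple(cand)
                cand_score = score_fn(cand)
                if cand_score > best_score:
                    best, best_score, improved = cand, cand_score, True
        if not improved:
            step /= 2.0
    return best, best_score


# Toy usage: a 1.7 m subject filling half the frame with a 90-degree vertical FOV.
d0 = initial_camera_distance(1.7, 0.5, 90.0)

# Toy score that peaks at pose (1.0, 2.0); a real system would score rendered images.
toy_score = lambda p: -((p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2)
refined, refined_score = refine_pose((0.0, 0.0), toy_score)
```

The point of the split is that the closed-form step gives a reasonable starting pose cheaply, so the search only has to make small local corrections, and every candidate is evaluated in simulation with no physical robot motion.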
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
