ANSEL Photobot: A Robot Event Photographer with Semantic Intelligence
Dmitriy Rivkin, Gregory Dudek, Nikhil Kakodkar, David Meger, Oliver Limoyo, Xue Liu, Francois Hogan

TL;DR
This paper presents ANSEL Photobot, a robot photographer that uses language and vision models to semantically understand events and capture relevant photos, outperforming existing methods in human evaluations.
Contribution
It introduces a novel approach combining language and vision models for semantic awareness in robotic photography, enabling event-specific photo documentation.
Findings
Human evaluators consistently rate the generated photo portfolios as more appropriate to the event than those produced by existing methods.
The method leverages recent advances in language and vision-language models.
The approach improves the semantic relevance of the captured photos.
Abstract
Our work examines how large language models can be used for robotic planning and sampling in the context of automated photographic documentation. Specifically, we illustrate how to produce a photo-taking robot with an exceptional level of semantic awareness by leveraging recent advances in general-purpose language models (LMs) and vision-language models (VLMs). Given a high-level description of an event, we use an LM to generate a natural-language list of photo descriptions that one would expect a photographer to capture at the event. We then use a VLM to identify the best matches to these descriptions in the robot's video stream. The photo portfolios generated by our method are consistently rated as more appropriate to the event by human evaluators than those generated by existing methods.
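The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that photo descriptions and video frames have already been embedded into a shared vector space by some vision-language model, and that matches are ranked by cosine similarity. Both are assumptions made for illustration; the abstract does not state the scoring function. The function names, example descriptions, and toy vectors below are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_frame_per_description(
    description_embeddings: Dict[str, Sequence[float]],
    frame_embeddings: List[Sequence[float]],
) -> Dict[str, Tuple[int, float]]:
    """For each description, return (index, score) of the most similar frame."""
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    portfolio = {}
    for text, text_vec in description_embeddings.items():
        scores = [cosine(text_vec, frame_vec) for frame_vec in frame_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        portfolio[text] = (best, scores[best])
    return portfolio


# Hypothetical descriptions, standing in for an LM's output for a wedding event.
# The 3-d vectors are toy stand-ins for real VLM text embeddings.
descriptions = {
    "couple cutting the cake": [1.0, 0.0, 0.0],
    "guests dancing": [0.0, 1.0, 0.0],
}

# Toy stand-ins for VLM embeddings of three frames from the robot's video stream.
frames = [
    [0.1, 0.9, 0.0],  # frame 0
    [0.9, 0.1, 0.1],  # frame 1
    [0.0, 0.2, 0.9],  # frame 2
]

portfolio = best_frame_per_description(descriptions, frames)
for text, (index, score) in portfolio.items():
    print(f"{text!r} -> frame {index} (similarity {score:.3f})")
```

In this toy setup the cake description selects frame 1 and the dancing description selects frame 0. A real system would replace the hand-written vectors with embeddings from an actual vision-language model, and may need to handle cases where no frame matches a description well.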
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
