From Simple to Professional: A Combinatorial Controllable Image Captioning Agent
Xinran Wang, Muxi Diao, Baoteng Li, Haiwen Zhang, Kongming Liang, and Zhanyu Ma

TL;DR
This paper introduces CapAgent, a controllable image captioning system that transforms simple user instructions into detailed, professional captions by leveraging multimodal models and external tools, enhancing accuracy and user trust.
Contribution
The paper presents a novel controllable captioning agent that integrates multimodal large language models with external tools for precise, context-aware, and transparent caption generation.
Findings
CapAgent effectively transforms simple instructions into detailed captions.
The system demonstrates high adherence to specified guidelines.
Transparency in reasoning improves user trust.
Abstract
The Controllable Image Captioning Agent (CapAgent) is an innovative system designed to bridge the gap between simple user input and professional-level output in image captioning tasks. CapAgent automatically transforms simple user-provided instructions into detailed, professional instructions, enabling precise and context-aware caption generation. By leveraging multimodal large language models (MLLMs) and external tools such as object detection tools and search engines, the system ensures that captions adhere to specified guidelines, including sentiment, keywords, focus, and formatting. CapAgent transparently controls each step of the captioning process and showcases its reasoning and tool usage at every step, fostering user trust and engagement. The project code is available at https://github.com/xin-ran-w/CapAgent.
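To make the idea of guideline adherence and step-by-step transparency concrete, the following is a minimal sketch and not CapAgent's actual implementation. The names `ProfessionalInstruction`, `StepLog`, and `check_keywords_and_length` are hypothetical and do not come from the paper or its repository. The sketch assumes that an expanded instruction can be represented as a small structured record and that two of the guidelines named in the abstract (keywords and formatting, here reduced to a word limit) can be checked with plain string operations; sentiment and focus would need a model or tool and are stored but not checked here.

```python
from dataclasses import dataclass, field


@dataclass
class ProfessionalInstruction:
    """Hypothetical structured form of an expanded user instruction."""
    sentiment: str = "neutral"
    keywords: list[str] = field(default_factory=list)
    focus: str = ""
    max_words: int = 50  # assumed stand-in for a formatting guideline


@dataclass
class StepLog:
    """Records each step so reasoning and checks stay visible to the user."""
    entries: list[str] = field(default_factory=list)

    def add(self, step: str, detail: str) -> None:
        self.entries.append(f"{step}: {detail}")


def check_keywords_and_length(caption: str,
                              instruction: ProfessionalInstruction,
                              log: StepLog) -> dict:
    """Check a caption against the keyword and word-limit guidelines only."""
    lowered = caption.lower()
    # A keyword counts as present if it appears as a substring, ignoring case.
    missing = [k for k in instruction.keywords if k.lower() not in lowered]
    log.add("check_keywords", f"missing={missing}")

    word_count = len(caption.split())
    too_long = word_count > instruction.max_words
    log.add("check_length", f"{word_count} words, limit {instruction.max_words}")

    return {
        "missing_keywords": missing,
        "too_long": too_long,
        "ok": not missing and not too_long,
    }


# Usage: the caption below omits one required keyword.
instruction = ProfessionalInstruction(
    sentiment="positive", keywords=["dog", "beach"], focus="the dog", max_words=12
)
log = StepLog()
report = check_keywords_and_length("A happy dog runs along the shore.", instruction, log)
print(report)
for entry in log.entries:
    print(entry)
```

In a full agent, a failed check like this one would presumably feed back into another generation round, and the log entries are the kind of record that could be shown to the user at each step.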
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
