From Simple to Professional: A Combinatorial Controllable Image Captioning Agent

Xinran Wang; Muxi Diao; Baoteng Li; Haiwen Zhang; Kongming Liang; Zhanyu Ma

arXiv:2412.11025·cs.CV·January 13, 2025

TL;DR

This paper introduces CapAgent, a controllable image captioning system that transforms simple user instructions into detailed, professional captions by leveraging multimodal models and external tools, enhancing accuracy and user trust.

Contribution

The paper presents a novel controllable captioning agent that integrates multimodal large language models with external tools for precise, context-aware, and transparent caption generation.

Findings

01. CapAgent effectively transforms simple instructions into detailed captions.

02. The system demonstrates high adherence to specified guidelines.

03. Transparency in reasoning improves user trust.

Abstract

The Controllable Image Captioning Agent (CapAgent) is a system designed to bridge the gap between simple user input and professional-level output in image captioning tasks. CapAgent automatically transforms simple user-provided instructions into detailed, professional instructions, enabling precise and context-aware caption generation. By leveraging multimodal large language models (MLLMs) and external tools such as an object detection tool and search engines, the system ensures that captions adhere to specified guidelines, including sentiment, keywords, focus, and formatting. CapAgent transparently controls each step of the captioning process and shows its reasoning and tool usage at every step, fostering user trust and engagement. The project code is available at https://github.com/xin-ran-w/CapAgent.
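The instruction-expansion and guideline-adherence ideas in the abstract can be illustrated with a minimal sketch. This is not CapAgent's implementation: the names `CaptionGuidelines`, `expand_instruction`, and `check_caption` are hypothetical, the word limit merely stands in for "formatting", and the MLLM and external tool calls are omitted entirely.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionGuidelines:
    """Hypothetical container for the guideline types named in the abstract."""
    sentiment: str = "neutral"
    keywords: list[str] = field(default_factory=list)
    focus: str = ""
    max_words: int = 50  # assumption: a word limit stands in for "formatting"


def expand_instruction(simple: str, g: CaptionGuidelines) -> str:
    """Turn a short user request into a more detailed instruction string."""
    parts = [simple.strip().rstrip(".") + "."]
    parts.append(f"Use a {g.sentiment} tone.")
    if g.keywords:
        parts.append("Include the keywords: " + ", ".join(g.keywords) + ".")
    if g.focus:
        parts.append(f"Focus on {g.focus}.")
    parts.append(f"Keep the caption under {g.max_words} words.")
    return " ".join(parts)


def check_caption(caption: str, g: CaptionGuidelines) -> list[str]:
    """Return guideline violations; an empty list means the caption passes."""
    problems = []
    lowered = caption.lower()
    for kw in g.keywords:
        # Simple case-insensitive substring match, not real keyword detection.
        if kw.lower() not in lowered:
            problems.append(f"missing keyword: {kw}")
    if len(caption.split()) > g.max_words:
        problems.append("too long")
    return problems
```

In a real agent, the expanded instruction would be passed to a captioning model and the check would feed back into a revision loop; here both functions are plain string operations so the control flow is easy to inspect.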

Code & Models

Repositories

xin-ran-w/capagent
none · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques