AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Yiming Ren; Zhiqiang Lin; Yu Li; Gao Meng; Weiyun Wang; Junjie Wang; Zicheng Lin; Jifeng Dai; Yujiu Yang; Wenhai Wang; and Ruihang Chu

arXiv:2507.12841·cs.CV·October 29, 2025

Open Access

TL;DR

The AnyCap Project introduces a comprehensive framework, dataset, and benchmark for controllable omni-modal captioning, significantly improving caption quality and controllability across various models and modalities.

Contribution

The paper presents a lightweight plug-and-play framework (AnyCapModel, ACM), a large-scale dataset (AnyCapDataset, ACD), and an evaluation benchmark (AnyCapEval) for controllable captioning.

Findings

01. ACM improves caption quality across multiple models.
02. ACM-8B boosts GPT-4o's content scores by 45%.
03. The dataset covers three modalities and 28 instruction types.

Abstract

Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable…
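
The data flow the abstract describes for ACM (a frozen base model produces a caption, and a separate component combines that caption with the user instruction and modality features) can be illustrated with a small sketch. This is not the authors' implementation: the function `refine_caption`, its parameters, and the toy stand-ins below are all hypothetical names chosen for illustration, and passing the instruction to the base captioner is an assumption.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def refine_caption(
    media: bytes,
    instruction: str,
    base_captioner: Callable[[bytes, str], str],
    encoder: Callable[[bytes], Features],
    refiner: Callable[[str, Features, str], str],
) -> str:
    """Hypothetical plug-and-play pipeline: the base captioner is only called, never retrained."""
    # Step 1: obtain the original caption from the unchanged base model.
    draft = base_captioner(media, instruction)
    # Step 2: extract modality features from the raw input.
    features = encoder(media)
    # Step 3: the refiner sees instruction, features, and draft together.
    return refiner(instruction, features, draft)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_base(media: bytes, instruction: str) -> str:
    return "A dog runs on a beach."


def toy_encoder(media: bytes) -> Features:
    return [len(media) / 10.0]


def toy_refiner(instruction: str, features: Features, draft: str) -> str:
    # Enforce a single control signal: a one-word answer on request.
    if "one word" in instruction.lower():
        return draft.split()[1].strip(".")
    return draft


if __name__ == "__main__":
    print(refine_caption(b"abcd", "Answer in one word.", toy_base, toy_encoder, toy_refiner))
```

Because the refiner is passed in as a callable, the base captioner can be swapped without touching the rest of the pipeline, which is the property the abstract calls plug-and-play.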

Taxonomy

Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimodal Machine Learning Applications