UniAPO: Unified Multimodal Automated Prompt Optimization

Qipeng Zhu; Yanzhe Chen; Huasong Zhong; Yan Li; Jie Chen; Zhixin Zhang; Junping Zhang; Zhenheng Yang

arXiv:2508.17890·cs.CV·August 26, 2025


TL;DR

UniAPO introduces a unified framework for multimodal prompt optimization that addresses visual token inflation and the lack of process-level supervision, improving performance across text, image, and video tasks.

Contribution

UniAPO is presented as the first framework to tailor automated prompt optimization to multimodal tasks. It uses an EM-inspired optimization process and memory mechanisms to improve stability and transferability.

Findings

01. Achieves consistent improvements on multimodal benchmarks

02. Addresses the visual token inflation and process-level supervision challenges

03. Demonstrates effectiveness across text, image, and video tasks

Abstract

Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks, such as video-language generation, introduces two core challenges: (i) visual token inflation, where long visual token sequences restrict context capacity and result in insufficient feedback signals; (ii) a lack of process-level supervision, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization. We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO adopts an EM-inspired optimization process that decouples feedback modeling and prompt refinement,…
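The abstract describes an EM-inspired process that separates feedback modeling from prompt refinement. As a rough illustration of that alternation only, the toy sketch below runs two decoupled steps in a loop. Everything in it is an assumption made for illustration and is not taken from the paper: the names `model_feedback`, `refine_prompt`, `optimize`, `Memory`, and `REQUIRED_HINTS` are hypothetical, the "task" is a keyword check rather than a multimodal model, and the memory is a plain list rather than whatever mechanism UniAPO actually uses.

```python
from dataclasses import dataclass, field

# Hypothetical toy task: a prompt counts as "good" when it contains these hints.
# A real system would score prompts by running a multimodal model on data.
REQUIRED_HINTS = ("describe motion", "mention objects", "be concise")


@dataclass
class Memory:
    """Hypothetical store of past feedback and past prompt candidates."""
    feedback: list = field(default_factory=list)
    prompts: list = field(default_factory=list)


def score(prompt: str) -> float:
    """Fraction of required hints present in the prompt."""
    return sum(h in prompt for h in REQUIRED_HINTS) / len(REQUIRED_HINTS)


def model_feedback(prompt: str, memory: Memory) -> list:
    """E-like step (assumed): infer what the current prompt is missing."""
    missing = [h for h in REQUIRED_HINTS if h not in prompt]
    memory.feedback.extend(missing)
    return missing


def refine_prompt(prompt: str, feedback: list, memory: Memory) -> str:
    """M-like step (assumed): update the prompt using one piece of feedback."""
    if not feedback:
        return prompt
    candidate = f"{prompt} Please {feedback[0]}."
    memory.prompts.append(candidate)
    return candidate


def optimize(prompt: str, max_rounds: int = 5):
    """Alternate the two steps until no feedback remains or rounds run out."""
    memory = Memory()
    for _ in range(max_rounds):
        feedback = model_feedback(prompt, memory)
        if not feedback:
            break
        prompt = refine_prompt(prompt, feedback, memory)
    return prompt, memory


if __name__ == "__main__":
    final_prompt, mem = optimize("Caption the video.")
    print(final_prompt)
    print(score(final_prompt), len(mem.prompts))
```

Keeping the two steps as separate functions means either one could be swapped out (for example, a different feedback source) without touching the other, which is the only property of the described design this sketch tries to mirror.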
