Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation
Xiaoxin Lu, Ranran Haoran Zhang, Yusen Zhang, Rui Zhang

TL;DR
This paper introduces a step-by-step framework for generating high-quality, consistent, and coherent text-image plans for daily tasks, leveraging large-scale models and a new benchmark.
Contribution
It proposes a novel iterative framework for text-image plan generation that improves multimodal consistency and coherence, adaptable to various backbone models.
Findings
The framework improves multimodal consistency and coherence in generated text-image plans.
It outperforms baseline models on the newly introduced benchmark.
It is compatible with multiple large-scale backbone models.
Abstract
People learn about daily task plans through diverse media involving both text and images. However, most prior research focuses only on LLMs' capability for textual plan generation. The potential of large-scale models to provide text-image plans remains understudied. Generating high-quality text-image plans faces two main challenges: ensuring consistent alignment between the two modalities and maintaining coherence among visual steps. To address these challenges, we propose a novel framework that generates and refines text-image plans step by step. At each iteration, our framework (1) drafts the next textual step based on the prediction history; (2) edits the last visual step to obtain the next one; (3) extracts PDDL-like visual information; and (4) refines the draft with the extracted visual information. The textual and visual steps produced in stages (4) and (2) will then serve as inputs…
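The four stages in the abstract can be read as a loop. The sketch below is only an illustration of that control flow, not the authors' implementation: the function names (`draft_text`, `edit_image`, `extract_visual_info`, `refine_text`), the `Step` type, and the `max_steps` stopping rule are all hypothetical, and the models behind each stage are passed in as plain callables. Since the abstract is cut off, the sketch assumes that the refined text from stage (4) and the image from stage (2) feed the next iteration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    text: str   # refined textual step, produced in stage (4)
    image: str  # stand-in for the visual step from stage (2); a real system would hold an image


def generate_plan(
    task: str,
    draft_text: Callable[[str, List[Step]], str],        # hypothetical stage (1) model
    edit_image: Callable[[Optional[str], str], str],     # hypothetical stage (2) model
    extract_visual_info: Callable[[str], str],           # hypothetical stage (3) model
    refine_text: Callable[[str, str], str],              # hypothetical stage (4) model
    max_steps: int = 3,                                  # assumed stopping rule
) -> List[Step]:
    history: List[Step] = []
    last_image: Optional[str] = None
    for _ in range(max_steps):
        draft = draft_text(task, history)        # (1) draft the next textual step from the history
        image = edit_image(last_image, draft)    # (2) edit the last visual step to get the next one
        info = extract_visual_info(image)        # (3) extract PDDL-like visual information
        text = refine_text(draft, info)          # (4) refine the draft with that information
        history.append(Step(text=text, image=image))
        last_image = image                       # assumption: outputs feed the next iteration
    return history


# Toy stand-ins so the loop can be run without any model.
def toy_draft(task: str, history: List[Step]) -> str:
    return f"step {len(history) + 1} of {task}"


def toy_edit(prev_image: Optional[str], draft: str) -> str:
    return f"img({draft})"


def toy_extract(image: str) -> str:
    return f"(visible {image})"


def toy_refine(draft: str, info: str) -> str:
    return f"{draft} [{info}]"


plan = generate_plan("make tea", toy_draft, toy_edit, toy_extract, toy_refine, max_steps=2)
for step in plan:
    print(step.text, "|", step.image)
```

Passing the stage models in as callables reflects the summary's claim that the framework is adaptable to different backbone models; swapping a backbone would mean swapping a callable, under the assumptions stated above.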
Taxonomy
Topics: Multimodal Machine Learning Applications
