TL;DR
BookAgent is a multi-agent framework that generates safe, coherent visual storybooks from user drafts by jointly planning, illustrating, and repairing inconsistencies, advancing multi-modal narrative creation.
Contribution
It introduces a holistic, safety-aware multi-agent system for end-to-end storybook synthesis, addressing multi-modal grounding and child-specific safety constraints.
Findings
Outperforms existing methods in narrative coherence and visual consistency.
Ensures safety compliance in multi-modal story generation.
Effectively calibrates multi-modal alignment and global storytelling consistency.
Abstract
Recent advances in Large Generative Models (LGMs) have revolutionized multi-modal generation. However, generating illustrated storybooks remains an open challenge: prior works mainly decompose the task into separate stages, so holistic multi-modal grounding remains limited. Moreover, while safety alignment has been studied for text-only and image-only generation, existing works rarely integrate child-specific safety constraints into narrative planning and sequence-level multi-modal verification. To address these limitations, we propose BookAgent, a safety-aware multi-agent collaboration framework for high-quality visual narratives. Unlike prior story visualization models that assume a fixed storyline sequence, BookAgent targets end-to-end storybook synthesis from a user draft by jointly planning, scripting, illustrating, and globally repairing…
