TL;DR
GEMS is an agent-native multimodal generation framework that pairs a multi-agent loop with memory and skills, so that a generation model can handle complex instructions and specialized downstream tasks that the foundation model alone struggles with.
Contribution
It introduces a structured multi-agent framework with persistent, trajectory-level memory and domain-specific skills to improve multimodal generation on both general and downstream tasks.
Findings
GEMS outperforms existing models on five main tasks and four downstream tasks.
A lightweight 6B model surpasses state-of-the-art larger models on GenEval2.
The framework achieves significant performance gains across multiple generative backends.
Abstract
Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy.…
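As a rough illustration of the two components the abstract describes, the sketch below pairs a closed-loop generate/verify/refine cycle with a two-level memory that keeps raw per-round states alongside short experiential notes. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `TrajectoryMemory`, `agent_loop`, `generate`, `verify`, `refine`, `max_rounds`, and `threshold` are all hypothetical, and the scoring and stopping rule are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TrajectoryMemory:
    """Hypothetical two-level memory: factual states plus compressed notes."""
    states: List[Tuple[str, float]] = field(default_factory=list)  # (prompt, score) per round
    summaries: List[str] = field(default_factory=list)             # short experiential notes

    def record(self, prompt: str, score: float, note: str) -> None:
        self.states.append((prompt, score))
        # Skip a note identical to the most recent one to limit redundancy.
        if not self.summaries or self.summaries[-1] != note:
            self.summaries.append(note)

    def best(self) -> Tuple[str, float]:
        # Highest-scoring round seen so far.
        return max(self.states, key=lambda s: s[1])


def agent_loop(
    instruction: str,
    generate: Callable[[str], str],                      # assumed: prompt -> output
    verify: Callable[[str, str], Tuple[float, str]],     # assumed: (instruction, output) -> (score, note)
    refine: Callable[[str, TrajectoryMemory], str],      # assumed: (instruction, memory) -> new prompt
    max_rounds: int = 4,
    threshold: float = 0.9,
) -> TrajectoryMemory:
    """Closed-loop optimization: generate, verify, record, refine, repeat."""
    memory = TrajectoryMemory()
    prompt = instruction
    for _ in range(max_rounds):
        output = generate(prompt)
        score, note = verify(instruction, output)
        memory.record(prompt, score, note)
        if score >= threshold:
            break  # good enough, stop early
        prompt = refine(instruction, memory)
    return memory
```

In this sketch `refine` receives the whole memory rather than only the last round, which is one way to give the refinement step the "global view of the optimization process" the abstract mentions; how GEMS actually structures this is not stated in the text above.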
