GEMS: Agent-Native Multimodal Generation with Memory and Skills

Zefeng He; Siyuan Huang; Xiaoye Qu; Yafu Li; Tong Zhu; Yu Cheng; Yang Yang

arXiv:2603.28088 · cs.CV · March 31, 2026



TL;DR

GEMS is an agent-native multimodal generation framework that combines a multi-agent loop, persistent memory, and domain-specific skills to push past the limits of foundational models on complex instructions and specialized downstream tasks.

Contribution

GEMS introduces a structured multi-agent framework with persistent, trajectory-level memory and domain-specific skills to improve multimodal generation on both general and downstream tasks.

Findings

01

GEMS outperforms existing models on five main tasks and four downstream tasks.

02

A lightweight 6B model surpasses state-of-the-art larger models on GenEval2.

03

The framework achieves significant performance gains across multiple generative backends.

Abstract

Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy.…
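The closed-loop idea and the two-level memory described in the abstract can be illustrated with a minimal sketch. This is not the GEMS implementation: the abstract is truncated here, so the agent roles, the memory schema, and the skills component are not shown. Every name below (`TrajectoryMemory`, `agent_loop`, the `toy_*` stand-ins, the keyword-based score) is a hypothetical placeholder, and a plain string stands in for a generated image.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TrajectoryMemory:
    """Hypothetical two-level memory: factual states plus compressed notes."""
    states: List[Tuple[str, float]] = field(default_factory=list)  # (prompt, score) per round
    summaries: List[str] = field(default_factory=list)             # short experiential notes

    def record(self, prompt: str, score: float, note: str) -> None:
        self.states.append((prompt, score))
        # Skip a note identical to the previous one to limit redundancy.
        if not self.summaries or self.summaries[-1] != note:
            self.summaries.append(note)

    def best(self) -> Tuple[str, float]:
        # Global view over the whole trajectory, not just the last round.
        return max(self.states, key=lambda s: s[1])


def agent_loop(
    instruction: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Tuple[float, str]],
    refine: Callable[[str, TrajectoryMemory], str],
    max_rounds: int = 4,
    target: float = 1.0,
) -> TrajectoryMemory:
    """Generate, verify, and refine until the score reaches `target`."""
    memory = TrajectoryMemory()
    prompt = instruction
    for _ in range(max_rounds):
        output = generate(prompt)
        score, note = verify(instruction, output)
        memory.record(prompt, score, note)
        if score >= target:
            break
        prompt = refine(instruction, memory)
    return memory


# Toy stand-ins (assumptions for illustration only).
REQUIRED = ("red", "cube", "left")


def toy_generate(prompt: str) -> str:
    return prompt.lower()  # the "image" is just the lowercased prompt


def toy_verify(instruction: str, output: str) -> Tuple[float, str]:
    missing = [w for w in REQUIRED if w not in output]
    score = 1 - len(missing) / len(REQUIRED)
    note = "missing: " + ", ".join(missing) if missing else "all constraints met"
    return score, note


def toy_refine(instruction: str, memory: TrajectoryMemory) -> str:
    # Extend the best prompt so far with the first missing keyword.
    first_missing = memory.summaries[-1].split(": ", 1)[1].split(", ")[0]
    best_prompt, _ = memory.best()
    return best_prompt + " " + first_missing
```

Calling `agent_loop("a cube", toy_generate, toy_verify, toy_refine)` runs the loop with these toy components. In a real system the three callables would presumably wrap a generative backend, a verifier model, and a prompt-rewriting agent, which is an assumption about how such a loop is typically wired rather than a claim about GEMS.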


Code & Models

Repositories

lcqysl/GEMS (GitHub)
