Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation

Shihao Zhao; Yitong Chen; Zeyinzi Jiang; Bojia Zi; Shaozhe Hao; Yu Liu; Chaojie Mao; Kwan-Yee K. Wong

arXiv:2512.07747·cs.CV·December 9, 2025

TL;DR

Unison is a low-cost, fully automatic multimodal framework that unifies understanding and generation tasks across text, image, and video, with automatic task parsing and parameter extraction, achieving high performance with minimal training resources.

Contribution

The paper introduces a low-cost, fully automatic multimodal framework that covers diverse tasks and automatically parses user intent, improving on prior methods that require manual configuration or heavy training resources.

Findings

01

Achieves high accuracy with only 500k samples and 50 GPU hours

02

Automatically identifies task types and extracts meta-information

03

Performs well across understanding and generation tasks

Abstract

Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely…
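The meta-information the abstract mentions (task type, image resolution, video duration) can be pictured as a small structured record extracted from a free-form request. The sketch below is only an illustration of that idea under loud assumptions: the `TaskMeta` record, the `parse_request` function, the task labels, and the keyword and regex rules are all hypothetical, and a rule-based toy like this is not how Unison itself performs parsing.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskMeta:
    """Hypothetical record for the meta-information named in the abstract."""
    task_type: str                      # illustrative labels, not Unison's
    width: Optional[int] = None         # image/video width in pixels
    height: Optional[int] = None        # image/video height in pixels
    duration_s: Optional[float] = None  # video duration in seconds


def parse_request(text: str) -> TaskMeta:
    """Toy rule-based stand-in for automatic task and parameter parsing."""
    lowered = text.lower()

    # Assumption: a generation verb signals a generation task,
    # anything else is treated as an understanding task.
    if re.search(r"\b(generate|create|draw|make)\b", lowered):
        task = "text-to-video" if "video" in lowered else "text-to-image"
    else:
        task = "understanding"
    meta = TaskMeta(task_type=task)

    # Resolution written as WIDTHxHEIGHT, e.g. "1280x720".
    res = re.search(r"(\d{2,5})\s*[x×]\s*(\d{2,5})", lowered)
    if res:
        meta.width, meta.height = int(res.group(1)), int(res.group(2))

    # Duration written as "5 seconds", "5 sec", or "5s".
    dur = re.search(r"(\d+(?:\.\d+)?)\s*(?:seconds?|secs?|s)\b", lowered)
    if dur:
        meta.duration_s = float(dur.group(1))

    return meta


if __name__ == "__main__":
    print(parse_request("Generate a 5 second video of a cat at 1280x720"))
    print(parse_request("What is shown in this image?"))
```

The point of the structured record is that downstream understanding or generation models could be configured from it without the user setting parameters by hand, which is the manual step the abstract describes as tedious.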

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning