Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation
Shihao Zhao, Yitong Chen, Zeyinzi Jiang, Bojia Zi, Shaozhe Hao, Yu Liu, Chaojie Mao, Kwan-Yee K. Wong

TL;DR
Unison is a low-cost, fully automatic multimodal framework that unifies understanding and generation across text, image, and video. It parses the task type and extracts the required parameters from the user's input automatically, and it reports strong performance with only 500k training samples and 50 GPU hours.
Contribution
Introduces a low-cost, fully automatic multimodal framework that covers diverse understanding and generation tasks and parses user intent automatically, avoiding the manual parameter configuration and heavy training cost of prior approaches.
Findings
Achieves high accuracy with only 500k samples and 50 GPU hours
Automatically identifies task types and extracts meta-information
Performs well across understanding and generation tasks
Abstract
Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely…
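As an illustration of what "parsing input meta-information" could mean in practice, here is a minimal Python sketch. It is not the paper's method: the abstract does not say how Unison extracts this information, so the keyword rules and regular expressions below are a deliberately simple stand-in. The names `MetaInfo` and `parse_request`, the task labels, and the field names are all hypothetical and chosen only for this example; the three fields mirror the examples the abstract gives (task type, image resolution, video duration).

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MetaInfo:
    """Hypothetical container for meta-information parsed from a request."""
    task_type: str                                # e.g. "text-to-video"
    resolution: Optional[Tuple[int, int]] = None  # (width, height) in pixels
    duration_s: Optional[float] = None            # video duration in seconds


def parse_request(text: str) -> MetaInfo:
    """Toy rule-based parser (an assumption, not Unison's actual mechanism)."""
    lowered = text.lower()

    # Pick a task label from simple keyword cues.
    if "video" in lowered:
        task = "text-to-video"
    elif any(cue in lowered for cue in ("draw", "generate", "image of")):
        task = "text-to-image"
    else:
        task = "understanding"

    # Look for a resolution written like "1280x720".
    res = re.search(r"(\d{2,5})\s*x\s*(\d{2,5})", lowered)
    # Look for a duration written like "5 seconds", "5s", or "5-second".
    dur = re.search(r"(\d+(?:\.\d+)?)[\s-]*(?:s|sec|seconds?)\b", lowered)

    return MetaInfo(
        task_type=task,
        resolution=(int(res.group(1)), int(res.group(2))) if res else None,
        duration_s=float(dur.group(1)) if dur else None,
    )


if __name__ == "__main__":
    print(parse_request("Generate a 5-second video at 1280x720 of a cat"))
    print(parse_request("What is in this picture?"))
```

The point of the sketch is only the output structure: a single free-form request is turned into the task type and generation parameters that a user would otherwise have to configure by hand.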
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
