TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

Yinyi Luo; Wenwen Wang; Hayes Bai; Hongyu Zhu; Hao Chen; Pan He; Marios Savvides; Sharon Li; Jindong Wang

arXiv:2604.10784·cs.AI·May 21, 2026

TL;DR

TorchUMM is a unified codebase for evaluating, analyzing, and post-training unified multimodal models (UMMs) across diverse tasks, datasets, and architectures.

Contribution

TorchUMM is presented as the first unified framework supporting diverse UMM backbones, tasks, and datasets under standardized evaluation protocols.

Findings

01. Supports a broad spectrum of models and tasks.

02. Enables fair and reproducible comparisons.

03. Facilitates deeper insights into model strengths and limitations.

Abstract

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing…

Code & Models

Repositories

AIFrontierLab/TorchUMM (GitHub)
