MAGE: Modality-Agnostic Music Generation and Editing
Muhammad Usama Saleem, Tejasvi Ravi, Tianyu Xu, Rajeev Nongpiur, Ishan Chatterjee, Mayur Jagdishbhai Patel, Pu Wang

TL;DR
MAGE is a unified, flexible framework for multimodal music generation and editing that leverages a flow-based Transformer and cross-modal grounding techniques to handle ambiguous, misaligned, or missing guidance.
Contribution
MAGE introduces a modality-agnostic, continuous latent model with novel alignment and control mechanisms, enabling robust multimodal music creation and editing without requiring multiple specialized models.
Findings
Supports effective multimodal-guided music generation and editing.
Achieves competitive quality with a lightweight, flexible interface.
Handles missing modalities through dynamic modality masking during training.
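As an illustration of what modality masking during training can look like, here is a minimal sketch. It is not the authors' implementation: the function name `mask_modalities`, the `drop_prob` parameter, the modality names, and the per-modality independent drop rule are all assumptions made for this example. The idea is that each available condition is randomly hidden at training time, so the model sees many subsets of conditions and learns to cope when some are absent at inference.

```python
import random


def mask_modalities(conditions, drop_prob=0.3, rng=None):
    """Randomly hide available conditions (hypothetical helper, not from the paper).

    conditions: dict mapping a modality name to its features, or None if absent.
    drop_prob:  assumed probability of hiding each available modality.
    Returns the masked dict and a dict of presence flags.
    """
    rng = rng if rng is not None else random.Random()
    masked = {}
    for name, value in conditions.items():
        # Only an available modality can be dropped; absent ones stay None.
        dropped = value is not None and rng.random() < drop_prob
        masked[name] = None if dropped else value
    # Presence flags let a downstream model tell "missing" from "given".
    present = {name: value is not None for name, value in masked.items()}
    return masked, present


# Example call with made-up modality names and toy features.
example = {"text": "calm piano", "video": [0.1, 0.2], "mixture": None}
masked, present = mask_modalities(example, drop_prob=0.5, rng=random.Random(0))
```

In a real training loop the `None` entries would typically be replaced by a learned or zero placeholder embedding before fusion; that choice is also an assumption and is left out of the sketch.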
Abstract
Multimodal music creation requires models that can both generate audio from high-level cues and edit existing mixtures in a targeted manner. Yet most multimodal music systems are built for a single task and a fixed prompting interface, making their conditioning brittle when guidance is ambiguous, temporally misaligned, or partially missing. Common additive fusion or feature concatenation further weakens cross-modal grounding, often causing prompt drift and spurious musical content during generation and editing. We propose MAGE, a modality-agnostic framework that unifies multimodal music generation and mixture-grounded editing within a single continuous latent formulation. At its core, MAGE uses a Controlled Multimodal FluxFormer, a flow-based Transformer that learns controllable latent trajectories for synthesis and editing under any available subset of conditions. To improve grounding,…
