Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

TL;DR
This paper introduces a simple, lightweight transformer model operating in discrete audio-visual spaces, achieving state-of-the-art results in image2audio and audio2image generation without complex large models.
Contribution
Proposes a modality-symmetrical transformer for multi-modal generation that surpasses recent methods in image2audio tasks, using a mask denoising training approach.
Findings
Outperforms recent image2audio generation methods
Operates effectively in discrete audio-visual spaces
Can be directly used for audio2image and co-generation
Abstract
In recent years, diffusion-based generative models have gained huge attention in both visual and audio generation, owing to their realistic results and wide range of personalized applications. Compared to the considerable advances in text2image and text2audio generation, research in audio2visual and visual2audio generation has been relatively slow. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple and lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space, and is trained in a mask denoising manner. After…
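The mask denoising idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that images and audio have already been encoded into lists of integer codebook indices by their respective VQGANs, and it only shows how a training input and an image2audio inference input might be built. The names `MASK_ID`, `mask_tokens`, `build_training_example`, and `image2audio_input` are hypothetical, and the uniformly sampled mask ratio is an assumption rather than something the abstract specifies.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token (assumption)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of discrete tokens with MASK_ID.

    Returns (corrupted, targets). targets[i] holds the original token at
    masked positions and None elsewhere, so a loss would be computed only
    on the positions the model has to reconstruct.
    """
    n = len(tokens)
    if n == 0:
        return [], []
    # Mask at least one token and never more than the sequence length.
    n_mask = min(n, max(1, round(mask_ratio * n)))
    masked = set(rng.sample(range(n), n_mask))
    corrupted = [MASK_ID if i in masked else t for i, t in enumerate(tokens)]
    targets = [t if i in masked else None for i, t in enumerate(tokens)]
    return corrupted, targets


def build_training_example(image_tokens, audio_tokens, rng):
    """Concatenate both modalities and mask them jointly.

    The mask ratio is drawn uniformly per example; this schedule is an
    assumption made for the sketch.
    """
    ratio = rng.uniform(0.1, 1.0)
    sequence = list(image_tokens) + list(audio_tokens)
    return mask_tokens(sequence, ratio, rng)


def image2audio_input(image_tokens, num_audio_tokens):
    """At inference, keep the image tokens and mask every audio slot."""
    return list(image_tokens) + [MASK_ID] * num_audio_tokens


if __name__ == "__main__":
    rng = random.Random(0)
    corrupted, targets = build_training_example([11, 12, 13], [21, 22, 23], rng)
    print(corrupted)
    print(targets)
    print(image2audio_input([11, 12, 13], 3))
```

Because both modalities share one token sequence in this sketch, swapping which half is fully masked would give an audio2image input, and masking both halves would give a co-generation input. A transformer (not shown) would then predict the original token at each masked position.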
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing
Methods: Diffusion
