Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Shiqi Yang; Zhi Zhong; Mengjie Zhao; Shusuke Takahashi; Masato Ishii; Takashi Shibuya; Yuki Mitsufuji

arXiv:2405.14598·cs.CV·May 27, 2024

Open Access

TL;DR

This paper introduces a simple, lightweight transformer that operates in discrete audio-visual token spaces. It outperforms recent image2audio generation methods and also supports audio2image generation, without relying on large, complex models.

Contribution

Proposes a modality-symmetrical transformer for multi-modal generation, trained with mask denoising, that surpasses recent methods on image2audio generation.
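As a rough illustration of what mask denoising over discrete tokens can look like, the sketch below corrupts a sequence of audio code indices while keeping the image code indices clean as the condition. It is a minimal sketch under stated assumptions, not the paper's implementation: the `MASK_ID` sentinel, the function names, the fixed mask ratio, and the choice to mask only the audio half are all hypothetical and are not taken from this listing.

```python
import random

MASK_ID = -1  # hypothetical sentinel standing in for a [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of discrete tokens with MASK_ID.

    Returns (corrupted, masked_positions). A denoising model would be
    trained to predict tokens[i] for every i in masked_positions.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    n_mask = round(mask_ratio * len(tokens))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = MASK_ID
    return corrupted, masked_positions


def make_training_example(image_tokens, audio_tokens, mask_ratio, rng):
    """Build one (input, targets) pair for an image2audio-style setup.

    Assumption: image tokens are left clean and only audio tokens are masked.
    targets maps a position in the joint sequence to the original audio token.
    """
    corrupted_audio, positions = mask_tokens(audio_tokens, mask_ratio, rng)
    model_input = list(image_tokens) + corrupted_audio
    offset = len(image_tokens)
    targets = {offset + i: audio_tokens[i] for i in positions}
    return model_input, targets


if __name__ == "__main__":
    rng = random.Random(0)
    image_tokens = [5, 7, 9]           # made-up codebook indices
    audio_tokens = [1, 2, 3, 4, 5, 6]  # made-up codebook indices
    model_input, targets = make_training_example(image_tokens, audio_tokens, 0.5, rng)
    print(model_input)
    print(targets)
```

Because the two halves of the joint sequence are treated the same way by `mask_tokens`, swapping which modality is masked (or masking both) would give audio2image or co-generation style examples under the same assumptions.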

Findings

01. Outperforms recent image2audio generation methods
02. Operates effectively in discrete audio-visual spaces
03. Can be directly used for audio2image and co-generation

Abstract

In recent years, diffusion-based generative models have gained huge attention in both visual and audio generation, owing to their realistic results and wide range of personalized applications. Compared to the considerable advances in text2image and text2audio generation, research in audio2visual and visual2audio generation has been relatively slow. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple and lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space and is trained in a mask denoising manner. After…

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing

Methods: Diffusion