UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models

Sen Fang; Bowen Gao; Yangjian Wu; Teik Toe Teoh

arXiv:2307.15898 · cs.SD · September 12, 2023

TL;DR

UniBriVL is a universal multimodal model that embeds audio, image, and text into a shared space, enabling robust cross-modal applications and image generation from audio signals.

Contribution

The paper introduces UniBriVL, a novel universal representation learning method that effectively captures audio-image correlations and supports multimodal tasks.

Findings

01. Demonstrates effective embedding of audio, image, and text in a shared space.

02. Shows capability to generate images from audio inputs.

03. Reports efficacy in downstream multimodal tasks.

Abstract

Multimodal large models are recognized for their strong performance across a variety of downstream tasks. Developing these models is a crucial step toward achieving general artificial intelligence. In this paper, we propose a novel universal language representation learning method called UniBriVL, which is based on Bridging-Vision-and-Language (BriVL). Universal BriVL embeds audio, image, and text into a shared space, enabling a variety of multimodal applications. Our approach addresses major challenges in robust language (both text and audio) representation learning and effectively captures the correlation between audio and image. Additionally, we present a qualitative evaluation of the images generated by UniBriVL, which highlights the potential of our approach for creating images from audio. Overall, our experimental results…
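
To make the idea of a shared embedding space concrete, here is a minimal sketch of cross-modal retrieval: an audio embedding is compared against image embeddings by cosine similarity, and the images are ranked. This is an illustration only, not the UniBriVL implementation. The vectors are hand-written stand-ins for encoder outputs, and the function names (`l2_normalize`, `cosine_similarity`, `retrieve`) and file names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, emb))
              for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumption: toy 3-dimensional embeddings standing in for the outputs of
# an audio encoder and an image encoder that map into the same space.
audio_embedding = [0.9, 0.1, 0.0]  # e.g. a clip of a dog barking
image_embeddings = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "piano.jpg": [0.1, 0.9, 0.2],
    "ocean.jpg": [0.0, 0.2, 0.9],
}

ranking = retrieve(audio_embedding, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because all modalities live in one space, the same `retrieve` call would work with a text embedding as the query. In a real system the embeddings would come from trained encoders rather than hand-written lists.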

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech Recognition and Synthesis