UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models
Sen Fang, Bowen Gao, Yangjian Wu, Teik Toe Teoh

TL;DR
UniBriVL is a universal multimodal model that embeds audio, image, and text into a shared space, enabling robust cross-modal applications and image generation from audio signals.
Contribution
The paper introduces UniBriVL, a universal representation learning method built on Bridging-Vision-and-Language (BriVL) that captures audio-image correlations and supports multimodal tasks.
Findings
Demonstrates effective embedding of audio, image, and text in a shared space.
Shows capability to generate images from audio inputs.
Reports effectiveness on downstream multimodal tasks.
Abstract
Multimodal large models have been recognized for their advantages in various performance and downstream tasks. The development of these models is crucial towards achieving general artificial intelligence in the future. In this paper, we propose a novel universal language representation learning method called UniBriVL, which is based on Bridging-Vision-and-Language (BriVL). Universal BriVL embeds audio, image, and text into a shared space, enabling the realization of various multimodal applications. Our approach addresses major challenges in robust language (both text and audio) representation learning and effectively captures the correlation between audio and image. Additionally, we demonstrate the qualitative evaluation of the generated images from UniBriVL, which serves to highlight the potential of our approach in creating images from audio. Overall, our experimental results…
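To illustrate what a shared embedding space makes possible, here is a minimal sketch of audio-to-image retrieval by cosine similarity. It is not the UniBriVL implementation: the function names `cosine_similarity` and `retrieve`, the file names, and the 3-dimensional vectors are all hypothetical stand-ins for real encoder outputs, which the abstract does not specify.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, candidates):
    """Return candidate names ranked by similarity to the query, best first."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )

# Toy, hand-written vectors standing in for image-encoder outputs
# (hypothetical values, not produced by any real model).
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "rain.jpg": [0.1, 0.9, 0.1],
    "piano.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a barking sound from an audio encoder
# that maps into the same space as the image encoder.
audio_query = [0.8, 0.2, 0.1]

ranking = retrieve(audio_query, image_embeddings)
print(ranking[0])  # dog.jpg
```

Because every modality lands in the same vector space, the same `retrieve` call would work unchanged with a text embedding as the query, which is the practical appeal of a universal representation.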
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech Recognition and Synthesis
