TL;DR
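Woosh is Sony AI's publicly released sound effects foundation model. It combines an audio encoder/decoder, a text-audio alignment model, and text-to-audio and video-to-audio generators, reports competitive or better performance than open alternatives such as StableAudio-Open and TangoFlux, and ships with code and demos.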
Contribution
An open foundation model optimized for sound effects, composed of multiple modules and publicly released with code and demos.
Findings
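Competitive or better performance for each module compared with open alternatives such as StableAudio-Open and TangoFlux, evaluated on both public and private data
Four modules: audio encoder/decoder, text-audio alignment (used for conditioning), text-to-audio, and video-to-audio
Distilled text-to-audio and video-to-audio models allow low-resource operation and fast inference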
Abstract
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and…
