# From Sound to Sight: Towards AI-authored Music Videos

**Authors:** Leo Vitasovic, Stella Graßhof, Agnes Mercedes Kloft, Ville V. Lehtola, Martin Cunneen, Justyna Starostka, Glenn McGarry, Kun Li, Sami S. Brandt

arXiv: 2509.00029 · 2025-09-03

## TL;DR

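This paper introduces two pipelines, built from off-the-shelf deep learning models, that automatically generate a music video for any vocal or instrumental song: the audio is analysed for musical qualities, a language model turns those qualities into textual scene descriptions, and a generative model renders the corresponding video clips. The aim is to move beyond the limited expressiveness of handcrafted music visualisation.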

## Contribution

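The paper presents two novel pipelines that combine latent feature-based audio analysis, a language model, and a generative video model to create expressive music videos automatically from arbitrary songs, together with a preliminary user evaluation of the results.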

## Key findings

- Latent feature-based techniques are used to detect musical qualities in the audio, such as emotional cues and instrumental patterns.
- In a preliminary user evaluation, the generated videos show storytelling potential and visual coherence.
- The same evaluation indicates that the videos align emotionally with the music.

## Abstract

Conventional music visualisation systems rely on handcrafted ad hoc transformations of shapes and colours that offer only limited expressiveness. We propose two novel pipelines for automatically generating music videos from any user-specified, vocal or instrumental song using off-the-shelf deep learning models. Inspired by the manual workflows of music video producers, we experiment on how well latent feature-based techniques can analyse audio to detect musical qualities, such as emotional cues and instrumental patterns, and distil them into textual scene descriptions using a language model. Next, we employ a generative model to produce the corresponding video clips. To assess the generated videos, we identify several critical aspects and design and conduct a preliminary user evaluation that demonstrates storytelling potential, visual coherency and emotional alignment with the music. Our findings underscore the potential of latent feature techniques and deep generative models to expand music visualisation beyond traditional approaches.
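The three stages described in the abstract (audio analysis, textual scene description, clip generation) can be pictured as a simple chain. The sketch below is only an illustration under stated assumptions: `SegmentAnalysis`, `describe_scene`, `build_video_plan` and `fake_generate_clip` are hypothetical names that do not come from the paper, the template string stands in for the language model, and the clip generator returns metadata rather than calling a real video model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SegmentAnalysis:
    """Hypothetical per-segment output of the audio analysis stage."""
    start_s: float
    end_s: float
    emotion: str            # e.g. "calm"
    instruments: List[str]  # e.g. ["piano"]


def describe_scene(segment: SegmentAnalysis) -> str:
    """Stand-in for the language-model step: qualities -> scene description."""
    instruments = ", ".join(segment.instruments) or "unspecified instruments"
    duration = segment.end_s - segment.start_s
    return (
        f"A {segment.emotion} scene lasting {duration:.1f} seconds, "
        f"reflecting music played on {instruments}."
    )


def build_video_plan(
    segments: List[SegmentAnalysis],
    generate_clip: Callable[[str, float], dict],
) -> List[dict]:
    """Chain the stages: analysis -> description -> clip generation."""
    return [
        generate_clip(describe_scene(s), s.end_s - s.start_s)
        for s in segments
    ]


def fake_generate_clip(prompt: str, duration_s: float) -> dict:
    """Placeholder for a generative video model; returns metadata only."""
    return {"prompt": prompt, "duration_s": duration_s}


# Assumed analysis results for two segments of a song.
segments = [
    SegmentAnalysis(0.0, 8.0, "calm", ["piano"]),
    SegmentAnalysis(8.0, 20.0, "energetic", ["drums", "electric guitar"]),
]
plan = build_video_plan(segments, fake_generate_clip)
for clip in plan:
    print(clip["duration_s"], clip["prompt"])
```

Passing the clip generator in as a callable keeps the sketch independent of any particular model; a real system would replace both the template and the placeholder generator with the models chosen in the full paper.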

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00029/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00029/full.md

## References

62 references — full list in the complete paper: https://tomesphere.com/paper/2509.00029/full.md

---
Source: https://tomesphere.com/paper/2509.00029