UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

Neta Glazer; Aviv Navon; Yael Segal; Aviv Shamsian; Hilit Segev; Asaf Buchnick; Menachem Pirchi; Gil Hetz; Joseph Keshet

arXiv:2506.09874 · cs.SD · July 14, 2025

Open Access

TL;DR

UmbraTTS is a flow-matching based TTS model that jointly synthesizes speech and environmental sounds, enabling context-aware audio generation with fine control over background elements, even without paired training data.

Contribution

The paper introduces UmbraTTS, a novel flow-matching TTS model that generates speech and environmental audio together, using a self-supervised framework to handle unpaired data.

Findings

01 Outperforms existing baselines in naturalness and environmental awareness

02 Produces diverse and coherent audio scenes

03 Allows fine-grained control over background volume

Abstract

Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching-based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data in which speech and background audio are aligned in a natural context. To overcome this, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environmentally aware audio.
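
The abstract names flow matching as the generative backbone but does not spell out the objective. As a rough illustration only, the sketch below shows the common linear-path form of conditional flow matching: a noise sample is interpolated toward a data sample, and a network is trained to regress the constant velocity between them. This is an assumption about the general technique, not the paper's exact formulation; the function names (`cfm_training_pair`, `cfm_loss`, `euler_sample`) are hypothetical, and the text and acoustic-context conditioning that UmbraTTS uses is omitted.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one training example for linear-path flow matching (illustrative).

    x1 stands in for a data sample such as mel-spectrogram frames, shape (T, D).
    Returns the interpolated point x_t, the time t, and the velocity target.
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform()                    # time drawn uniformly from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight line x0 -> x1
    target = x1 - x0                     # velocity of that line (constant in t)
    return x_t, t, target

def cfm_loss(predicted_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target) ** 2))

def euler_sample(velocity_fn, shape, rng, steps=16):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 to t=1."""
    x = rng.standard_normal(shape)       # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In a real system `velocity_fn` would be a trained neural network that also receives the conditioning inputs; here it is left as a plain callable so the sketch stays self-contained.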

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Attentive Walk-Aggregating Graph Neural Network