Instance-Specific Test-Time Training for Speech Editing in the Wild
Taewoo Kim, Uijong Lee, Hayoung Park, Choongsang Cho, Nam In Park, Young Han Lee

TL;DR
This paper introduces an instance-specific test-time training approach for speech editing that adapts to diverse acoustic conditions, ensuring smooth transitions and precise control in real-world scenarios.
Contribution
The paper presents a test-time training method that uses ground-truth acoustic features in unedited regions and auxiliary losses (duration constraints and phoneme prediction) in edited regions to improve speech editing under diverse acoustic conditions.
Findings
Outperforms existing speech editing methods on objective metrics.
Achieves better subjective quality in real-world speech editing.
Enables precise control over speech rate during editing.
Abstract
Speech editing systems aim to naturally modify speech content while preserving acoustic consistency and speaker identity. However, previous studies often struggle to adapt to unseen and diverse acoustic conditions, resulting in degraded editing performance in real-world scenarios. To address this, we propose an instance-specific test-time training method for speech editing in the wild. Our approach employs direct supervision from ground-truth acoustic features in unedited regions and indirect supervision in edited regions via auxiliary losses based on duration constraints and phoneme prediction. This strategy mitigates the bandwidth discontinuity problem in speech editing, ensuring smooth acoustic transitions between unedited and edited regions. Additionally, it enables precise control over speech rate by adapting the model to target durations via mask length adjustment during test-time…
Taxonomy
Topics: Natural Language Processing Techniques
