Test-Time Training for Speech
Sri Harsha Dumpala, Chandramouli Sastry, Sageev Oore

TL;DR
This paper investigates Test-Time Training (TTT) for adapting speech classification models to distribution shifts caused by background noise and speaker variation. It identifies the main challenges and proposes BitFit for more stable adaptation.
Contribution
The paper applies TTT to speech tasks, identifies key challenges, and proposes BitFit as a parameter-efficient approach to stable adaptation.
Findings
BitFit improves stability over full fine-tuning in TTT for speech.
TTT scales poorly because each test example gets its own set of adapted parameters.
TTT is sensitive to optimization hyperparameters, such as the number of optimization steps and the subset of parameters updated.
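To make the scalability and parameter-subset points concrete, here is a minimal sketch of the BitFit selection rule (train only the bias terms, freeze everything else) together with the per-example storage arithmetic. The layer names and shapes in `PARAM_SHAPES` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]

# Hypothetical parameter shapes for a toy speech classifier (an assumption,
# not the architecture used in the paper).
PARAM_SHAPES: Dict[str, Shape] = {
    "encoder.0.weight": (256, 80),
    "encoder.0.bias": (256,),
    "encoder.1.weight": (256, 256),
    "encoder.1.bias": (256,),
    "classifier.weight": (10, 256),
    "classifier.bias": (10,),
}


def num_elements(shape: Shape) -> int:
    """Number of scalar values in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


def count_params(param_shapes: Dict[str, Shape]) -> int:
    """Total number of scalar parameters across all listed tensors."""
    return sum(num_elements(shape) for shape in param_shapes.values())


def bitfit_trainable(param_shapes: Dict[str, Shape]) -> Dict[str, Shape]:
    """BitFit rule: only parameters whose name ends in 'bias' are trained."""
    return {
        name: shape
        for name, shape in param_shapes.items()
        if name.endswith("bias")
    }


def per_example_storage(num_examples: int, params_per_copy: int) -> int:
    """Scalars stored if every test example keeps its own adapted copy."""
    return num_examples * params_per_copy


total_params = count_params(PARAM_SHAPES)
bias_params = count_params(bitfit_trainable(PARAM_SHAPES))

if __name__ == "__main__":
    print("all parameters:", total_params)
    print("bias-only parameters:", bias_params)
    print("per-example copies, full:", per_example_storage(1000, total_params))
    print("per-example copies, bias-only:", per_example_storage(1000, bias_params))
```

In this toy setting the bias terms are a small fraction of the model, so keeping one adapted copy per test example costs far less when only the biases are updated. The actual ratio depends on the architecture.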
Abstract
In this paper, we study the application of Test-Time Training (TTT) as a solution to handling distribution shifts in speech applications. In particular, we introduce distribution shifts to the test datasets of standard speech classification tasks (for example, speaker identification and emotion detection) and explore how TTT can help adjust to the distribution shift. In our experiments, which include distribution shifts due to background noise and natural variations in speech such as gender and age, we identify some key challenges with TTT, including sensitivity to optimization hyperparameters (e.g., number of optimization steps and subset of parameters chosen for TTT) and scalability (e.g., as each example gets its own set of parameters, TTT is not scalable). Finally, we propose using BitFit, a parameter-efficient fine-tuning algorithm proposed for text…
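The per-example adaptation loop described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' method: the linear model, the masking scheme in `mask_features`, and the squared-error reconstruction objective are all hypothetical stand-ins, since this summary does not specify the self-supervised loss. The sketch shows only the two ideas named above: updating the bias alone (BitFit style) and giving each test example its own adapted copy.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mask_features(x: Vector) -> Vector:
    """Toy masking (an assumption): zero out every odd-indexed feature."""
    return [xi if i % 2 == 0 else 0.0 for i, xi in enumerate(x)]


def reconstruct(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Linear model: y = W x + b."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def ttt_bias_only(
    weight: Matrix,
    bias: Vector,
    x: Vector,
    steps: int = 5,
    lr: float = 0.1,
) -> Vector:
    """Adapt a per-example copy of the bias on one test input.

    The weight matrix stays frozen and the shared bias is never modified,
    so the next test example starts again from the original parameters.
    """
    adapted = list(bias)  # per-example copy of the trainable subset
    masked = mask_features(x)
    for _ in range(steps):
        pred = reconstruct(weight, adapted, masked)
        # Gradient of the mean squared error with respect to each bias term.
        grad = [2.0 * (p - t) / len(x) for p, t in zip(pred, x)]
        adapted = [b - lr * g for b, g in zip(adapted, grad)]
    return adapted


if __name__ == "__main__":
    weight = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    shared_bias = [0.0, 0.0, 0.0, 0.0]
    example = [1.0, 2.0, 3.0, 4.0]
    adapted_bias = ttt_bias_only(weight, shared_bias, example)
    masked = mask_features(example)
    print("loss before:", mse(reconstruct(weight, shared_bias, masked), example))
    print("loss after:", mse(reconstruct(weight, adapted_bias, masked), example))
```

The `steps` and `lr` arguments correspond to the optimization hyperparameters the abstract flags as sensitive. Copying the bias per example is what makes storage grow with the number of test inputs.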
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
