Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets
Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han

TL;DR
This paper introduces GOAT, a post-training method that uses GFlowNets to reduce hallucinations in LM-based TTS systems by aligning the model's output distribution with a target distribution, improving accuracy without massive training resources or added inference cost.
Contribution
The paper proposes a distribution alignment framework based on GFlowNets for LM-based TTS that mitigates hallucinations in post-training, without relying on massive training resources or added inference cost.
Findings
Reduced character error rates by over 50% on challenging test cases.
Lowered model uncertainty by up to 58%.
Demonstrated strong generalization and effectiveness.
Abstract
Language Model (LM)-based Text-to-Speech (TTS) systems often generate hallucinated speech that deviates from input text. Existing mitigation strategies either demand excessive training resources or introduce significant inference latency. In this paper, we propose GFlOwNet-guided distribution AlignmenT (GOAT) for LM-based TTS, a post-training framework that mitigates hallucinations without relying on massive resources or inference cost. Specifically, we first conduct an uncertainty analysis, revealing a strong positive correlation between hallucination and model uncertainty. Based on this, we reformulate TTS generation as a trajectory flow optimization problem and introduce an enhanced Subtrajectory Balance objective together with a sharpened internal reward as the target distribution. We further integrate reward temperature decay and learning rate optimization for stability and performance…
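The abstract names a Subtrajectory Balance objective, a sharpened reward, and reward temperature decay, but this summary does not give their exact form. As a rough illustration only, the sketch below implements the standard SubTB(λ) loss from the GFlowNet literature for a single autoregressive trajectory, not the paper's enhanced variant. The function names, the linear decay schedule, and the choice to sharpen the reward by dividing its log by a temperature are all assumptions made for this example.

```python
import numpy as np

def subtb_loss(log_flows, log_pf, lam=0.9):
    """Standard SubTB(lambda) loss for one autoregressive trajectory.

    log_flows: n+1 values, log F(s_0), ..., log F(s_n). For a finished
               sequence the last entry would be its log-reward.
    log_pf:    n values, log P_F(s_{k+1} | s_k) for each generated token.
    Each prefix has exactly one parent, so the backward policy is
    deterministic and its log-probability (0) drops out of the residual.
    """
    log_flows = np.asarray(log_flows, dtype=float)
    log_pf = np.asarray(log_pf, dtype=float)
    n = len(log_pf)
    assert len(log_flows) == n + 1
    # cum[j] - cum[i] is the summed forward log-probability from s_i to s_j.
    cum = np.concatenate([[0.0], np.cumsum(log_pf)])
    total, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            weight = lam ** (j - i)
            residual = log_flows[i] + (cum[j] - cum[i]) - log_flows[j]
            total += weight * residual ** 2
            weight_sum += weight
    return total / weight_sum

def tempered_log_reward(log_reward, temperature):
    """Assumed sharpening: log R^(1/T) = log R / T (T < 1 sharpens)."""
    return log_reward / temperature

def decayed_temperature(step, t_start=1.0, t_end=0.5, total_steps=1000):
    """Hypothetical linear decay of the reward temperature over training."""
    frac = min(step / total_steps, 1.0)
    return t_start + frac * (t_end - t_start)

# Toy trajectory of two tokens whose flows are exactly consistent.
log_pf = np.log([0.5, 0.5])
log_flows = [0.0, np.log(0.5), np.log(0.25)]
consistent = subtb_loss(log_flows, log_pf)

# Sharpening the terminal reward breaks consistency, so the loss is positive.
sharp_flows = [0.0, np.log(0.5), tempered_log_reward(np.log(0.25), 0.5)]
mismatched = subtb_loss(sharp_flows, log_pf)
```

In this toy case the first loss is zero up to floating-point error because every sub-trajectory balances, while the second is positive: minimizing it would push the forward policy toward the sharpened target. A real training loop would compute these quantities from the TTS model's token log-probabilities rather than from hand-written numbers.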
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Multimodal Machine Learning Applications
