TL;DR
This paper introduces a query rewriting framework with multi-LLM knowledge fusion to enhance speech instruction dataset construction, significantly improving data quality and usability for speech synthesis without human annotation.
Contribution
It presents a novel multi-LLM query rewriting method that refines text instructions for better speech synthesis, addressing limitations of current TTS models and reducing reliance on human data annotation.
Findings
Data usability increased from 72% to 93%
Effective zero-shot rewriting for complex instructions
Improved speech synthesis quality and dataset diversity
Abstract
End-to-end Large Speech Language Models (LSLMs) demonstrate strong potential in response latency and speech comprehension capabilities, showcasing general intelligence across speech understanding tasks. However, the ability to follow speech instructions has not been fully realized due to the lack of datasets and heavily biased training tasks. Leveraging the rich ASR datasets, previous approaches have used Large Language Models (LLMs) to continue the linguistic information of speech to construct speech instruction datasets. Yet, due to the gap between LLM-generated results and real human responses, the continuation methods further amplify these shortcomings. Given the high costs of collecting and annotating speech instruction datasets by humans, using speech synthesis to construct large-scale speech instruction datasets has become a balanced and robust alternative.…
Peer Reviews
Decision · Submitted to ICLR 2025
- Novel and practical solution to an important problem in speech instruction dataset creation.
- Comprehensive evaluation across multiple datasets and metrics.
- Detailed ablation studies validating each component.
- Cost-effective compared to human annotation.
- Strong experimental results showing clear improvements in both data quality and downstream task performance.
- Good technical novelty in combining multiple LLMs and agents for robust performance.
- Limited analysis of failure cases and error patterns.
- No direct comparison with other query rewriting methods from adjacent domains.
- Validation relies heavily on embedding similarity; could benefit from human evaluation.
- Parameter sensitivity analysis missing (e.g., impact of different thresholds).
- Scalability and computational costs not thoroughly discussed.
The paper presents a detailed infrastructure for generating and filtering synthetic speech instruction datasets. Results show consistent improvement over naive TTS generation. Across multiple datasets, the generated speech is of higher quality as measured by WER, SIM and PASS, and the technique improves downstream performance on NarrativeQA.
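For readers unfamiliar with the WER metric named in this review, the following is a minimal sketch of the standard word error rate: word-level edit distance divided by the reference length. The function name `word_error_rate` and the lowercase whitespace tokenization are assumptions made for illustration; the text normalization and ASR system used in the paper are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Illustrative only: tokenization (lowercase + whitespace split) is an
    assumption, not the paper's normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

In a synthetic-data setting, `reference` would be the instruction text sent to the TTS model and `hypothesis` the ASR transcript of the synthesized audio, so a lower value suggests the speech preserved the text more faithfully.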
One of the main issues with this paper is the main result in Table 4 and Section 5.3. The authors only show that performance improves for exactly one model and on exactly one dataset. This is not sufficient evidence to claim that the technique generalizes to other downstream tasks. This is very strange, since the authors clearly had all of these other datasets available that they could have used for evaluation. This table is also not very well explained; there is no experimental setup section about h
1. Generating high-quality synthetic speech instruction data is a valuable task for the community, yet it remains relatively underexplored.
2. The paper introduces a versatile framework that integrates existing LLMs, ASR, and text embedding methods in a plug-and-play manner, requiring no additional training aside from the knowledge fusion component.
3. Results from automatic metrics indicate that the proposed framework enhances the quality and usability of synthetic data.
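The reviews describe validation based on ASR and text embedding similarity, with thresholds. The sketch below shows one way such a check could produce a usability percentage: compare each instruction text with the ASR transcript of its synthesized audio and count the pairs whose similarity clears a threshold. Everything here is a hypothetical illustration: the bag-of-words `bow_embedding` stands in for a real text embedding model, and the names `cosine_similarity` and `usable_fraction` and the default threshold of 0.9 are assumptions, not values from the paper.

```python
import math
from collections import Counter


def bow_embedding(text: str) -> Counter:
    # Toy stand-in for a learned text embedding (assumption for illustration).
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def usable_fraction(pairs, threshold: float = 0.9) -> float:
    """Fraction of (instruction_text, asr_transcript) pairs judged usable.

    A pair counts as usable when the similarity of the two texts is at
    least `threshold` (a hypothetical cutoff, not the paper's setting).
    """
    if not pairs:
        return 0.0
    passed = sum(
        1
        for text, transcript in pairs
        if cosine_similarity(bow_embedding(text), bow_embedding(transcript)) >= threshold
    )
    return passed / len(pairs)
```

Sweeping `threshold` over a range and plotting `usable_fraction` is one simple way to run the kind of parameter sensitivity analysis a reviewer asks for.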
Quality
1. The paper begins by highlighting the gap between LLM-generated responses and human responses, but it’s unclear if their framework effectively addresses this. Human responses often include disfluencies or may involve mid-sentence question reformulations. A more rigorous human-in-the-loop evaluation would be beneficial to assess if fine-tuning on their synthetic speech data genuinely enhances the speech language model's ability to follow "spoken" human instructions.
2. The quality of s
