TL;DR
This paper introduces an agentic framework for interactive speech recognition that uses large language models as semantic evaluators and interaction agents, improving semantic accuracy and human-like correction capabilities.
Contribution
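It presents a novel LLM-based semantic evaluation metric and an interactive agent framework for iterative refinement in ASR, addressing two underexplored gaps: WER's insensitivity to sentence-level meaning and the lack of systematic study of interactive correction.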
Findings
Semantic-aware evaluation with an LLM judge assesses recognition quality beyond token-level accuracy.
LLM-driven interaction enables multi-turn iterative refinement.
Experiments show enhanced semantic fidelity and correction capability.
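The multi-turn refinement idea in the findings above can be sketched as a simple loop. This is only an illustration under stated assumptions: the summary does not describe the paper's actual agent design, so `refine_transcript`, `judge`, `get_feedback`, and `apply_correction` are hypothetical names, and the toy lambdas in the usage part stand in for real LLM and ASR calls.

```python
from typing import Callable, List


def refine_transcript(
    initial: str,
    judge: Callable[[str], bool],
    get_feedback: Callable[[str], str],
    apply_correction: Callable[[str, str], str],
    max_turns: int = 3,
) -> List[str]:
    """Return the transcript history across correction turns.

    All callables are hypothetical placeholders:
      judge            -- True when the transcript is semantically acceptable
      get_feedback     -- a (simulated) user's correction for the transcript
      apply_correction -- revises the transcript given that correction
    """
    history = [initial]
    for _ in range(max_turns):
        current = history[-1]
        if judge(current):
            break  # accepted, so no further turns are needed
        feedback = get_feedback(current)
        history.append(apply_correction(current, feedback))
    return history


# Toy usage with stand-in functions (no model is called here).
target = "book a flight to austin"
history = refine_transcript(
    "book a flight to boston",
    judge=lambda t: t == target,
    get_feedback=lambda t: "I said Austin, not Boston",
    apply_correction=lambda t, fb: t.replace("boston", "austin"),
)
print(history)
```

Capping the loop with `max_turns` is a design choice of this sketch, so that a transcript the judge never accepts cannot cause unbounded interaction.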
Abstract
Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction, an essential component of human communication, has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement…
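To make the abstract's point about WER concrete, the following minimal sketch computes WER with the standard word-level edit distance (substitutions, deletions, and insertions divided by the number of reference words). The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (made up for this sketch).
reference = "please do not send the payment today"
minor_error = "please do not send a payment today"    # "the" -> "a"
major_error = "please do now send the payment today"  # "not" -> "now"

wer_minor = word_error_rate(reference, minor_error)
wer_major = word_error_rate(reference, major_error)
print(wer_minor, wer_major)
```

Both hypotheses contain a single substitution in seven words, so they receive the same WER even though only the second one reverses the meaning of the request. That is the kind of sentence-level difference a semantic-aware judge is meant to capture.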
