Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model
Da-Rong Liu, Chunxi Liu, Frank Zhang, Gabriel Synnaeve, Yatharth Saraf, Geoffrey Zweig

TL;DR
This paper enhances ASR lattice rescoring by integrating video metadata through attention-based contextual vectors and a hybrid pointer network, improving recognition performance in social media videos.
Contribution
It introduces a novel hybrid pointer network approach and an attention-based method to incorporate video metadata into ASR lattice rescoring, which was not previously explored.
Findings
Both methods improve ASR performance by leveraging video metadata.
The hybrid pointer network explicitly models word probabilities from metadata.
Experiments on both language modeling and ASR tasks show performance improvements.
Abstract
Videos uploaded on social media are often accompanied by textual descriptions. In building automatic speech recognition (ASR) systems for videos, we can exploit the contextual information provided by such video metadata. In this paper, we explore ASR lattice rescoring by selectively attending to the video descriptions. First, we use an attention-based method to extract contextual vector representations of video metadata, and use these representations as part of the inputs to a neural language model during lattice rescoring. Second, we propose a hybrid pointer network approach to explicitly interpolate the word probabilities of the word occurrences in metadata. We perform experimental evaluations on both language modeling and ASR tasks, and demonstrate that both proposed methods provide performance improvements by selectively leveraging the video metadata.
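The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes plain dot-product attention, a fixed scalar gate (a trained model would more plausibly predict the gate from the language model state), and hypothetical names such as attention_context and hybrid_pointer_distribution. It shows only the general shape of an attention-derived context vector and of a pointer-style interpolation between the language model softmax and a copy distribution over metadata words.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_context(query, meta_embeddings):
    """Dot-product attention over metadata word embeddings (assumed form).

    query: (D,) language model state; meta_embeddings: (M, D).
    Returns the context vector (D,) and the attention weights (M,).
    """
    scores = meta_embeddings @ query
    weights = softmax(scores)
    context = weights @ meta_embeddings
    return context, weights


def hybrid_pointer_distribution(vocab_probs, attn_weights, meta_word_ids, gate):
    """Interpolate the LM softmax with a copy distribution over metadata words.

    gate is a scalar in [0, 1]; it is fixed here purely for illustration.
    """
    pointer_probs = np.zeros_like(vocab_probs)
    # Accumulate attention mass per word id, so repeated words add up.
    np.add.at(pointer_probs, meta_word_ids, attn_weights)
    return gate * pointer_probs + (1.0 - gate) * vocab_probs


# Toy example: a 10-word vocabulary and a 3-word video description.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
meta_word_ids = np.array([2, 7, 2])           # word 2 occurs twice
embeddings = rng.normal(size=(vocab_size, dim))
query = rng.normal(size=dim)                  # stand-in for the LM hidden state
vocab_probs = softmax(rng.normal(size=vocab_size))

context, weights = attention_context(query, embeddings[meta_word_ids])
mixed = hybrid_pointer_distribution(vocab_probs, weights, meta_word_ids, gate=0.3)
```

In this sketch, context plays the role of the contextual vector fed to the neural language model as an extra input, while mixed is the interpolated next-word distribution that would be used to score lattice arcs. Words absent from the metadata keep only their scaled softmax probability, and words present in it receive additional mass from the attention weights.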
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Sigmoid Activation · Softmax · Tanh Activation · Long Short-Term Memory · Pointer Network
