FANS: Fusing ASR and NLU for on-device SLU
Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow

TL;DR
FANS is an end-to-end SLU model that infers intent and slot information directly from audio input, without separate transcription and understanding stages, and improves accuracy over existing models.
Contribution
The paper introduces FANS, a flexible neural architecture that fuses ASR and NLU into a single model for on-device spoken language understanding.
Findings
FANS reduces the intent classification error rate (ICER) by 30% on in-house data.
FANS reduces the interpretation error rate (IRER) by 7% on in-house data.
FANS outperforms state-of-the-art models on public SLU datasets.
Abstract
Spoken language understanding (SLU) systems translate voice input commands to semantics, which are encoded as an intent and pairs of slot tags and values. Most current SLU systems deploy a cascade of two neural models, where the first one maps the input audio to a transcript (ASR) and the second predicts the intent and slots from the transcript (NLU). In this paper, we introduce FANS, a new end-to-end SLU model that fuses an ASR audio encoder to a multi-task NLU decoder to infer the intent, slot tags, and slot values directly from a given input audio, obviating the need for transcription. FANS consists of a shared audio encoder and three decoders, two of which are seq-to-seq decoders that predict non-null slot tags and slot values in parallel and in an auto-regressive manner. FANS' neural encoder and decoder architectures are flexible, which allows us to leverage different combinations of…
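The output structure the abstract describes (one intent plus pairs of slot tags and slot values, with the tags and values coming from two separate sequence decoders) can be illustrated with a minimal Python sketch. This is not the authors' code: the names `SemanticFrame` and `assemble_frame`, the example utterance, and the labels `PlayMusic`, `Genre`, and `Room` are hypothetical, and the sketch assumes both decoders emit exactly one entry per non-null slot in the same order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SemanticFrame:
    """SLU target: one intent plus (slot tag, slot value) pairs."""
    intent: str
    slots: List[Tuple[str, str]]


def assemble_frame(intent: str,
                   slot_tags: List[str],
                   slot_values: List[str]) -> SemanticFrame:
    """Pair the two decoder output sequences position by position.

    Assumption (not from the paper): both sequences contain one entry
    per non-null slot and are aligned in order.
    """
    if len(slot_tags) != len(slot_values):
        raise ValueError("slot tag and slot value sequences must have equal length")
    return SemanticFrame(intent=intent, slots=list(zip(slot_tags, slot_values)))


# Hypothetical decoder outputs for the utterance "play jazz in the kitchen".
frame = assemble_frame(
    intent="PlayMusic",
    slot_tags=["Genre", "Room"],
    slot_values=["jazz", "kitchen"],
)
print(frame)
```

Only the non-null slots appear in the frame, which matches the abstract's statement that the sequence decoders predict non-null slot tags and slot values; how the real model aligns multi-word values with tags is not covered by this sketch.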
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
