Bidirectional Representations for Low Resource Spoken Language Understanding
Quentin Meeus, Marie-Francine Moens, Hugo Van hamme

TL;DR
This paper introduces a bidirectional speech representation model trained with a masked language modelling objective and class attention. It improves low-resource spoken language understanding and achieves state-of-the-art results on the Fluent Speech Command dataset.
Contribution
It proposes a novel bidirectional encoding approach for speech, leveraging masked language modeling and class attention for better low-resource spoken language understanding.
Findings
Before fine-tuning, the encodings outperform comparable models on multiple datasets
Fine-tuning the top layers gives state-of-the-art performance on the Fluent Speech Command dataset
Effective in low-data regimes
Abstract
Most spoken language understanding systems use a pipeline approach composed of an automatic speech recognition interface and a natural language understanding module. This approach forces hard decisions when converting continuous inputs into discrete language symbols. Instead, we propose a representation model to encode speech in rich bidirectional encodings that can be used for downstream tasks such as intent prediction. The approach uses a masked language modelling objective to learn the representations, and thus benefits from both the left and right contexts. We show that the performance of the resulting encodings before fine-tuning is better than comparable models on multiple datasets, and that fine-tuning the top layers of the representation model improves the current state of the art on the Fluent Speech Command dataset, also in a low-data regime, when a limited amount of labelled…
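To make the masked language modelling objective mentioned in the abstract concrete, here is a minimal toy sketch over a sequence of discrete symbols. It is not the authors' implementation: the mask rate, the `MASK_ID` and `IGNORE` constants, the function names, and the uniform stand-in predictor are all assumptions made for illustration, and this summary does not specify the model's inputs or architecture. The point it shows is that the loss is computed only at masked positions, so a model must use context on both sides of each gap to predict it.

```python
import math
import random

MASK_ID = 0   # hypothetical id reserved for the mask symbol
IGNORE = -1   # target value for positions that are not scored


def mask_tokens(tokens, rate=0.15, rng=None):
    """Replace a random subset of tokens with MASK_ID.

    Returns (masked_inputs, targets); targets holds the original token at
    masked positions and IGNORE elsewhere.
    """
    rng = rng or random.Random()
    # Always mask at least one position so the loss is defined.
    n_mask = max(1, round(rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_ID if i in positions else t for i, t in enumerate(tokens)]
    targets = [t if i in positions else IGNORE for i, t in enumerate(tokens)]
    return masked, targets


def masked_cross_entropy(probs, targets):
    """Average negative log-likelihood over masked positions only.

    probs[i] is a list of per-symbol probabilities predicted at position i.
    """
    losses = [-math.log(p[t]) for p, t in zip(probs, targets) if t != IGNORE]
    return sum(losses) / len(losses)


tokens = [5, 3, 8, 2, 7, 4, 9, 6, 1, 3]
masked, targets = mask_tokens(tokens, rate=0.2, rng=random.Random(0))

# A uniform predictor over 10 symbols stands in for the encoder output.
uniform = [[0.1] * 10 for _ in tokens]
loss = masked_cross_entropy(uniform, targets)
```

With the uniform stand-in, the loss equals log(10) regardless of which positions are masked. A trained bidirectional model would lower it by using the unmasked symbols to the left and right of each masked position.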
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Class Attention
