Extreme Model Compression for On-device Natural Language Understanding
Kanthashree Mysore Sathyendra, Samridhi Choudhary, Leah Nicolich-Henkin

TL;DR
This paper introduces an end-to-end, task-aware compression method for neural NLU models that reduces model size by 97.4% with less than 3.7% degradation in predictive performance on a large-scale, commercial NLU system.
Contribution
It presents a task-aware technique that compresses word embeddings jointly with NLU task learning, outperforming a range of baselines and making the models suitable for execution on resource-constrained devices.
Findings
Achieves a 97.4% compression rate with less than 3.7% degradation in predictive performance.
The signal from the downstream NLU task is important for compressing effectively with minimal degradation.
Outperforms a range of baselines on a large-scale, commercial NLU system with a varied set of intents.
Abstract
In this paper, we propose and experiment with techniques for extreme compression of neural natural language understanding (NLU) models, making them suitable for execution on resource-constrained devices. We propose a task-aware, end-to-end compression approach that performs word-embedding compression jointly with NLU task learning. We show our results on a large-scale, commercial NLU system trained on a varied set of intents with huge vocabulary sizes. Our approach outperforms a range of baselines and achieves a compression rate of 97.4% with less than 3.7% degradation in predictive performance. Our analysis indicates that the signal from the downstream task is important for effective compression with minimal degradation in performance.
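To make the idea of compressing embeddings jointly with the task concrete, here is a minimal sketch. The abstract does not say which compression mechanism the authors use, so this example assumes a low-rank factorization of the embedding table purely for illustration. The class `FactorizedIntentModel`, the function `compression_rate`, and all sizes and hyperparameters are hypothetical. The point it shows is that the intent-classification loss updates the compressed embedding factors directly, rather than the embeddings being compressed after training.

```python
import numpy as np


def compression_rate(original_params, compressed_params):
    """Fraction of parameters removed (0.974 would mean 97.4% smaller)."""
    return 1.0 - compressed_params / original_params


class FactorizedIntentModel:
    """Toy intent classifier over a low-rank embedding table.

    Assumption for illustration: the V x d embedding table is stored as
    A (V x r) @ B (r x d) with r much smaller than d.
    """

    def __init__(self, vocab_size, dim, rank, num_intents, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.1, size=(vocab_size, rank))
        self.B = rng.normal(scale=0.1, size=(rank, dim))
        self.W = rng.normal(scale=0.1, size=(dim, num_intents))

    def embedding_params(self):
        return self.A.size + self.B.size

    def train_step(self, token_ids, intent, lr=0.1):
        """One SGD step on a single utterance; returns the loss before the update."""
        ids = np.asarray(token_ids)
        a_bar = self.A[ids].mean(axis=0)   # mean-pooled compressed codes, shape (rank,)
        h = a_bar @ self.B                 # utterance vector, shape (dim,)
        logits = h @ self.W                # shape (num_intents,)
        logits -= logits.max()             # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        loss = -np.log(probs[intent])

        # Backpropagate the task loss into the classifier AND both embedding factors.
        d_logits = probs.copy()
        d_logits[intent] -= 1.0
        d_W = np.outer(h, d_logits)
        d_h = self.W @ d_logits
        d_B = np.outer(a_bar, d_h)
        d_a_bar = self.B @ d_h

        self.W -= lr * d_W
        self.B -= lr * d_B
        # add.at accumulates correctly when a token id repeats in the utterance.
        np.add.at(self.A, ids, -lr * d_a_bar / len(ids))
        return float(loss)


# Toy usage: three utterances (as token ids), one intent each.
model = FactorizedIntentModel(vocab_size=50, dim=8, rank=4, num_intents=3)
data = [([1, 2, 3], 0), ([4, 5, 6], 1), ([7, 8, 9], 2)]
epoch_losses = [sum(model.train_step(x, y) for x, y in data) for _ in range(200)]

full_table = 50 * 8
print("embedding compression rate:", compression_rate(full_table, model.embedding_params()))
print("first / last epoch loss:", epoch_losses[0], epoch_losses[-1])
```

With such a small vocabulary the factorization saves little, because the savings grow with vocabulary size. The numbers printed here are unrelated to the 97.4% figure reported in the paper, which comes from the authors' own method and data.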
