Practical token pruning for foundation models in few-shot conversational virtual assistant systems
Haode Qi, Cheng Qian, Jian Ni, Pratyush Singh, Reza Fazeli, Gengyu Wang, Zhongzheng Shu, Eric Wayne, Juergen Bross

TL;DR
This paper presents a practical token pruning method for transformer models in enterprise virtual assistants, significantly improving inference speed in few-shot intent classification without sacrificing accuracy.
Contribution
It introduces a multi-task adaptation approach for dynamic token pruning that enhances inference efficiency without requiring task-specific training.
Findings
Achieves state-of-the-art results in few-shot intent classification
Improves inference speed of transformer models for longer inputs
Maintains high accuracy with reduced computational cost
Abstract
In an enterprise Virtual Assistant (VA) system, intent classification is the crucial component that determines how a user input is handled based on what the user wants. The VA system is expected to be a cost-efficient SaaS service with low training and inference time while achieving high accuracy even with a small number of training samples. We pretrain a transformer-based sentence embedding model with a contrastive learning objective and use the model's embeddings as features when training intent classification models. Our approach achieves state-of-the-art results in few-shot scenarios and outperforms other commercial solutions on popular intent classification benchmarks. However, generating features with a transformer-based model increases inference time, especially for longer user inputs, due to the quadratic runtime of the transformer's attention…
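The abstract is cut off before it describes the pruning procedure, so the following is only a generic illustration of why dropping tokens helps, not the authors' method. It is a minimal sketch assuming a single attention head, random token vectors, and a hypothetical `prune_tokens` helper that keeps the tokens receiving the most attention. Because the attention matrix has one entry per token pair, keeping 8 of 32 tokens shrinks it from 1024 entries to 64.

```python
import math
import random


def attention_weights(tokens):
    """Row-softmaxed dot-product self-attention weights (an n x n matrix)."""
    d = len(tokens[0])
    weights = []
    for q in tokens:
        # One score per (query, key) pair: this is the quadratic part.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def prune_tokens(tokens, keep):
    """Hypothetical helper: keep the `keep` tokens that receive the most
    attention (largest column sums), preserving their original order."""
    w = attention_weights(tokens)
    n = len(tokens)
    importance = [sum(w[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: importance[j], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[j] for j in kept], kept


random.seed(0)
# Assumed toy input: 32 tokens with 16-dimensional random features.
tokens = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]
pruned, kept_idx = prune_tokens(tokens, keep=8)

# Ratio of attention-matrix sizes before and after pruning.
cost_ratio = (len(tokens) ** 2) / (len(pruned) ** 2)
print(len(pruned), cost_ratio)
```

The column-sum importance score is one common heuristic chosen here for brevity; a production system would typically compute importance inside the model's layers and prune progressively rather than in a single step.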
Taxonomy
Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Learning
