Practical token pruning for foundation models in few-shot conversational virtual assistant systems

Haode Qi; Cheng Qian; Jian Ni; Pratyush Singh; Reza Fazeli; Gengyu Wang; Zhongzheng Shu; Eric Wayne; Juergen Bross

arXiv:2408.11799 · cs.CL · August 22, 2024

TL;DR

This paper presents a practical token pruning method for transformer models in enterprise virtual assistants, significantly improving inference speed in few-shot intent classification without sacrificing accuracy.

Contribution

It introduces a multi-task adaptation approach for dynamic token pruning that enhances inference efficiency without requiring task-specific training.

Findings

1. Achieves state-of-the-art results in few-shot intent classification
2. Improves inference speed of transformer models for longer inputs
3. Maintains high accuracy with reduced computational cost

Abstract

In an enterprise Virtual Assistant (VA) system, intent classification is the crucial component that determines how a user input is handled based on what the user wants. The VA system is expected to be a cost-efficient SaaS service with low training and inference time while achieving high accuracy even with a small number of training samples. We pretrain a transformer-based sentence embedding model with a contrastive learning objective and use the model's embeddings as features when training intent classification models. Our approach achieves state-of-the-art results for few-shot scenarios and performs better than other commercial solutions on popular intent classification benchmarks. However, generating features via a transformer-based model increases the inference time, especially for longer user inputs, due to the quadratic runtime of the transformer's attention…
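To make the cost argument in the abstract concrete, here is a minimal sketch of attention-based token pruning in plain Python. It is not the paper's method (the abstract is truncated before the approach is described); it only illustrates the general idea that dropping low-importance tokens shrinks the quadratic attention matrix. The function names, the `keep_ratio` parameter, the use of token vectors directly as queries and keys, and the rule of always keeping the first token are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_probs(queries, keys):
    """Scaled dot-product attention weights, one row per query token."""
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def prune_tokens(tokens, probs, keep_ratio=0.5):
    """Keep the tokens that receive the most attention, preserving order.

    Hypothetical helper: importance of token j is the average attention
    it receives from all query tokens.
    """
    n = len(tokens)
    n_keep = max(1, math.ceil(n * keep_ratio))
    importance = [sum(row[j] for row in probs) / n for j in range(n)]
    # Assumption: the first token plays a [CLS]-like role and is never pruned.
    importance[0] = math.inf
    ranked = sorted(range(n), key=lambda j: importance[j], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [tokens[j] for j in keep], keep


if __name__ == "__main__":
    rng = random.Random(0)
    # Eight random 4-dimensional "token" vectors stand in for hidden states.
    tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
    probs = attention_probs(tokens, tokens)
    kept, idx = prune_tokens(tokens, probs, keep_ratio=0.5)
    # The attention matrix for the next layer shrinks from n*n to k*k entries.
    print("kept indices:", idx)
    print("attention entries:", len(tokens) ** 2, "->", len(kept) ** 2)
```

Because attention cost grows with the square of the sequence length, keeping half the tokens cuts the number of attention entries in later layers to roughly a quarter, which is why pruning pays off most on longer inputs.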

Taxonomy

Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques

Methods: Softmax · travel james · Attention Is All You Need · Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Learning