BERT for Joint Intent Classification and Slot Filling

Qian Chen; Zhu Zhuo; Wen Wang

arXiv:1902.10909 · cs.CL · March 1, 2019

TL;DR

This paper fine-tunes BERT for joint intent classification and slot filling in natural language understanding. Compared with the slot-gated baseline, sentence-level semantic frame accuracy rises from 75.5% to 92.8% on Snips and from 82.6% to 88.2% on ATIS.

Contribution

It introduces a BERT-based joint model for intent classification and slot filling, demonstrating substantial improvements over prior attention-based and slot-gated models.

Findings

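01 Intent classification accuracy reaches 98.6% on Snips (from 97.0%) and 97.5% on ATIS (from 94.1%).

02 Slot filling F1 reaches 97.0% on Snips (from 88.8%) and 96.1% on ATIS (from 95.2%).

03 Sentence-level semantic frame accuracy reaches 92.8% on Snips (from 75.5%) and 88.2% on ATIS (from 82.6%).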

Tables

Table 1: An example from user query to semantic frame.

Query	Find me a movie by Steven Spielberg
Frame	Intent	find_movie
	Slot	genre = movie
	Slot	directed_by = Steven Spielberg

Table 2: NLU performance on Snips and ATIS datasets. The metrics are intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy (%). The results for the first group of models are cited from Goo et al. (2018).

Models	Snips Intent	Snips Slot	Snips Sent	ATIS Intent	ATIS Slot	ATIS Sent
RNN-LSTM (Hakkani-Tür et al., 2016)	96.9	87.3	73.2	92.6	94.3	80.7
Atten.-BiRNN (Liu and Lane, 2016)	96.7	87.8	74.1	91.1	94.2	78.9
Slot-Gated (Goo et al., 2018)	97.0	88.8	75.5	94.1	95.2	82.6
Joint BERT	98.6	97.0	92.8	97.5	96.1	88.2
Joint BERT + CRF	98.4	96.7	92.6	97.9	96.0	88.6

Table 3: Ablation analysis for the Snips dataset. Intent is intent classification accuracy (%) and Slot is slot filling F1 (%).

Model	Epochs	Intent	Slot
Joint BERT	30	98.6	97.0
No joint	30	98.0	95.8
Joint BERT	40	98.3	96.4
Joint BERT	20	99.0	96.0
Joint BERT	10	98.6	96.5
Joint BERT	5	98.0	95.1
Joint BERT	1	98.0	93.3

Table 4: A case in the Snips dataset.

Query	need to see mother joan of the angels in one second
Gold, predicted by joint BERT correctly
Intent	SearchScreeningEvent
Slots	O O O B-movie-name I-movie-name I-movie-name I-movie-name I-movie-name B-timeRange I-timeRange I-timeRange
Predicted by the slot-gated model (Goo et al., 2018)
Intent	BookRestaurant
Slots	O O O B-object-name I-object-name I-object-name I-object-name I-object-name B-timeRange I-timeRange I-timeRange

Equations

$$y^{i}=\mathrm{softmax}({\bf{W}}^{i}{\bm{h}}_{1}+{\bm{b}}^{i})$$

$$y^{s}_{n}=\mathrm{softmax}({\bf{W}}^{s}{\bm{h}}_{n}+{\bm{b}}^{s}),\quad n\in 1\dots N$$

$$p(y^{i},y^{s}|{\bm{x}})=p(y^{i}|{\bm{x}})\prod_{n=1}^{N}p(y^{s}_{n}|{\bm{x}})$$

Taxonomy

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax

Full text

Speech Lab, DAMO Academy, Alibaba Group

{tanqing.cq, zhuozhu.zz, w.wang}@alibaba-inc.com

Ongoing work.

Abstract

Intent classification and slot filling are two essential tasks for natural language understanding. They often suffer from small-scale human-labeled training data, resulting in poor generalization capability, especially for rare words. Recently a new language representation model, BERT (Bidirectional Encoder Representations from Transformers), facilitates pre-training deep bidirectional representations on large-scale unlabeled corpora, and has created state-of-the-art models for a wide variety of natural language processing tasks after simple fine-tuning. However, there has not been much effort on exploring BERT for natural language understanding. In this work, we propose a joint intent classification and slot filling model based on BERT. Experimental results demonstrate that our proposed model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on several public benchmark datasets, compared to the attention-based recurrent neural network models and slot-gated models.

1 Introduction

In recent years, a variety of smart speakers, such as Google Home, Amazon Echo, and Tmall Genie, have been deployed with great success. They facilitate goal-oriented dialogues and help users accomplish their tasks through voice interactions. Natural language understanding (NLU) is critical to the performance of goal-oriented spoken dialogue systems. NLU typically includes the intent classification and slot filling tasks, aiming to form a semantic parse for user utterances. Intent classification focuses on predicting the intent of the query, while slot filling extracts semantic concepts. Table 1 shows an example of intent classification and slot filling for the user query “Find me a movie by Steven Spielberg”.

Intent classification is a classification problem that predicts the intent label $y^{i}$, and slot filling is a sequence labeling task that tags the input word sequence $x=(x_{1},x_{2},\cdots,x_{T})$ with the slot label sequence $y^{s}=(y^{s}_{1},y^{s}_{2},\cdots,y^{s}_{T})$. Recurrent neural network (RNN) based approaches, particularly gated recurrent unit (GRU) and long short-term memory (LSTM) models, have achieved state-of-the-art performance for intent classification and slot filling. Recently, several joint learning methods for intent classification and slot filling were proposed to exploit and model the dependencies between the two tasks and improve the performance over independent models (Guo et al., 2014; Hakkani-Tür et al., 2016; Liu and Lane, 2016; Goo et al., 2018). Prior work has shown that the attention mechanism (Bahdanau et al., 2014) helps RNNs deal with long-range dependencies. Hence, attention-based joint learning methods were proposed and achieved state-of-the-art performance for joint intent classification and slot filling (Liu and Lane, 2016; Goo et al., 2018).

Lack of human-labeled data for NLU and other natural language processing (NLP) tasks results in poor generalization capability. To address the data sparsity challenge, a variety of techniques were proposed for training general purpose language representation models using an enormous amount of unannotated text, such as ELMo (Peters et al., 2018) and Generative Pre-trained Transformer (GPT) (Radford et al., 2018). Pre-trained models can be fine-tuned on NLP tasks and have achieved significant improvement over training on task-specific annotated data. More recently, a pre-training technique, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), was proposed and has created state-of-the-art models for a wide variety of NLP tasks, including question answering (SQuAD v1.1), natural language inference, and others.

However, there has not been much effort in exploring BERT for NLU. The technical contributions of this work are twofold: 1) we explore the BERT pre-trained model to address the poor generalization capability of NLU; 2) we propose a joint intent classification and slot filling model based on BERT and demonstrate that the proposed model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on several public benchmark datasets, compared to attention-based RNN models and slot-gated models.

2 Related work

Deep learning models have been extensively explored in NLU. According to whether intent classification and slot filling are modeled separately or jointly, we categorize NLU models into independent modeling approaches and joint modeling approaches.

Approaches for intent classification include CNN (Kim, 2014; Zhang et al., 2015), LSTM (Ravuri and Stolcke, 2015), attention-based CNN (Zhao and Wu, 2016), hierarchical attention networks (Yang et al., 2016), adversarial multi-task learning (Liu et al., 2017), and others. Approaches for slot filling include CNN (Vu, 2016), deep LSTM (Yao et al., 2014), RNN-EM (Peng et al., 2015), encoder-labeler deep LSTM (Kurata et al., 2016), and joint pointer and attention (Zhao and Feng, 2018), among others.

Joint modeling approaches include CNN-CRF (Xu and Sarikaya, 2013), RecNN (Guo et al., 2014), joint RNN-LSTM (Hakkani-Tür et al., 2016), attention-based BiRNN (Liu and Lane, 2016), and slot-gated attention-based model (Goo et al., 2018).

3 Proposed Approach

We first briefly describe the BERT model (Devlin et al., 2018) and then introduce the proposed joint model based on BERT. Figure 1 illustrates a high-level view of the proposed model.

3.1 BERT

The model architecture of BERT is a multi-layer bidirectional Transformer encoder based on the original Transformer model (Vaswani et al., 2017). The input representation is the sum of the WordPiece embeddings (Wu et al., 2016), positional embeddings, and segment embeddings. For single-sentence classification and tagging tasks, all tokens share the same segment embedding, so it carries no discriminative information. A special classification embedding ([CLS]) is inserted as the first token and a special token ([SEP]) is added as the final token. Given an input token sequence ${\bm{x}}=(x_{1},\dots,x_{T})$ , the output of BERT is ${\bf{H}}=({\bm{h}}_{1},\dots,{\bm{h}}_{T})$ .
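As a rough illustration of the input format described above, the sketch below wraps an already tokenized utterance in [CLS] and [SEP] and pads it to a fixed length. The function name `build_input`, the `[PAD]` token, the attention mask, and the all-zero segment ids are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def build_input(tokens: List[str], max_len: int) -> Tuple[List[str], List[int], List[int]]:
    """Wrap sub-tokens in [CLS]/[SEP], truncate if needed, and pad to max_len."""
    # Reserve two positions for the special tokens.
    body = tokens[: max_len - 2]
    seq = [CLS] + body + [SEP]
    # 1 marks real tokens, 0 marks padding (assumed convention).
    attention_mask = [1] * len(seq)
    pad = max_len - len(seq)
    seq = seq + [PAD] * pad
    attention_mask = attention_mask + [0] * pad
    # Single-sentence input: every position shares segment id 0.
    segment_ids = [0] * max_len
    return seq, attention_mask, segment_ids


tokens = "find me a movie by steven spielberg".split()
seq, mask, seg = build_input(tokens, max_len=12)
print(seq)
```

With this layout the hidden state at position 0 corresponds to [CLS], and the remaining positions correspond to the utterance tokens followed by [SEP] and padding.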

The BERT model is pre-trained on large-scale unlabeled text with two strategies, namely masked language modeling and next sentence prediction. The pre-trained BERT model provides a powerful context-dependent sentence representation and can be used for various target tasks, e.g., intent classification and slot filling, through the fine-tuning procedure, similar to how it is used for other NLP tasks.

3.2 Joint Intent Classification and Slot Filling

BERT can be easily extended to a joint intent classification and slot filling model. Based on the hidden state of the first special token ([CLS]), denoted ${\bm{h}}_{1}$ , the intent is predicted as:

$$y^{i}=\mathrm{softmax}({\bf{W}}^{i}{\bm{h}}_{1}+{\bm{b}}^{i})\,.$$

For slot filling, we feed the final hidden states of other tokens $\bm{h}_{2},\dots,\bm{h}_{T}$ into a softmax layer to classify over the slot filling labels. To make this procedure compatible with the WordPiece tokenization, we feed each tokenized input word into a WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the softmax classifier.

$$y^{s}_{n}=\mathrm{softmax}({\bf{W}}^{s}{\bm{h}}_{n}+{\bm{b}}^{s})\,,\quad n\in 1\dots N\,,$$

where ${\bm{h}}_{n}$ is the hidden state corresponding to the first sub-token of word $x_{n}$ .
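The first-sub-token rule can be sketched as follows. The helper `align_first_subtokens` and the toy tokenizer `toy_wordpiece` are hypothetical stand-ins: a real WordPiece tokenizer uses a learned vocabulary, so the actual splits would differ. Only the bookkeeping of which sub-token represents each word is the point here.

```python
from typing import Callable, List, Tuple


def toy_wordpiece(word: str) -> List[str]:
    """Hypothetical tokenizer: split words longer than 4 characters in two."""
    if len(word) <= 4:
        return [word]
    return [word[:4], "##" + word[4:]]


def align_first_subtokens(
    words: List[str], tokenize: Callable[[str], List[str]]
) -> Tuple[List[str], List[int]]:
    """Tokenize word by word and record the index of each word's first sub-token."""
    sub_tokens: List[str] = []
    first_index: List[int] = []
    for word in words:
        first_index.append(len(sub_tokens))
        sub_tokens.extend(tokenize(word))
    return sub_tokens, first_index


words = "find me a movie by steven spielberg".split()
sub_tokens, first_index = align_first_subtokens(words, toy_wordpiece)
# Indices are relative to the sub-token list; add 1 if [CLS] is prepended.
print(sub_tokens)
print(first_index)
```

Only the hidden states at `first_index` would be passed to the slot classifier, so the number of slot predictions equals the number of words rather than the number of sub-tokens.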

To jointly model intent classification and slot filling, the objective is formulated as:

$$p(y^{i},y^{s}|{\bm{x}})=p(y^{i}|{\bm{x}})\prod_{n=1}^{N}p(y^{s}_{n}|{\bm{x}})\,.$$

The learning objective is to maximize the conditional probability $p(y^{i},y^{s}|{\bm{x}})$ . The model is fine-tuned end-to-end via minimizing the cross-entropy loss.
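Taking the negative log of the factorized probability turns the product into a sum: one cross-entropy term for the intent plus one per slot position. The sketch below computes that quantity for a single utterance in plain Python. The function names and the toy logits are illustrative, and the unweighted sum of the two losses is an assumption, since the text does not say whether the terms are weighted or averaged.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_nll(
    intent_logits: List[float],
    intent_gold: int,
    slot_logits: List[List[float]],
    slot_gold: List[int],
) -> float:
    """Negative log of p(y^i | x) * prod_n p(y^s_n | x) for one utterance."""
    loss = -math.log(softmax(intent_logits)[intent_gold])
    for logits, gold in zip(slot_logits, slot_gold):
        loss -= math.log(softmax(logits)[gold])
    return loss


# Toy example: 2 intents, 2 slot labels, 2 words (all values made up).
loss = joint_nll([2.0, 0.0], 0, [[0.0, 0.0], [0.0, 0.0]], [1, 0])
print(loss)
```

Minimizing this sum is equivalent to maximizing the conditional probability, because the logarithm is monotonic.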

3.3 Conditional Random Field

Slot label predictions depend on the predictions for surrounding words. Structured prediction models, such as conditional random fields (CRF), have been shown to improve slot filling performance. Zhou and Xu (2015) improve semantic role labeling by adding a CRF layer on top of a BiLSTM encoder. Here we investigate the efficacy of adding a CRF on top of the joint BERT model to model slot label dependencies.
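The text does not describe how the CRF layer is trained or decoded. As a generic illustration of how a linear-chain CRF uses label dependencies at prediction time, the sketch below runs Viterbi decoding over per-token emission scores and a label-to-label transition matrix. All scores are made up, and the label set (0 = O, 1 = B, 2 = I) is an assumption for the example.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][k]: score of label k at position t.
    transitions[j][k]: score of moving from label j to label k.
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(num_labels):
            # Best previous label for ending at label k.
            best_prev = max(range(num_labels), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    best = max(range(num_labels), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


emissions = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]]
# Penalize the invalid transition O -> I; all other transitions score 0.
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(viterbi(emissions, transitions))  # [0, 1, 2]
```

Taking the argmax at each position independently would give O, I, I here, which contains an invalid O to I transition. The transition penalty steers decoding to O, B, I instead.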

4 Experiments and Analysis

We evaluate the proposed model on two public benchmark datasets, ATIS and Snips.

4.1 Data

The ATIS dataset (Tür et al., 2010), which includes audio recordings of people making flight reservations, is widely used in NLU research. We use the same data division as Goo et al. (2018) for both datasets. The ATIS training, development and test sets contain 4,478, 500 and 893 utterances, respectively. There are 120 slot labels and 21 intent types in the training set. We also use Snips (Coucke et al., 2018), which is collected from the Snips personal voice assistant. The Snips training, development and test sets contain 13,084, 700 and 700 utterances, respectively. There are 72 slot labels and 7 intent types in the training set.

4.2 Training Details

We use the English uncased BERT-Base model (https://github.com/google-research/bert), which has 12 layers, a hidden size of 768, and 12 attention heads. BERT is pre-trained on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For fine-tuning, all hyper-parameters are tuned on the development set. The maximum length is 50. The batch size is 128. Adam (Kingma and Ba, 2014) is used for optimization with an initial learning rate of 5e-5. The dropout probability is 0.1. The maximum number of epochs is selected from [1, 5, 10, 20, 30, 40].
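The settings above can be collected in a small configuration object, sketched below. The class name `FineTuneConfig`, its field names, and the helper `steps_per_epoch` are hypothetical. The step counts additionally assume that the final partial batch of each epoch is kept, which the text does not state.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning settings as reported in the text (field names are illustrative)."""
    num_layers: int = 12
    hidden_size: int = 768
    num_heads: int = 12
    max_seq_len: int = 50
    batch_size: int = 128
    learning_rate: float = 5e-5
    dropout: float = 0.1
    epoch_candidates: Tuple[int, ...] = (1, 5, 10, 20, 30, 40)


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps per epoch, keeping the last partial batch (assumed)."""
    return math.ceil(num_examples / batch_size)


config = FineTuneConfig()
print(steps_per_epoch(13084, config.batch_size))  # Snips training set: 103
print(steps_per_epoch(4478, config.batch_size))   # ATIS training set: 35
```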

4.3 Results

Table 2 shows the model performance as slot filling F1, intent classification accuracy, and sentence-level semantic frame accuracy on the Snips and ATIS datasets.
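For readers unfamiliar with the metrics, the sketch below computes intent accuracy and sentence-level semantic frame accuracy under the usual reading of the latter: an utterance counts as correct only if the intent and every slot label are correct. Slot F1 is not covered. The function names and the two toy utterances are made up, with label names borrowed from Table 4.

```python
from typing import List


def intent_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of utterances whose intent is predicted correctly."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def frame_accuracy(
    gold_intents: List[str],
    pred_intents: List[str],
    gold_slots: List[List[str]],
    pred_slots: List[List[str]],
) -> float:
    """Fraction of utterances with the intent and all slot labels correct."""
    correct = 0
    for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_slots, pred_slots):
        if gi == pi and gs == ps:
            correct += 1
    return correct / len(gold_intents)


gold_intents = ["SearchScreeningEvent", "BookRestaurant"]
pred_intents = ["SearchScreeningEvent", "BookRestaurant"]
gold_slots = [["O", "B-movie-name", "I-movie-name"], ["O", "B-timeRange"]]
pred_slots = [["O", "B-object-name", "I-object-name"], ["O", "B-timeRange"]]

print(intent_accuracy(gold_intents, pred_intents))                       # 1.0
print(frame_accuracy(gold_intents, pred_intents, gold_slots, pred_slots))  # 0.5
```

The toy case shows why frame accuracy is the strictest of the three metrics: both intents are right, yet one mislabeled slot span makes the whole first utterance count as wrong.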

The first group of models comprises the baselines, which are state-of-the-art joint intent classification and slot filling models: a sequence-based joint model using BiLSTM (Hakkani-Tür et al., 2016), an attention-based model (Liu and Lane, 2016), and a slot-gated model (Goo et al., 2018).

The second group of models includes the proposed joint BERT models. As can be seen from Table 2, joint BERT models significantly outperform the baseline models on both datasets. On Snips, joint BERT achieves intent classification accuracy of 98.6% (from 97.0%), slot filling F1 of 97.0% (from 88.8%), and sentence-level semantic frame accuracy of 92.8% (from 75.5%). On ATIS, joint BERT achieves intent classification accuracy of 97.5% (from 94.1%), slot filling F1 of 96.1% (from 95.2%), and sentence-level semantic frame accuracy of 88.2% (from 82.6%). Joint BERT + CRF replaces the softmax classifier with a CRF and performs comparably to joint BERT, probably because the self-attention mechanism in the Transformer may already model the label structures sufficiently.

Compared to ATIS, Snips includes multiple domains and has a larger vocabulary. For the more complex Snips dataset, joint BERT achieves a large gain in the sentence-level semantic frame accuracy, from 75.5% to 92.8% (22.9% relative). This demonstrates the strong generalization capability of the joint BERT model, considering that it is pre-trained on large-scale text from mismatched domains and genres (books and Wikipedia). On ATIS, joint BERT also achieves significant improvement on the sentence-level semantic frame accuracy, from 82.6% to 88.2% (6.8% relative).
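The relative figures quoted above follow from dividing the absolute gain by the baseline score. A short check, with `relative_gain` as an illustrative helper name:

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


snips = round(relative_gain(75.5, 92.8), 1)
atis = round(relative_gain(82.6, 88.2), 1)
print(snips)  # 22.9
print(atis)   # 6.8
```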

4.4 Ablation Analysis and Case Study

We conduct ablation analysis on Snips, as shown in Table 3. Without joint learning, the accuracy of intent classification drops to 98.0% (from 98.6%), and the slot filling F1 drops to 95.8% (from 97.0%). We also compare the joint BERT model with different fine-tuning epochs. The joint BERT model fine-tuned with only 1 epoch already outperforms the first group of models in Table 2.

We further select a case from Snips, shown in Table 4, to illustrate how joint BERT outperforms the slot-gated model (Goo et al., 2018) by exploiting the language representation power of BERT to improve the generalization capability. In this case, the slot-gated model wrongly labels “mother joan of the angels” as an object name and also predicts the wrong intent. In contrast, joint BERT correctly predicts both the slot labels and the intent. A possible explanation is that “mother joan of the angels” is a movie entry in Wikipedia: the BERT model was pre-trained partly on Wikipedia and possibly learned this information for this rare phrase.
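To make the tag sequences in Table 4 easier to read, the sketch below converts BIO tags into (slot label, text) pairs for the query in this case study. The helper `bio_to_slots` is illustrative. As a simplifying assumption, an I- tag that does not continue a span with the same label is ignored.

```python
from typing import List, Optional, Tuple


def bio_to_slots(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged words into (slot label, text) spans."""
    slots: List[Tuple[str, str]] = []
    label: Optional[str] = None
    span: List[str] = []
    for word, tag in zip(words, tags):
        if tag.startswith("I-") and label == tag[2:]:
            # Continue the current span.
            span.append(word)
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            slots.append((label, " ".join(span)))
        if tag.startswith("B-"):
            label, span = tag[2:], [word]
        else:
            label, span = None, []
    if label is not None:
        slots.append((label, " ".join(span)))
    return slots


words = "need to see mother joan of the angels in one second".split()
gold = ["O", "O", "O"] + ["B-movie-name"] + ["I-movie-name"] * 4 + ["B-timeRange"] + ["I-timeRange"] * 2
slot_gated = ["O", "O", "O"] + ["B-object-name"] + ["I-object-name"] * 4 + ["B-timeRange"] + ["I-timeRange"] * 2

print(bio_to_slots(words, gold))
print(bio_to_slots(words, slot_gated))
```

Both tag sequences yield the same two spans. They differ only in the label of the first span, `movie-name` versus `object-name`, which is exactly the error discussed above.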

5 Conclusion

We propose a joint intent classification and slot filling model based on BERT, aiming at addressing the poor generalization capability of traditional NLU models. Experimental results show that our proposed joint BERT model outperforms BERT models modeling intent classification and slot filling separately, demonstrating the efficacy of exploiting the relationship between the two tasks. Our proposed joint BERT model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on ATIS and Snips datasets over previous state-of-the-art models. Future work includes evaluations of the proposed approach on other large-scale and more complex NLU datasets, and exploring the efficacy of combining external knowledge with BERT.

Bibliography

1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
2. Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. CoRR, abs/1805.10190.
3. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
4. Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 753–757.
5. Daniel Guo, Gökhan Tür, Wen-tau Yih, and Geoffrey Zweig. 2014. Joint semantic utterance classification and slot filling with recursive neural networks. In 2014 IEEE Spoken Language Technology Workshop, SLT 2014, South Lake Tahoe, NV, USA, December 7-10, 2014, pages 554–559.
6. Dilek Hakkani-Tür, Gökhan Tür, Asli Çelikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech 2016, San Francisco, CA, USA, September 8-12, 2016, pages 715–719.
7. Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746–1751. ACL.
8. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.