Supervised Domain Enablement Attention for Personalized Domain Classification

Joo-Kyung Kim; Young-Bum Kim

arXiv:1812.07546·cs.CL·December 19, 2018

TL;DR

This paper introduces a supervised enablement attention mechanism for personalized domain classification that uses sigmoid activation and supervised learning to enhance attention expressiveness and improve classification accuracy.

Contribution

It proposes a novel supervised enablement attention method utilizing sigmoid activation and self-distillation, advancing personalized domain classification techniques.

Findings

1. Significant improvement in domain classification accuracy.
2. Effective use of supervised attention with sigmoid activation.
3. Enhanced expressiveness of attention weights.

Tables

Table 1: Accuracies (%) on a test set with biased ground-truth inclusion in the enabled domains (90%) (left) and a test set with unbiased inclusion (70%) (right) with various enablement attention methods. sfm, sgmd, spvs, sdst, and bias denote softmax, sigmoid, supervised, self-distilled, and domain enablement bias, respectively.

Model no	Attention method	Biased Top1	Biased MRR	Biased Top3	Unbiased Top1	Unbiased MRR	Unbiased Top3
(1)	sfm	95.81	97.27	99.08	90.65	93.60	97.31
(2)	sgmd	95.98	97.43	99.19	91.03	93.92	97.49
(3)	sgmd, spvs	96.10	97.50	99.21	91.11	93.98	97.53
(4)	sgmd, spvs, sdst	96.29	97.65	99.32	91.33	94.14	97.62
(5)	sfm, bias	97.01	98.26	99.75	90.07	93.03	96.84
(6)	sgmd, spvs, sdst, bias	97.48	98.51	99.76	90.58	93.30	96.73

Table 2: Sample utterances correctly predicted with model (4) but not with models (1) and (2).

Utterance	Ground-truth	Enabled domain: [attention weights for models (1), (2), and (4)], …
what is the price of bitcoin	Crypto Price	Sleep and Relaxation Sounds: [0.9998, 0.0004, 0.2029], Crypto Price: [0.0001, 9.21e-06, 0.9977]
find me a round trip ticket flight	Expedia	Expedia: [0.0048, 5.37e-08, 0.6205], KAYAK: [0.9952, 0.0004, 0.461]
find my phone	Find My Phone	The Name Game: [1.0, 0.0001, 0.1677]

Equations

$$a_{e} = \sigma(u \cdot v_{e})$$

$$L_{m} = -\sum_{i=1}^{n} \left[ y_{i} \log o_{i} + (1 - y_{i}) \log (1 - o_{i}) \right]$$

$$L_{a} = -\sum_{e \in E} \left[ y_{e} \log a_{e} + (1 - y_{e}) \log (1 - a_{e}) \right]$$

$$L_{d} = -\sum_{e \in E} \left[ \tilde{a_{e}} \log a_{e} + (1 - \tilde{a_{e}}) \log (1 - a_{e}) \right]$$

$$\tilde{a_{e}} = \sigma\left(\frac{u \cdot v_{e}}{T}\right)$$

$$L = L_{m} + \alpha \left\{ (1 - \beta^{t}) L_{a} + \beta^{t} L_{d} \right\}$$

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Full text

Supervised Domain Enablement Attention for Personalized Domain Classification

Joo-Kyung Kim, Young-Bum Kim

Amazon Alexa

{jookyk,youngbum}@amazon.com

Abstract

In large-scale domain classification for natural language understanding, leveraging each user’s domain enablement information, which refers to the preferred or authenticated domains by the user, with attention mechanism has been shown to improve the overall domain classification performance. In this paper, we propose a supervised enablement attention mechanism, which utilizes sigmoid activation for the attention weighting so that the attention can be computed with more expressive power without the weight sum constraint of softmax attention. The attention weights are explicitly encouraged to be similar to the corresponding elements of the ground-truth’s one-hot vector by supervised attention, and the attention information of the other enabled domains is leveraged through self-distillation. By evaluating on the actual utterances from a large-scale IPDA, we show that our approach significantly improves domain classification performance.

1 Introduction

Due to recent advances in deep learning techniques, intelligent personal digital assistants (IPDAs) such as Amazon Alexa, Google Assistant, Microsoft Cortana, and Apple Siri have been widely used as real-life applications of natural language understanding (Sarikaya et al., 2016; Sarikaya, 2017).

In natural language understanding, domain classification is the task of finding the most relevant domain for an input utterance (Tur and de Mori, 2011). For example, “make a lion sound” and “find me an apple pie recipe” should be classified as ZooKeeper and AllRecipe, respectively. Recent IPDAs cover several thousand diverse domains by including third-party developed domains such as Alexa Skills (Kumar et al., 2017; Kim et al., 2018a; Kim and Kim, 2018), Google Actions, and Cortana Skills, which makes domain classification more challenging.

Given a large number of domains, leveraging the user’s enabled domain information has been shown to improve domain classification performance, since enabled domains reflect the user’s context in terms of domain usage (Kim et al., 2018b). [Footnote 1: Enabled domain information specifically refers to preferred or authenticated domains in Amazon Alexa, but it can be extended to other information such as the list of recently used domains.] For an input utterance, Kim et al. (2018b) use an attention mechanism so that a weighted sum of the enabled domain vectors is used as an input signal along with the utterance vector. The enabled domain vectors and the attention weights are trained automatically in an end-to-end fashion to be helpful for domain classification.

In this paper, we propose a supervised enablement attention mechanism for more effective attention on the enabled domains. First, we use logistic sigmoid instead of softmax as the attention activation function, relaxing the constraint that the weights over all the enabled domains sum to 1 to the constraint that each attention weight lies between 0 and 1 regardless of the other weights (Martins and Astudillo, 2016; Kim et al., 2017). As a result, all the attention weights can be very low if no enabled domain is relevant to the ground-truth, so that irrelevant enabled domains can be disregarded, and multiple attention weights can be high when multiple enabled domains help disambiguate an input utterance. Second, we encourage each attention weight to be high if the corresponding enabled domain is the ground-truth domain and low otherwise, using a supervised attention method (Mi et al., 2016), so that the attention weights can be directly tuned for the downstream classification task. Third, we apply self-distillation (Furlanello et al., 2018) on top of the enablement attention weights so that we can better utilize the enabled domains that are not ground-truth domains but still relevant.

Evaluating on datasets obtained from real usage in a large-scale IPDA, we show that our approach significantly improves domain classification performance by utilizing the domain enablement information effectively.

2 Model

Figure 1 shows the overall architecture of the proposed model.

Given an input utterance, each word of the utterance is represented as a dense vector through word embedding followed by bidirectional long short-term memory (BiLSTM) (Graves and Schmidhuber, 2005). Then, an utterance vector is composed by concatenating the last outputs of the forward LSTM and the backward LSTM. [Footnote 2: We have also evaluated word vector summation, CNN (Kim, 2014), BiLSTM mean-pooling, and BiLSTM max-pooling (Conneau et al., 2017) as alternative utterance representation methods, but they did not show better performance on our task.]

To represent the domain enablement information, we obtain a weighted sum of the domain enablement vectors, where the weights are calculated by a logistic sigmoid function on top of the multiplicative attention (Luong et al., 2015) between the utterance vector and the domain enablement vectors. The attention weight of an enabled domain $e$ is formulated as follows:

$$a_{e} = \sigma(u \cdot v_{e}),$$

where $u$ is the utterance vector, $v_{e}$ is the enablement vector of enabled domain $e$, and $\sigma$ is the sigmoid function. Compared to a conventional attention mechanism using the softmax function, which constrains the sum of the attention weights to be 1, sigmoid attention has more expressive power, since each attention weight can be between 0 and 1 regardless of the other weights. We show in Section 3 that using sigmoid attention is indeed more effective for improving prediction performance.
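As an illustration of the difference described above, the following minimal sketch compares the two weighting schemes on a toy input. It is not the authors' implementation (the paper uses DyNet); the helper names and the numeric vectors are made up for the example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_attention(u, enabled_vectors):
    """a_e = sigmoid(u . v_e), computed independently per enabled domain."""
    return [sigmoid(dot(u, v)) for v in enabled_vectors]

def softmax_attention(u, enabled_vectors):
    """Softmax baseline: the weights always sum to 1."""
    scores = [dot(u, v) for v in enabled_vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy case (made-up numbers): a single enabled domain that does not match the utterance.
u = [1.0, -2.0]
enabled_vectors = [[-3.0, 1.0]]  # u . v = -5

softmax_w = softmax_attention(u, enabled_vectors)
sigmoid_w = sigmoid_attention(u, enabled_vectors)
print(softmax_w)  # the single weight is forced to 1.0
print(sigmoid_w)  # the single weight stays close to 0
```

With a single enabled domain, softmax has no choice but to assign it full weight, whereas the sigmoid weight can stay near zero. This mirrors the “find my phone” row of Table 2.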

The utterance vector and the weighted sum of the domain enablement vectors are concatenated to represent the utterance and the domain enablement as a single vector. Given the concatenated vector, a feed-forward neural network with a single hidden layer is used to predict the confidence score by a logistic sigmoid function for each domain. [Footnote 3: We utilize scaled exponential linear units (SeLU) as the activation function for the hidden layer (Klambauer et al., 2017).]

One issue with the proposed architecture is that the domain enablement can be trained to be a very strong signal, so that in many cases one of the enabled domains becomes the predicted domain regardless of how relevant the utterance is to it. To reduce this prediction bias, during training we replace the correct enabled domains of an input utterance with randomly sampled enabled domains with 50% probability, so that the domain enablement serves as an auxiliary signal rather than a determining signal. During inference, we always use the correct enabled domains of the given utterance.
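The 50% replacement step can be sketched as follows. The paper does not say how many random domains are drawn or from which pool, so drawing as many domains as the user really has enabled, uniformly from the full domain list, is an assumption made here for illustration. The function name is hypothetical.

```python
import random

def training_enabled_domains(true_enabled, all_domains, rng, swap_prob=0.5):
    """Return the enabled-domain list to feed the model for one training example."""
    if rng.random() < swap_prob:
        # Assumption: draw as many random domains as the user really has enabled.
        k = min(len(true_enabled), len(all_domains))
        return rng.sample(all_domains, k)
    return list(true_enabled)

rng = random.Random(0)
all_domains = ["ZooKeeper", "AllRecipe", "Expedia", "KAYAK", "WeatherChannel"]
true_enabled = ["Expedia", "KAYAK"]

# During training, roughly half of the examples see random enabled domains.
batch = [training_enabled_domains(true_enabled, all_domains, rng) for _ in range(4)]
print(batch)

# At inference time the swap is disabled, so the true list is always used.
print(training_enabled_domains(true_enabled, all_domains, rng, swap_prob=0.0))
```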

The main loss function of our model is formulated as binary log loss between the confidence score and the ground-truth vector as follows:

$$L_{m} = -\sum_{i=1}^{n} \left[ y_{i} \log o_{i} + (1 - y_{i}) \log (1 - o_{i}) \right],$$

where $n$ is the number of all domains, $o$ is an $n$ -dimensional confidence score vector from the model, and $y$ is an $n$ -dimensional one-hot vector whose element corresponding to the position of the ground-truth domain is set to 1.
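A minimal sketch of this binary log loss for a toy setting with four domains is shown below. The `eps` clamp is an addition for numerical safety and is not part of the paper's formulation. The score vectors are made up.

```python
import math

def main_loss(y, o, eps=1e-12):
    """Binary log loss summed over all domains (L_m)."""
    loss = 0.0
    for y_i, o_i in zip(y, o):
        o_i = min(max(o_i, eps), 1.0 - eps)  # avoid log(0)
        loss -= y_i * math.log(o_i) + (1.0 - y_i) * math.log(1.0 - o_i)
    return loss

y = [0.0, 1.0, 0.0, 0.0]              # one-hot ground-truth vector
confident = [0.05, 0.90, 0.10, 0.02]  # scores close to the ground-truth
uncertain = [0.5, 0.5, 0.5, 0.5]      # uninformative scores

print(main_loss(y, confident))
print(main_loss(y, uncertain))
```

Because each domain has its own sigmoid output, every element contributes an independent term to the loss, and the uninformative score vector is penalized more than the confident, correct one.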

2.1 Supervised Enablement Attention

Attention weights are originally intended to be automatically trained in an end-to-end fashion (Bahdanau et al., 2015), but it has been shown that applying proper explicit supervision to the attention improves the downstream tasks such as machine translation given the word alignment and constituent parsing given annotations between surface words and nonterminals (Mi et al., 2016; Liu et al., 2016; Kamigaito et al., 2017).

We hypothesize that if the ground-truth domain is one of the enabled domains, the attention weight for the ground-truth domain should be high and vice versa. To apply this hypothesis in the model training as a supervised attention method, we formulate an auxiliary loss function as follows:

$$L_{a} = -\sum_{e \in E} \left[ y_{e} \log a_{e} + (1 - y_{e}) \log (1 - a_{e}) \right],$$

where $E$ is a set of enabled domains and $a_{e}$ is the attention weight for the enabled domain $e$ .
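The sketch below shows how the targets $y_{e}$ could be derived from the enabled domains and how the auxiliary loss behaves. It assumes $y_{e}$ is 1 for the ground-truth domain and 0 for every other enabled domain, following the hypothesis above. The function names and attention values are illustrative.

```python
import math

def attention_targets(enabled_domains, ground_truth):
    """y_e = 1 for the ground-truth domain, 0 for every other enabled domain."""
    return [1.0 if d == ground_truth else 0.0 for d in enabled_domains]

def supervised_attention_loss(attention, targets, eps=1e-12):
    """Binary log loss between attention weights and targets (L_a)."""
    loss = 0.0
    for a, y in zip(attention, targets):
        a = min(max(a, eps), 1.0 - eps)  # avoid log(0)
        loss -= y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return loss

enabled = ["Expedia", "KAYAK"]
targets = attention_targets(enabled, "Expedia")
loss_good = supervised_attention_loss([0.9, 0.1], targets)  # attends to Expedia
loss_bad = supervised_attention_loss([0.1, 0.9], targets)   # attends to KAYAK
print(targets, loss_good, loss_bad)

# Ground-truth not enabled: every target is 0, so low weights are rewarded.
no_match = attention_targets(["The Name Game"], "Find My Phone")
print(no_match)
```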

2.2 Self-Distilled Attention

One issue of supervised attention in Section 2.1 is that enabled domains that are not ground-truth domains are encouraged to have lower attention weights regardless of their relevancies to the input utterances and the ground-truth domains. Distillation methods utilize not only the ground-truth but also all the output activations of a source model so that all the prediction information from the source model can be utilized for more effective knowledge transfer between the source model and the target model (Hinton et al., 2014). Self-distillation, which trains a model leveraging the outputs of the source model with the same architecture or capacity, has been shown to improve the target model’s performance with a distillation method (Furlanello et al., 2018).

We use a variant of self-distillation methods, where the model outputs at the previous epoch with the best dev set performance are used as the soft targets for the distillation, so that the enabled domains that are not ground-truths can also be used for the supervised attention. [Footnote 4: This approach is closely related to Temporal Ensembling (Laine and Aila, 2017), but we just leverage the model outputs at the previous epoch rather than accumulating the outputs over multiple epochs.] While conventional distillation methods utilize softmax activations as the target values, we show that distillation on top of sigmoid activations is also effective without loss of generality. The loss function for the self-distillation on the attention weights is formulated as follows:

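$$L_{d} = -\sum_{e \in E} \left[ \tilde{a_{e}} \log a_{e} + (1 - \tilde{a_{e}}) \log (1 - a_{e}) \right],$$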

where $\tilde{a_{e}}$ is the attention weight of the model showing the best dev set performance in the previous epochs. It is formulated as:

$$\tilde{a_{e}} = \sigma\left(\frac{u \cdot v_{e}}{T}\right),$$

where $T$ is the temperature for sufficient usage of all the attention weights as the soft target. In this work, we set $T$ to be 16, which shows the best dev set performance.
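To make the effect of the temperature concrete, the sketch below computes soft targets from made-up attention scores of an earlier checkpoint and evaluates the distillation loss against them. How the earlier scores are stored is not described in the paper, so `prev_scores` is a hypothetical stand-in.

```python
import math

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def soft_target(prev_score, temperature=16.0):
    """Soft target sigmoid(score / T) from a score u . v_e of the earlier model."""
    return sigmoid(prev_score / temperature)

def distillation_loss(attention, soft_targets, eps=1e-12):
    """Binary log loss between current weights and soft targets (L_d)."""
    loss = 0.0
    for a, t in zip(attention, soft_targets):
        a = min(max(a, eps), 1.0 - eps)  # avoid log(0)
        loss -= t * math.log(a) + (1.0 - t) * math.log(1.0 - a)
    return loss

prev_scores = [8.0, -8.0]                     # made-up scores from the earlier checkpoint
sharp = [sigmoid(s) for s in prev_scores]     # targets without temperature
soft = [soft_target(s) for s in prev_scores]  # targets with T = 16
current_attention = [0.9, 0.2]
loss_d = distillation_loss(current_attention, soft)
print(sharp, soft, loss_d)
```

Dividing by $T$ pulls the targets toward 0.5, so enabled domains that are not the ground-truth still keep a non-negligible target value instead of being pushed toward 0.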

We have also evaluated soft-target regularization (Aghajanyan, 2017), where a weighted sum of the hard ground-truth target vector and the soft target vector is used as a single target vector, but it did not show better performance than self-distillation.

All the described loss functions are added to compose a single loss function as follows:

$$L = L_{m} + \alpha \left\{ (1 - \beta^{t}) L_{a} + \beta^{t} L_{d} \right\},$$

where $\alpha$ is a coefficient representing the degree of supervised enablement attention and $\beta^{t}$ denotes the degree of the self-distillation. We set $\alpha$ to be 0.01 in this work. Following Hu et al. (2016), $\beta^{t}=1-0.95^{t}$ , where $t$ denotes the current training epoch starting from 0 so that the hard ground-truth targets are more influential in the early epochs and the self-distillation is more utilized in the late epochs.
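The schedule and the combined loss can be sketched as below. The sketch reads the weight on $L_{a}$ as $1-\beta^{t}$, which is an interpretation consistent with the description that hard targets dominate in early epochs. The loss values passed in are placeholders.

```python
def beta(epoch, decay=0.95):
    """beta^t = 1 - 0.95^t, with epochs counted from 0."""
    return 1.0 - decay ** epoch

def total_loss(main, supervised, distilled, epoch, alpha=0.01):
    """L = L_m + alpha * ((1 - beta^t) * L_a + beta^t * L_d)."""
    b = beta(epoch)
    return main + alpha * ((1.0 - b) * supervised + b * distilled)

# Weight on self-distillation at a few epochs of the 25-epoch training run.
for epoch in (0, 1, 12, 24):
    print(epoch, round(beta(epoch), 3))

# Placeholder loss values, only to show how the terms are mixed.
print(total_loss(1.0, 2.0, 3.0, epoch=0))
print(total_loss(1.0, 2.0, 3.0, epoch=24))
```

At epoch 0 the distillation term has zero weight, so only the hard ground-truth targets supervise the attention. By the last of the 25 epochs the distillation weight is about 0.71.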

3 Experiments

We evaluate our proposed model on domain classification leveraging enabled domains. The enabled domains can be a crucial disambiguating signal especially when there are multiple similar domains. For example, assume that the input utterance is “what’s the weather” and there are multiple weather-related domains such as NewYorkWeather, AccuWeather, and WeatherChannel. In this case, if WeatherChannel is included as an enabled domain of the current user, it is likely to be the most relevant domain to the user.

3.1 Datasets

Following the data collection methods used in Kim et al. (2018b), our models are trained using utterances with explicit invocation patterns. For example, given a user’s utterance, “Ask {ZooKeeper} to {play peacock sound},” “play peacock sound” and ZooKeeper are extracted to compose a pair of the utterance and the ground-truth, respectively. In this way, we have generated train, development, and test sets containing 4.4M, 500K, and 500K utterances, respectively. All the utterances are from the usage log of Amazon Alexa, and the ground-truth of each utterance is one of 1K frequently used domains. The average number of enabled domains per utterance in the test sets is 8.47.
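As a rough illustration of the extraction step, the sketch below handles only the single “Ask {domain} to {request}” form quoted above. The regular expression and function name are hypothetical, and the actual pipeline of Kim et al. (2018b) presumably covers more invocation patterns than this one.

```python
import re

# Hypothetical pattern covering only the "Ask <domain> to <request>" form.
INVOCATION = re.compile(
    r"^ask\s+(?P<domain>.+?)\s+to\s+(?P<request>.+)$", re.IGNORECASE
)

def extract_pair(utterance):
    """Return (request, domain) for an explicit invocation, else None."""
    match = INVOCATION.match(utterance.strip())
    if match is None:
        return None
    return match.group("request"), match.group("domain")

pair = extract_pair("Ask ZooKeeper to play peacock sound")
print(pair)
print(extract_pair("play peacock sound"))  # no explicit invocation
```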

One issue with these collected data sets is that the ground-truth is included in the enabled domains for more than 90% of the utterances, so the ground-truths are biased toward enabled domains. [Footnote 5: Since the data collection method leverages utterances where users already know the exact domain names, such domains are likely to be the enabled domains of the users.] For a more accurate and unbiased evaluation of the models on input utterances from real live traffic, we also evaluate the models on same-sized train, development, and test sets where the utterances are sampled so that the ratio of ground-truth inclusion in enabled domains is 70%, which is closer to the ratio for actual input traffic.

3.2 Results

Table 1 shows the accuracies of our proposed models on the two test sets. We also show mean reciprocal rank (MRR) and top-3 accuracy, which are meaningful when utilizing a post reranker, but we do not cover reranking issues in this paper (Robichaud et al., 2014; Kim et al., 2018a). [Footnote 6: Top-3 accuracy is calculated as # (utterances one of whose top three predictions is a ground-truth) / # (total utterances).]
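The reported metrics can be computed from per-utterance confidence scores as in the sketch below. The tie-handling rule (only strictly higher scores outrank the ground-truth) and the toy scores are assumptions for the example, not details from the paper.

```python
def rank_of_truth(scores, truth_index):
    """1-based rank of the ground-truth domain; only strictly higher scores outrank it."""
    truth_score = scores[truth_index]
    return 1 + sum(1 for s in scores if s > truth_score)

def top_k_accuracy(all_scores, truths, k):
    """Fraction of utterances whose ground-truth is among the top k predictions."""
    hits = sum(1 for s, t in zip(all_scores, truths) if rank_of_truth(s, t) <= k)
    return hits / len(truths)

def mean_reciprocal_rank(all_scores, truths):
    """Average of 1 / rank of the ground-truth over all utterances."""
    total = sum(1.0 / rank_of_truth(s, t) for s, t in zip(all_scores, truths))
    return total / len(truths)

# Three toy utterances over three domains (made-up confidence scores).
all_scores = [
    [0.9, 0.1, 0.3],  # ground-truth index 0, rank 1
    [0.2, 0.8, 0.5],  # ground-truth index 2, rank 2
    [0.6, 0.7, 0.1],  # ground-truth index 0, rank 2
]
truths = [0, 2, 0]

top1 = top_k_accuracy(all_scores, truths, 1)
top3 = top_k_accuracy(all_scores, truths, 3)
mrr = mean_reciprocal_rank(all_scores, truths)
print(top1, top3, mrr)
```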

From Table 1, we can first see that changing softmax attention to sigmoid attention significantly improves the performance (Top1 from 95.81% to 95.98% on the biased test set and from 90.65% to 91.03% on the unbiased test set). This means that having more expressive power for the domain enablement information by relaxing the softmax constraint is effective in terms of leveraging the domain enablement information for domain classification. Along with sigmoid attention, supervised attention leveraging the ground-truth slightly improves the performance, and supervised attention combined with self-distillation shows significant performance improvement. This demonstrates that supervised domain enablement attention leveraging ground-truth enabled domains is helpful, and that utilizing attention information from other enabled domains is synergistic.

Kim et al. (2018b)’s model also adds a domain enablement bias vector to the final output, which is helpful when the ground-truth domain is one of the enabled domains. Such models (5) and (6) also show good performance for the test set where the ground-truth is one of the enabled domains with more than 90% probability. However, for the unbiased test set where the ground-truth is included in the enabled domains with a smaller probability, not adding the bias vector is shown to be better overall.

Table 2 shows sample utterances correctly predicted with model (4) but not with models (1) and (2). For the first two utterances, the ground-truths are included in the enabled domains, but there were only hundreds or fewer training instances whose ground-truths are Crypto Price or Expedia. In these cases, we can see that model (1) attends to unrelated domains and model (2) attends to none of the enabled domains, while model (4), which uses supervised attention, attends to the ground-truth even without many training examples. “find my phone” has a single enabled domain, which is not the ground-truth. In this case, model (1) still fully attends to the unrelated domain because of softmax attention, while models (2) and (4) do not attend to it strongly, so that the unrelated enabled domain has little impact.

3.3 Implementation Details

The word vectors are initialized with off-the-shelf GloVe vectors (Pennington et al., 2014), and all the other model parameters are initialized with Xavier initialization (Glorot and Bengio, 2010). Each model is trained for 25 epochs and the parameters showing the best performance on the development set are chosen as the model parameters. We use ADAM (Kingma and Ba, 2015) for the optimization with the initial learning rate 0.0002 and the mini-batch size 128. We use gradient clipping, where the threshold is set to 5. We use a variant of LSTM, where the input gate and the forget gate are coupled and peephole connections are used (Gers and Schmidhuber, 2000; Greff et al., 2017). We also use variational dropout for the LSTM regularization (Gal and Ghahramani, 2016). All the models are implemented with DyNet (Neubig et al., 2017).

4 Conclusion

We have introduced a novel domain enablement attention mechanism improving domain classification performance utilizing domain enablement information more effectively. The proposed attention mechanism uses sigmoid attentions for more expressive power of the attention weights, supervised attention leveraging ground-truth information for explicit guidance of the attention weight training, and self-distillation for the attention supervision leveraging enabled domains that are not ground truth domains. Evaluating on utterances from real usage in a large-scale IPDA, we have demonstrated that our proposed model significantly improves domain classification performance by better utilizing domain enablement information.

Bibliography

1. Armen Aghajanyan. 2017. Softtarget regularization: An effective technique to reduce over-fitting in neural networks. In IEEE Conference on Cybernetics (CYBCONF).
2. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).
3. Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In EMNLP, pages 670–680.
4. Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born Again Neural Networks. In International Conference on Machine Learning (ICML), pages 1602–1611.
5. Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29 (NIPS), pages 1019–1027.
6. Felix A. Gers and Jürgen Schmidhuber. 2000. Recurrent Nets that Time and Count. In IJCNN, volume 3, pages 189–194.
7. Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256.
8. Alex Graves and Jürgen Schmidhuber. 2005. Frame-wise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610.