Supervised Domain Enablement Attention for Personalized Domain Classification
Joo-Kyung Kim, Young-Bum Kim

TL;DR
This paper introduces a supervised enablement attention mechanism for personalized domain classification that uses sigmoid activation and supervised learning to enhance attention expressiveness and improve classification accuracy.
Contribution
It proposes a novel supervised enablement attention method utilizing sigmoid activation and self-distillation, advancing personalized domain classification techniques.
Findings
Significant improvement in domain classification accuracy.
Effective use of supervised attention with sigmoid activation.
Enhanced expressiveness of attention weights.
[Figure 1: Overall architecture of the proposed model.]

Table 1: Accuracies of the models on the two test sets (biased and unbiased ground-truth inclusion).

| Model no | Attention method | Biased: Top1 | Biased: MRR | Biased: Top3 | Unbiased: Top1 | Unbiased: MRR | Unbiased: Top3 |
|---|---|---|---|---|---|---|---|
| (1) | sfm | 95.81 | 97.27 | 99.08 | 90.65 | 93.60 | 97.31 |
| (2) | sgmd | 95.98 | 97.43 | 99.19 | 91.03 | 93.92 | 97.49 |
| (3) | sgmd, spvs | 96.10 | 97.50 | 99.21 | 91.11 | 93.98 | 97.53 |
| (4) | sgmd, spvs, sdst | 96.29 | 97.65 | 99.32 | 91.33 | 94.14 | 97.62 |
| (5) | sfm, bias | 97.01 | 98.26 | 99.75 | 90.07 | 93.03 | 96.84 |
| (6) | sgmd, spvs, sdst, bias | 97.48 | 98.51 | 99.76 | 90.58 | 93.30 | 96.73 |

Table 2: Sample utterances correctly predicted with model (4) but not with models (1) and (2).

| Utterance | Ground-truth | Enabled domain: [attention weights for model (1), (2), and (4)], … |
|---|---|---|
| what is the price of bitcoin | Crypto Price | |
| find me a round trip ticket flight | Expedia | Expedia: [0.0048, 5.37e-08, 0.6205], KAYAK: [0.9952, 0.0004, 0.461] |
| find my phone | Find My Phone | The Name Game: [1.0, 0.0001, 0.1677] |
Amazon Alexa
{jookyk,youngbum}@amazon.com
Abstract
In large-scale domain classification for natural language understanding, leveraging each user’s domain enablement information, i.e., the domains the user has preferred or authenticated, with an attention mechanism has been shown to improve overall domain classification performance. In this paper, we propose a supervised enablement attention mechanism that uses sigmoid activation for the attention weighting, so that the attention can be computed with more expressive power, free of the sum-to-one constraint of softmax attention. Supervised attention explicitly encourages the attention weights to be similar to the corresponding elements of the ground-truth’s one-hot vector, and the attention information of the other enabled domains is leveraged through self-distillation. Evaluating on actual utterances from a large-scale IPDA, we show that our approach significantly improves domain classification performance.
1 Introduction
Due to recent advances in deep learning techniques, intelligent personal digital assistants (IPDAs) such as Amazon Alexa, Google Assistant, Microsoft Cortana, and Apple Siri have been widely used as real-life applications of natural language understanding (Sarikaya et al., 2016; Sarikaya, 2017).
In natural language understanding, domain classification is the task of finding the most relevant domain for an input utterance (Tur and de Mori, 2011). For example, “make a lion sound” and “find me an apple pie recipe” should be classified as ZooKeeper and AllRecipe, respectively. Recent IPDAs cover several thousand diverse domains by including third-party developed domains such as Alexa Skills (Kumar et al., 2017; Kim et al., 2018a; Kim and Kim, 2018), Google Actions, and Cortana Skills, which makes domain classification a more challenging task.
Given a large number of domains, leveraging the user’s enabled domain information has been shown to improve domain classification performance, since enabled domains reflect the user’s context in terms of domain usage (Kim et al., 2018b). (Enabled domain information specifically refers to preferred or authenticated domains in Amazon Alexa, but it can be extended to other information such as the list of recently used domains.) For an input utterance, Kim et al. (2018b) use an attention mechanism so that a weighted sum of the enabled domain vectors is used as an input signal alongside the utterance vector. The enabled domain vectors and the attention weights are trained automatically in an end-to-end fashion to be helpful for domain classification.
In this paper, we propose a supervised enablement attention mechanism for more effective attention on the enabled domains. First, we use the logistic sigmoid instead of softmax as the attention activation function, relaxing the constraint that the weights over all the enabled domains sum to 1 into the constraint that each attention weight lies between 0 and 1 regardless of the other weights (Martins and Astudillo, 2016; Kim et al., 2017). As a result, all the attention weights can be very low when no enabled domain is relevant to the ground-truth, so that irrelevant enabled domains can be disregarded, and multiple attention weights can be high when multiple enabled domains help disambiguate an input utterance. Second, we use a supervised attention method (Mi et al., 2016) to encourage each attention weight to be high if the corresponding enabled domain is the ground-truth domain and low otherwise, so that the attention weights are directly tuned for the downstream classification task. Third, we apply self-distillation (Furlanello et al., 2018) on top of the enablement attention weights so that we can better utilize the enabled domains that are not ground-truth domains but are still relevant.
Evaluating on datasets obtained from real usage in a large-scale IPDA, we show that our approach significantly improves domain classification performance by utilizing the domain enablement information effectively.
2 Model
Figure 1 shows the overall architecture of the proposed model.
Given an input utterance, each word of the utterance is represented as a dense vector through word embedding followed by a bidirectional long short-term memory (BiLSTM) network (Graves and Schmidhuber, 2005). Then, an utterance vector is composed by concatenating the last outputs of the forward LSTM and the backward LSTM. (We have also evaluated word vector summation, CNN (Kim, 2014), BiLSTM mean-pooling, and BiLSTM max-pooling (Conneau et al., 2017) as alternative utterance representation methods, but they did not show better performance on our task.)
To represent the domain enablement information, we obtain a weighted sum of the domain enablement vectors, where the weights are calculated by a logistic sigmoid function on top of multiplicative attention (Luong et al., 2015) between the utterance vector and the domain enablement vectors. The attention weight of an enabled domain $i$ is formulated as follows:
$$a_i = \sigma(u \cdot e_i)$$
where $u$ is the utterance vector, $e_i$ is the enablement vector of enabled domain $i$, and $\sigma$ is the sigmoid function. Compared to the conventional attention mechanism using the softmax function, which constrains the sum of the attention weights to be 1, sigmoid attention has more expressive power: each attention weight can be between 0 and 1 regardless of the other weights. We show in Section 3 that sigmoid attention is indeed more effective for improving prediction performance.
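The contrast between the two activations can be illustrated with a small sketch. It assumes the simplest multiplicative form, a plain dot product between the utterance vector and each enablement vector; the vectors and function names below are made up for illustration and are not taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax_attention(utterance, enabled):
    # Weights are normalized across enabled domains, so they always sum to 1.
    scores = [dot(utterance, e) for e in enabled]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def sigmoid_attention(utterance, enabled):
    # Each weight depends only on its own score and lies in (0, 1).
    return [sigmoid(dot(utterance, e)) for e in enabled]

def weighted_sum(weights, enabled):
    # Enablement summary vector that is concatenated with the utterance vector.
    dim = len(enabled[0])
    return [sum(w * e[d] for w, e in zip(weights, enabled)) for d in range(dim)]

utterance = [1.0, -1.0]            # toy utterance vector
irrelevant_only = [[-3.0, 3.0]]    # one enabled domain, unrelated to the utterance
both_relevant = [[2.0, -2.0], [3.0, -3.0]]

print(softmax_attention(utterance, irrelevant_only))  # forced to full attention
print(sigmoid_attention(utterance, irrelevant_only))  # free to stay near zero
print(sigmoid_attention(utterance, both_relevant))    # both weights can be high
```

With a single unrelated enabled domain, softmax has no choice but to assign it all of the attention mass, while the sigmoid weight can stay close to zero; with two relevant domains, both sigmoid weights can be high at the same time.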
The utterance vector and the weighted sum of the domain enablement vectors are concatenated to represent the utterance and the domain enablement as a single vector. Given the concatenated vector, a feed-forward neural network with a single hidden layer is used to predict the confidence score for each domain through a logistic sigmoid function. (We use scaled exponential linear units (SeLU) as the activation function for the hidden layer (Klambauer et al., 2017).)
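As a rough sketch of this output stage, the forward pass below concatenates the two vectors, applies one SeLU hidden layer, and produces an independent sigmoid confidence per domain. The layer sizes and weight values are toy assumptions chosen only to make the example runnable; the SeLU constants are the standard ones from Klambauer et al. (2017).

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, x):
    # One output per weight row: row . x + bias
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def domain_scores(utterance_vec, enablement_vec, w1, b1, w2, b2):
    x = utterance_vec + enablement_vec            # list concatenation
    hidden = [selu(v) for v in affine(w1, b1, x)]
    return [sigmoid(v) for v in affine(w2, b2, hidden)]

# Toy dimensions: 2-d utterance, 2-d enablement summary, 2 hidden units, 3 domains.
w1 = [[0.1, 0.2, 0.3, 0.4], [-0.1, -0.2, -0.3, -0.4]]
b1 = [0.0, 0.0]
w2 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b2 = [0.0, 0.0, 0.0]

scores = domain_scores([1.0, 0.0], [0.5, 0.5], w1, b1, w2, b2)
print(scores)
```

Because each output uses its own sigmoid rather than a shared softmax, the per-domain confidences are not forced to sum to 1.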
One issue with the proposed architecture is that the domain enablement can be trained to be a very strong signal, so that in many cases one of the enabled domains becomes the predicted domain regardless of how relevant the utterance is to it. To reduce this prediction bias, during training we replace the correct enabled domains of an input utterance with randomly sampled enabled domains with 50% probability, so that the domain enablement is used as an auxiliary signal rather than a determining signal. During inference, we always use the correct enabled domains of the given utterances.
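A minimal sketch of this training-time replacement is shown below. The text does not say how many random domains are drawn, so sampling as many as the user actually has enabled is an assumption, as are the function names and the toy catalog.

```python
import random

def training_enabled_domains(true_enabled, all_domains, rng, swap_prob=0.5):
    """Enabled-domain list fed to the model for one training example."""
    if rng.random() < swap_prob:
        # Assumption: the random replacement has the same size as the real list.
        k = min(len(true_enabled), len(all_domains))
        return rng.sample(all_domains, k)
    return list(true_enabled)

def inference_enabled_domains(true_enabled):
    """At inference time the real enabled domains are always used."""
    return list(true_enabled)

rng = random.Random(0)
catalog = [f"domain_{i}" for i in range(1000)]   # hypothetical domain catalog
user_enabled = ["WeatherChannel", "ZooKeeper", "AllRecipe"]

batch = [training_enabled_domains(user_enabled, catalog, rng) for _ in range(10000)]
swapped_fraction = sum(b != user_enabled for b in batch) / len(batch)
print(round(swapped_fraction, 3))
```

Roughly half of the training examples then carry uninformative enablement input, which pushes the classifier to rely on the utterance itself.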
The main loss function of our model is formulated as the binary log loss between the confidence score and the ground-truth vector as follows:
$$\mathcal{L}_b = -\sum_{i=1}^{n} \left[ t_i \log o_i + (1 - t_i) \log (1 - o_i) \right]$$
where $n$ is the number of all domains, $o$ is an $n$-dimensional confidence score vector from the model, and $t$ is an $n$-dimensional one-hot vector whose element corresponding to the position of the ground-truth domain is set to 1.
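The binary log loss described here can be written out as a short sketch. Summing over domains (rather than averaging) and clamping scores away from 0 and 1 for numerical safety are assumptions of this example.

```python
import math

def binary_log_loss(scores, targets, eps=1e-12):
    """Binary log loss summed over all domains."""
    total = 0.0
    for score, target in zip(scores, targets):
        s = min(max(score, eps), 1.0 - eps)   # avoid log(0)
        total -= target * math.log(s) + (1.0 - target) * math.log(1.0 - s)
    return total

one_hot = [0.0, 1.0, 0.0]          # ground-truth domain is at index 1
good = binary_log_loss([0.05, 0.90, 0.10], one_hot)
bad = binary_log_loss([0.90, 0.05, 0.10], one_hot)
print(good, bad)
```

A prediction that puts high confidence on the ground-truth domain and low confidence elsewhere yields a much smaller loss than one that does the opposite.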
2.1 Supervised Enablement Attention
Attention weights are originally intended to be automatically trained in an end-to-end fashion (Bahdanau et al., 2015), but it has been shown that applying proper explicit supervision to the attention improves the downstream tasks such as machine translation given the word alignment and constituent parsing given annotations between surface words and nonterminals (Mi et al., 2016; Liu et al., 2016; Kamigaito et al., 2017).
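We hypothesize that if the ground-truth domain is one of the enabled domains, the attention weight for the ground-truth domain should be high, and the attention weights for the other enabled domains should be low. To apply this hypothesis in model training as a supervised attention method, we formulate an auxiliary loss function as follows:
$$\mathcal{L}_a = -\sum_{i \in E} \left[ t_i \log a_i + (1 - t_i) \log (1 - a_i) \right]$$
where $E$ is the set of enabled domains, $a_i$ is the attention weight for the enabled domain $i$, and $t_i$ is 1 if enabled domain $i$ is the ground-truth domain and 0 otherwise.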
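The auxiliary loss can be sketched as a binary log loss restricted to the enabled domains, with a target of 1 for the ground-truth domain and 0 for every other enabled domain. This form is an assumption based on the description above; the attention values are the Expedia/KAYAK weights reported in Table 2 for models (1) and (4).

```python
import math

def supervised_attention_loss(attention, ground_truth, eps=1e-12):
    """attention maps each enabled domain name to its sigmoid attention weight."""
    loss = 0.0
    for domain, weight in attention.items():
        a = min(max(weight, eps), 1.0 - eps)   # avoid log(0)
        target = 1.0 if domain == ground_truth else 0.0
        loss -= target * math.log(a) + (1.0 - target) * math.log(1.0 - a)
    return loss

# "find me a round trip ticket flight", ground-truth Expedia (Table 2).
model_1 = {"Expedia": 0.0048, "KAYAK": 0.9952}
model_4 = {"Expedia": 0.6205, "KAYAK": 0.461}

loss_1 = supervised_attention_loss(model_1, "Expedia")
loss_4 = supervised_attention_loss(model_4, "Expedia")
print(loss_1, loss_4)
```

When the ground-truth is not among the enabled domains, every target is 0, so the loss simply pushes all attention weights down.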
2.2 Self-Distilled Attention
One issue of supervised attention in Section 2.1 is that enabled domains that are not ground-truth domains are encouraged to have lower attention weights regardless of their relevancies to the input utterances and the ground-truth domains. Distillation methods utilize not only the ground-truth but also all the output activations of a source model so that all the prediction information from the source model can be utilized for more effective knowledge transfer between the source model and the target model (Hinton et al., 2014). Self-distillation, which trains a model leveraging the outputs of the source model with the same architecture or capacity, has been shown to improve the target model’s performance with a distillation method (Furlanello et al., 2018).
We use a variant of self-distillation in which the model outputs at the previous epoch with the best dev set performance are used as the soft targets for the distillation, so that the enabled domains that are not ground-truths can also be used for the supervised attention. (This approach is closely related to Temporal Ensembling (Laine and Aila, 2017), but we only leverage the model outputs at the previous epoch rather than accumulating the outputs over multiple epochs.) While conventional distillation methods use softmax activations as the target values, we show that distillation on top of sigmoid activations is also effective. The loss function for the self-distillation on the attention weights is formulated as follows:
$$\mathcal{L}_d = -\sum_{i \in E} \left[ \tilde{a}_i \log a_i + (1 - \tilde{a}_i) \log (1 - a_i) \right]$$
where $a_i$ is the current attention weight for enabled domain $i$ in the set of enabled domains $E$, and $\tilde{a}_i$ is the attention weight of the model showing the best dev set performance in the previous epochs. It is formulated as:
$$\tilde{a}_i = \sigma\left(\frac{u \cdot e_i}{T}\right)$$
where $u \cdot e_i$ is the attention score computed by that earlier model and $T$ is the temperature for sufficient usage of all the attention weights as the soft target. In this work, we set $T$ to 16, which shows the best dev set performance.
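The role of the temperature can be illustrated with a short sketch. It assumes that the temperature divides the pre-sigmoid attention score of the earlier best model and that the distillation loss is a binary log loss against those softened weights; the score values are invented for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_targets(previous_scores, temperature=16.0):
    """Soften the earlier model's attention by dividing its raw scores by T."""
    return [sigmoid(s / temperature) for s in previous_scores]

def distillation_loss(attention, targets, eps=1e-12):
    """Binary log loss between current attention weights and soft targets."""
    loss = 0.0
    for weight, target in zip(attention, targets):
        a = min(max(weight, eps), 1.0 - eps)
        loss -= target * math.log(a) + (1.0 - target) * math.log(1.0 - a)
    return loss

previous_scores = [16.0, -16.0]                 # raw scores from the earlier best model
hard = [sigmoid(s) for s in previous_scores]    # nearly 1 and nearly 0
soft = soft_targets(previous_scores)            # pulled toward 0.5 by T = 16

print(hard, soft)
print(distillation_loss(soft, soft), distillation_loss([0.99, 0.01], soft))
```

Without the temperature, the earlier model's weights are close to 0 or 1 and carry little more information than the hard labels; dividing by a large temperature keeps the targets away from the extremes so that non-ground-truth enabled domains still contribute a graded signal.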
We have also evaluated soft-target regularization (Aghajanyan, 2017), where a weighted sum of the hard ground-truth target vector and the soft target vector is used as a single target vector, but it did not show better performance than self-distillation.
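All the described loss functions are added to compose a single loss function as follows:
$$\mathcal{L} = \mathcal{L}_b + \alpha \mathcal{L}_a + \beta \mathcal{L}_d$$
where $\mathcal{L}_b$ is the main binary log loss, $\mathcal{L}_a$ is the supervised attention loss, $\mathcal{L}_d$ is the self-distillation loss, $\alpha$ is a coefficient representing the degree of supervised enablement attention, and $\beta$ denotes the degree of the self-distillation. We set $\alpha$ to 0.01 in this work. Following Hu et al. (2016), $\beta$ is set as a function of $k$, where $k$ denotes the current training epoch starting from 0, so that the hard ground-truth targets are more influential in the early epochs and the self-distillation is more utilized in the late epochs.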
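The combination of the three losses can be sketched as follows. The supervised attention coefficient of 0.01 comes from the text; the schedule for the distillation coefficient is only described as following Hu et al. (2016) and depending on the epoch, so the capped `1 - base ** epoch` form and the constants `base` and `cap` used here are hypothetical placeholders, not the paper's values.

```python
def distillation_weight(epoch, base=0.95, cap=1.0):
    """Hypothetical epoch-dependent schedule: 0 at epoch 0, growing toward cap."""
    return min(cap, 1.0 - base ** epoch)

def total_loss(main_loss, attention_loss, distill_loss, epoch, alpha=0.01):
    """Main loss plus weighted supervised-attention and self-distillation losses."""
    beta = distillation_weight(epoch)
    return main_loss + alpha * attention_loss + beta * distill_loss

for epoch in (0, 1, 5, 24):
    print(epoch, round(distillation_weight(epoch), 4))

print(total_loss(1.0, 2.0, 3.0, epoch=0))
```

Any schedule that starts at zero and grows with the epoch gives the behavior described in the text: early training is driven by the hard ground-truth targets, and the soft targets gain influence later.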
3 Experiments
We evaluate our proposed model on domain classification leveraging enabled domains. The enabled domains can be a crucial disambiguating signal especially when there are multiple similar domains. For example, assume that the input utterance is “what’s the weather” and there are multiple weather-related domains such as NewYorkWeather, AccuWeather, and WeatherChannel. In this case, if WeatherChannel is included as an enabled domain of the current user, it is likely to be the most relevant domain to the user.
3.1 Datasets
Following the data collection methods used in Kim et al. (2018b), our models are trained using utterances with explicit invocation patterns. For example, given a user’s utterance, “Ask {ZooKeeper} to {play peacock sound},” “play peacock sound” and ZooKeeper are extracted to compose a pair of the utterance and the ground-truth, respectively. In this way, we have generated train, development, and test sets containing 4.4M, 500K, and 500K utterances, respectively. All the utterances are from the usage log of Amazon Alexa, and the ground-truth of each utterance is one of 1K frequently used domains. The average number of enabled domains per utterance in the test sets is 8.47.
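The extraction of an (utterance, ground-truth) pair from an explicit invocation can be sketched with a regular expression. Only the "Ask {domain} to {request}" pattern from the example is handled; the actual collection pipeline is not described in detail, so the pattern and function name here are illustrative assumptions.

```python
import re

# Hypothetical pattern covering only the "Ask <domain> to <request>" form.
INVOCATION = re.compile(
    r"^ask\s+(?P<domain>.+?)\s+to\s+(?P<utterance>.+)$",
    re.IGNORECASE,
)

def extract_pair(text):
    """Return (utterance, ground_truth_domain), or None if no invocation is found."""
    match = INVOCATION.match(text.strip())
    if match is None:
        return None
    return match.group("utterance"), match.group("domain")

print(extract_pair("Ask ZooKeeper to play peacock sound"))
print(extract_pair("what's the weather"))
```

A real system would presumably match the extracted name against a catalog of known domains, since a non-greedy pattern like this one cannot tell where a multi-word domain name containing "to" ends.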
One issue with the collected datasets is that the ground-truth is included in the enabled domains for more than 90% of the utterances, so the ground-truths are biased toward enabled domains. (Since the data collection method leverages utterances where users already know the exact domain names, such domains are likely to be the enabled domains of the users.) For a more accurate and unbiased evaluation of the models on input utterances from real live traffic, we also evaluate the models on same-sized train, development, and test sets in which the utterances are sampled so that the ratio of ground-truth inclusion in the enabled domains is 70%, which is closer to the ratio for actual input traffic.
3.2 Results
Table 1 shows the accuracies of our proposed models on the two test sets. We also report mean reciprocal rank (MRR) and top-3 accuracy, which are meaningful when a post reranker is used, although we do not cover reranking issues in this paper (Robichaud et al., 2014; Kim et al., 2018a). (Top-3 accuracy is calculated as # (utterances one of whose top three predictions is a ground-truth) / # (total utterances).)
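For reference, the three reported metrics can be computed from ranked predictions as in the sketch below. The rankings are toy data, and assigning a reciprocal rank of 0 when the ground-truth is missing from the ranking is an assumption of this example.

```python
def reciprocal_rank(ranked_domains, truth):
    for position, domain in enumerate(ranked_domains, start=1):
        if domain == truth:
            return 1.0 / position
    return 0.0   # assumption: ground-truth absent from the ranking

def evaluate(rankings, truths, k=3):
    """rankings[i] is the list of predicted domains for utterance i, best first."""
    n = len(truths)
    pairs = list(zip(rankings, truths))
    return {
        "top1": sum(r[0] == t for r, t in pairs) / n,
        "topk": sum(t in r[:k] for r, t in pairs) / n,
        "mrr": sum(reciprocal_rank(r, t) for r, t in pairs) / n,
    }

rankings = [
    ["A", "B", "C"],        # ground-truth ranked 1st
    ["B", "A", "C"],        # ground-truth ranked 2nd
    ["B", "C", "D", "A"],   # ground-truth ranked 4th
]
truths = ["A", "A", "A"]
metrics = evaluate(rankings, truths)
print(metrics)
```

On these three toy utterances, top-1 accuracy is 1/3, top-3 accuracy is 2/3, and MRR is (1 + 1/2 + 1/4) / 3.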
From Table 1, we can first see that changing softmax attention to sigmoid attention significantly improves the performance. This means that having more expressive power for the domain enablement information by relaxing the softmax constraint is effective in terms of leveraging the domain enablement information for domain classification. Along with sigmoid attention, supervised attention leveraging ground-truth slightly improves the performance, and supervised attention combined with self-distillation shows significant performance improvement. It demonstrates that supervised domain enablement attention leveraging ground-truth enabled domains is helpful, and utilizing attention information from other enabled domains is synergistic.
Kim et al. (2018b)’s model also adds a domain enablement bias vector to the final output, which is helpful when the ground-truth domain is one of the enabled domains. Such models (5) and (6) also show good performance for the test set where the ground-truth is one of the enabled domains with more than 90% probability. However, for the unbiased test set where the ground-truth is included in the enabled domains with a smaller probability, not adding the bias vector is shown to be better overall.
Table 2 shows sample utterances correctly predicted with model (4) but not with models (1) and (2). For the first two utterances, the ground-truths are included in the enabled domains, but there were only hundreds or fewer training instances whose ground-truths are CryptoPrice or Expedia. In these cases, model (1) attends to unrelated domains and model (2) attends to none of the enabled domains, whereas model (4), which uses supervised attention, attends to the ground-truth even without many training examples. “find my phone” has a single enabled domain, which is not the ground-truth. In this case, model (1) still fully attends to the unrelated domain because of softmax attention, while models (2) and (4) do not attend to it highly, so the unrelated enabled domain has little influence.
3.3 Implementation Details
The word vectors are initialized with off-the-shelf GloVe vectors (Pennington et al., 2014), and all the other model parameters are initialized with Xavier initialization (Glorot and Bengio, 2010). Each model is trained for 25 epochs and the parameters showing the best performance on the development set are chosen as the model parameters. We use ADAM (Kingma and Ba, 2015) for the optimization with the initial learning rate 0.0002 and the mini-batch size 128. We use gradient clipping, where the threshold is set to 5. We use a variant of LSTM, where the input gate and the forget gate are coupled and peephole connections are used (Gers and Schmidhuber, 2000; Greff et al., 2017). We also use variational dropout for the LSTM regularization (Gal and Ghahramani, 2016). All the models are implemented with DyNet (Neubig et al., 2017).
4 Conclusion
We have introduced a novel domain enablement attention mechanism that improves domain classification performance by utilizing domain enablement information more effectively. The proposed attention mechanism uses sigmoid attention for more expressive attention weights, supervised attention leveraging ground-truth information for explicit guidance of the attention weight training, and self-distillation for attention supervision that leverages enabled domains that are not ground-truth domains. Evaluating on utterances from real usage in a large-scale IPDA, we have demonstrated that our proposed model significantly improves domain classification performance by better utilizing domain enablement information.
References
- Armen Aghajanyan. 2017. Softtarget regularization: An effective technique to reduce over-fitting in neural networks. In IEEE Conference on Cybernetics (CYBCONF).
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).
- Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In EMNLP, pages 670–680.
- Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born Again Neural Networks. In International Conference on Machine Learning (ICML), pages 1602–1611.
- Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29 (NIPS), pages 1019–1027.
- Felix A. Gers and Jürgen Schmidhuber. 2000. Recurrent Nets that Time and Count. In IJCNN, volume 3, pages 189–194.
- Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256.
- Alex Graves and Jürgen Schmidhuber. 2005. Frame-wise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610.
