The Effect of Downstream Classification Tasks for Evaluating Sentence Embeddings
Peter Potash

TL;DR
This paper examines how the characteristics of label distributions in downstream classification tasks influence the effectiveness of sentence embeddings, highlighting the impact of label complexity on evaluation outcomes.
Contribution
It provides an analysis of how label distribution properties affect the evaluation of sentence embeddings in classification tasks, offering insights into the limitations of current evaluation methods.
Findings
Sentences with more labels across tasks have higher reconstruction loss.
Label distribution characteristics significantly influence embedding evaluation.
Evaluation effectiveness depends on the overall label distribution across sentences.
Abstract
One popular method for quantitatively evaluating the utility of sentence embeddings involves using them in downstream language processing tasks that require sentence representations as input. One simple such task is classification, where the sentence representations are used to train and test models on several classification datasets. We argue that by evaluating sentence representations in such a manner, the goal of the representations becomes learning a low-dimensional factorization of a sentence-task label matrix. We show how characteristics of this matrix can affect the ability for a low-dimensional factorization to perform as sentence representations in a suite of classification tasks. Primarily, sentences that have more labels across all possible classification tasks have a higher reconstruction loss, however the general nature of this effect is ultimately dependent on the overall…
Click any figure to enlarge with its caption.
Figure 1| Row Density | % Coverage | Avg Loss |
|---|---|---|
| 0.1% | 90 | |
| 1% | 9.9 | |
| 10% | 0.1 | |
| 0.1% | 33.3 | |
| 1% | 33.3 | |
| 10% | 33.3 | |
| 0.1% | 0.1 | |
| 1% | 9.9 | |
| 10% | 90 |
| # Dim. | Rep. | Density | CR | MR | MPQA | SUBJ | SST2 | SST5 | TREC |
| 40,000 | Binary | 50% | 66.01 | 75.07 | 79.57 | 73.66 | 99.95 | 45.7 | 30.6 |
| 4,000 | Binary | 50% | 83.15 | 99.57 | 97.5 | 99.74 | 100 | 99.64 | 91.2 |
| 400 | Binary | 50% | 99 | 99.99 | 98.38 | 99.99 | 100 | 100 | 98.8 |
| 40 | Binary | 50% | 99.92 | 100 | 97.73 | 100 | 100 | 100 | 100 |
| 40,000 | Binary | 1% | 91.02 | 91.67 | 97.86 | 99.86 | 100 | 99.95 | 98.2 |
| 4,000 | Binary | 1% | 88.95 | 99.82 | 97.86 | 99.86 | 100 | 99.73 | 100 |
| 400 | Binary | 1% | 99.65 | 100 | 98.15 | 100 | 100 | 100 | 100 |
| 40 | Binary | 1% | 99.31 | 100 | 97.73 | 100 | 100 | 100 | 99.6 |
| 4,000 | SVD | 50% | 63.76 | 61.62 | 72.51 | 60.58 | 99.89 | 29.5 | 19 |
| 400 | SVD | 50% | 63.76 | 54.34 | 68.77 | 52.23 | 98.35 | 24.62 | 19.2 |
| 40 | SVD | 50% | 63.76 | 50.63 | 68.77 | 50.58 | 97.47 | 23.08 | 18.6 |
| 4,000 | SVD | 1% | 63.76 | 99.88 | 73.5 | 99.93 | 100 | 97.24 | 98 |
| 400 | SVD | 1% | 63.76 | 99.81 | 68.77 | 99.84 | 100 | 100 | 96.6 |
| 40 | SVD | 1% | 63.76 | 99.98 | 68.77 | 99.8 | 100 | 51.72 | 90.4 |
| MTE (4096 Dim) | 88.3 | 82.8 | 91.3 | 94.2 | 84.5 | - | 94.2 | ||
| # classes | 2 | 2 | 2 | 2 | 2 | 5 | 6 | ||
| % Coverage | 5.7 | 15.7 | 15.7 | 14.3 | 100 | 17.1 | 8.6 | ||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining
The Effect of Downstream Classification Tasks
for Evaluating Sentence Embeddings
Peter Potash
Microsoft Research Montreal
Abstract
One popular method for quantitatively evaluating the utility of sentence embeddings involves using them in downstream language processing tasks that require sentence representations as input. One simple such task is classification, where the sentence representations are used to train and test models on several classification datasets. We argue that by evaluating sentence representations in such a manner, the goal of the representations becomes learning a low-dimensional factorization of a sentence-task label matrix. We show how characteristics of this matrix can affect the ability for a low-dimensional factorization to perform as sentence representations in a suite of classification tasks. Primarily, sentences that have more labels across all possible classification tasks have a higher reconstruction loss, however the general nature of this effect is ultimately dependent on the overall distribution of labels across all possible sentences.
1 Introduction
Universal Sentence Embeddings (USEs) is a line of research that seeks to learn a mapping between token sequences and vectors in a real-valued space of a fixed dimension. The work of Conneau et al. (2017) has brought tremendous attention to the problem of USEs, not only because of the model they propose, but because of how they propose to test the utility of their learned representations. The authors propose to use their learned sentence embeddings to train and test classification models on a variety of datasets, while keeping the sentence representation model fixed – the suite of downstream tasks is called SentEval Conneau and Kiela (2018) by the authors.
The research community has come to embrace SentEval as the default method for evaluating sentence embeddings, unleashing numerous papers that evaluate with SentEval (c.f. Subramanian et al. (2018); Pagliardini et al. (2018); Logeswaran and Lee (2018); Zhang et al. (2018)). In our work, we seek to analyze the implications that arise from evaluating USEs using downstream classification tasks, namely by viewing USEs as a low-rank approximation of a sentence-task label matrix. Furthermore, we provide in an initial exploration of how the distribution of task labels across sentences could affect the ability to produce truly ‘universal’ representations that are effective for an arbitrary classification task111To be clear, we are not proposing a new methodology for USEs: we are analyzing a current evaluation methodology..
2 USEs for
Downstream Classification
Formally, USEs learn a mapping from a token sequence of arbitrary length to a fixed-length vector . When evaluated with SentEval, the s become the input representations for learning a logistic regression classifier for a given text classification dataset. Currently, USEs are real-valued vectors where the dimensionality rarely exceeds 10k. However, in light of this evaluation methodology, we posit that a ‘perfect’ USE would be a high-dimensional vector where each dimension corresponds to a certain label for a certain classification task222We make the assumption that there would be a finite set of valid classification tasks. Future work can extend our analysis to an infinite set of tasks, though it is not immediately clear that, theoretically speaking, this set would be infinite.. The vector would be all zeros unless it has a label for a class, and in which case that dimension has a value of one. Suppose a classification task’s -way classification correspond to dimensions , then the perfect classifier would be a vector with all zeros, except for ones in dimensions .
We argue that a sentence is valid if it’s corresponding vector has at least one non-zero entry. Furthermore, stacking all sentence vectors vertically together to form a matrix333Like with the set of task labels, we make the assumption that there is a finite set of sentences, which is a simplification of the problem for our analysis, since one can construct a sentence of ever-increasing length Weisler and Milekic (2000)., the columns now correspond to classification tasks, which we argue must have at least one non-zero entry to be a valid classification task. Furthermore, we do not allow duplicate rows/columns (sentences/task labels). We define this matrix as .
3 Low-Rank Approximation of USEs
Because the goal of USEs is to be applicable to all classification tasks, the lower-dimensional embeddings must approximate the matrix . The Eckart-Young-Mirsky theorem Eckart and Young (1936) states that the solution to this problem is in fact the singular value decomposition of . Moreover, the difference in norm between a matrix and a rank matrix , derived from its singular value decomposition, is related to the singular values of , , as follows:
[TABLE]
By construction of , it is full-rank and thus has all non-zero singular values. Therefore, there is guaranteed to be information loss when reconstructing the complete matrix with a lower-rank approximation. Note that is the optimal low-rank approximation, making the left-hand side of Equation 1 a lower bound on the reconstruction loss.
Suppose we would like to use this low-rank approximation as the representations for a logistic regression model to perform binary classification as to whether a sentence has a label. The column of the right-singular vector matrix , associated with that label for that task, will provide the best possible binary classifier for this objective, when attempting to generalize across all possible input sentences to the classifier. Such a classifier would have the smallest possible generalization error given USEs of a certain dimension.
4 Effect of Label Density
Given this connection between USEs and the low-rank approximation of , it is informative to understand the relationship between how labels are distributed across sentences and how much error there is using a low-rank approximation to estimate . We conduct experiment on a binary matrix, size 4,000 by 4,300, with each experimental variation modifying the density (percentage of non-zero elements). Rows can either be 0.1%, 1%, or 10% dense, and the number of rows with each type of density can either be even or skewed, with one type taking 90% of rows, another 9.9%, and the last only 0.1%. We factor the matrix using singular value decomposition (the distribution of the singular values are shown in Figure 1), and keep only the 40 first principal dimensions. We then reconstruct using this low-rank approximation, and calculate the l1 loss across each row. Finally, we average the row-wise loss across each density type. The results are shown in Table 1.
One of the most interesting observations is that when the most dense rows account for the least amount of rows in the matrix, it is actually the rows with a density of 1% that lose the most information. It is also interesting to see that the average losses comparing evenly distributed rows to ones skewed dense are almost identical. Lastly, and arguably most importantly, note that the sparsest rows lose the least amount of information.
5 Experiments
In order to further empirically determine the effect of various factorization scenarios with respect to sentence embeddings’ performance on downstream classification tasks, we ran several experiments on the SentEval classification tasks. Each scenario has three different attributes that describe it:
Rep.: We have two ways to create the representations in our experiments: 1) Start with with a high dimensional binary matrix, and factor it with SVD. The matrix we factor is always of size 70,000 X 120,000, where the first rows map to a unique data point in a given classificaiton dataset, and the first columns are used to determine if is of class 1,…,444We specifically include the task label information in the representation, since we are not evaluating a specific technique for USEs, but rather, how well USEs can encode the information contained in .. Consequently, none of the rows after have values in the first columns. The representations created this way are called SVD; 2) Use binary vectors directly. This is similar to the factored matrix from the SVD representation, except in this scenario the matrix strictly has rows.
# Dim: The dimensionality of the sentence embeddings. if the representation is SVD, this is the number of principal components that are kept. If the representation is Binary, this is the size of the binary vector representations.
Density: This is the probability that a given cell of the matrix has a one. For SVD, this refers to the binary matrix that is factored. For Binary, this is for the representations themselves.
Results of our experiments using various combinations of Rep, # Dim, and Density are listed in Table 2. We run our experiments on the classification tasks available in the SentEval suite (Conneau and Kiela, 2018). Specifically, we use the following datasets: sentiment analysis (MR and both binary and 5-class SST) (Pang and Lee, 2005; Socher et al., 2013), question-type (TREC) (Voorhees and Tice, 2000), subjectivity/objectivity (SUBJ) (Pang and Lee, 2004), and opinion polarity (MPQA) (Wiebe et al., 2005).
6 Results
The results of our experiments show that there are numerous factors that effect the performance of USEs for downstream classification tasks. For the binary Rep type, having a large, dense representation, even though it has indices reserved specifically for revealing the label on a given task, proves harder to learn a strong classifier, with the difficulty increased by more classes and fewer training examples. For the SVD Rep, given either a dense or sparse binary matrix that is factored, the performance decreases as the dimensionality is decreased. Also for the SVD Rep, we can seen that % Coverage has an important effect on performance, as the datasets with smaller coverage of the factored matrix often have lower performance.
For both Reps, the density of the matrix, either binary or factored, has a critical impact in performance across the datasets. This dictates that data points are harder to classify when they have more labels from all possible classification tasks, which aligns with the results from Section 4. This could potentially be further supported by examining the performance of the state-of-the-art USEs from Subramanian et al. (2018) (MTE). The tasks for which it performs the worst, CR, MR, and SST2, are very similar, common tasks, so it can be argued the sentences in one of the datasets are likely to have more labels overall across all possible classification tasks, and therefore will be harder to classify using USEs.
7 Other Downstream Evaluation
Tasks
Until this point, we have discussed the usage of downstream classification tasks for evaluating USEs. Alternatively, the task of Semantic Textual Similarity (STS) (Agirre et al., 2012) is another popular downstream task for evaluating USEs (Arora et al., 2017; Cer et al., 2018). This is an unsupervised transfer task, since the USE for a give sentence in the dataset is used directly without training, by taking the cosine similarity between two sentences’ USEs as a proxy STS score. Consider, now, if the USEs were those described in Section 2, which are constructed with the goal of having optimal transferability to an arbitrary classification task. The calculation of the cosine similarity between two USEs becomes:
[TABLE]
where is the number of labels assigned to sentences multiplied together. Thus, two sentences will have a higher cosine similarity when they share more labels across classification tasks. This result provides an interesting definition for STS: two sentences have a higher similarity score if they share more labels from classification tasks.
A third type of downstream task is classification on input sentence pairs, using tasks such as paraphrase detection Dolan et al. (2004) and natural language inference Bowman et al. (2015). When using paired text as input in SentEval, the engine concatenates the pointwise multiplication of the sentences vectors with the absolute value of their difference. Using the representations from Section 2, the concatenated vectors would be binary negations of each other, highlighting the labels that are shared between the sentences.
8 Conclusion
In this work we have examined the consequences of evaluating Universal Sentence Embeddings on downstream classification tasks by viewing Universal Sentence Representations as a low-rank approximation of the sentence-task label matrix. Our work shows that sentences with fewer labels across all tasks have smaller reconstruction loss, and are therefore easier to compress in a low-dimensional vector space. Furthermore, depending on the amount of such sentences, either the most dense or the moderately dense will have the highest loss. Future work should seek to provide a better theoretical distribution of labels across the sentence-task label matrix, as well as extend analysis to the infinite-dimensional case.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Agirre et al. (2012) Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. Semeval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation , pages 385–393. Association for Computational Linguistics.
- 2Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference of Learning Representations .
- 3Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages 632–642.
- 4Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. In Empirical Methods of Natural Language Processing (System Demonstration) .
- 5Conneau and Kiela (2018) Alexis Conneau and Douwe Kiela. 2018. Senteval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) .
- 6Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages 670–680.
- 7Dolan et al. (2004) Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th international conference on Computational Linguistics , page 350. Association for Computational Linguistics.
- 8Eckart and Young (1936) Carl Eckart and Gale Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika , 1(3):211–218.
