NoPPA: Non-Parametric Pairwise Attention Random Walk Model for Sentence   Representation

Xuansheng Wu; Zhiyi Zhao; Ninghao Liu

arXiv:2302.12903·cs.CL·February 28, 2023

TL;DR

NoPPA is a non-parametric, training-free sentence embedding model that leverages pre-trained word embeddings and word frequencies, outperforming traditional bag-of-words methods and rivaling state-of-the-art non-parametric approaches.

Contribution

This study introduces the first non-parametric attention mechanism that breaks the bag-of-words assumption for sentence representation.

Findings

1. Outperforms all bag-of-words-based methods on eight classification tasks.

2. Provides comparable or better performance than existing non-parametric methods.

3. Visualizations show understanding of topics, phrases, and causalities.

Abstract

We propose a novel non-parametric/un-trainable language model, named Non-Parametric Pairwise Attention Random Walk Model (NoPPA), to generate sentence embeddings using only pre-trained word embeddings and pre-counted word frequencies. To the best of our knowledge, this study is the first successful attempt to break the constraint of the bag-of-words assumption with a non-parametric attention mechanism. We evaluate our method on eight different downstream classification tasks. The experimental results show that NoPPA outperforms all kinds of bag-of-words-based methods on each dataset and provides comparable or better performance than state-of-the-art non-parametric methods on average. Furthermore, visualization supports that NoPPA can understand contextual topics, common phrases, and word causalities. Our model is available at https://github.com/JacksonWuxs/NoPPA.

Tables

Table 1: Sentence embedding performance on downstream tasks.

Model	MR	CR	SUBJ	MPQA	SST2	TREC	MRPC	SICK-E	Avg
FastSent	70.8	78.4	88.7	80.6		76.8	72.2		77.9
USE (DAN)	74.0	80.5	91.9	83.5	80.3	89.6	71.8	80.4	81.5
Sent2Vec	75.8	80.3	91.1	85.9		86.4	72.5		82.0
SBERT	83.6	89.4	94.4	89.9	89.0	89.6	76.0		87.4
GloVe-avg*	77.4	80.0	91.6	87.8	82.2	84.0	73.2	79.2	81.9
SIF*	77.6	78.7	91.3	87.3	82.3	79.6	71.9	75.1	80.5
TFIDF*	77.5	79.5	91.9	87.7	81.7	83.6	74.0	79.3	81.9
VLAWE	77.7	79.2	91.7	88.1	80.8	87.0	72.8	81.2	82.3
DCT*	78.3	80.0	92.5	88.2	82.5	88.6	73.6	81.8	83.2
S3E*	77.9	79.6	91.5	87.0	82.6	83.8	73.9	78.4	81.8
GEM	78.8	81.1	93.1	89.4	83.6	88.6	73.4	85.3	84.2
CE-avg	77.4	80.1	92.4	88.0	81.9	87.2	73.6	81.5	82.7
CE-avg+NR	77.7	80.4	92.4	88.1	83.4	87.2	73.6	81.5	83.1
CE+SFW	77.9	80.2	92.4	88.2	82.9	88.0	74.1	81.2	83.2
NoPPA	78.0 ± 0.12	80.5 ± 0.15	92.9 ± 0.06	88.4 ± 0.03	84.1 ± 0.12	88.2 ± 0.22	74.7 ± 0.21	81.6 ± 0.16	83.6

Table 2: Best hyper-parameters for different datasets.

Seed	Dataset	$a$	$k$	test-acc
1034	MR	0.05	22	78.27
	SST2	0.1	2	84.29
	SUBJ	0.03	21	92.90
	MPQA	0.1	6	88.35
	CR	0.1	11	80.43
	SICK-E	0.03	21	81.83
	MRPC	0.1	6	74.84
	TREC	0.1	16	88.00
1314	MR	0.04	15	77.96
	SST2	0.1	4	84.13
	SUBJ	0.02	9	92.91
	MPQA	0.01	5	88.38
	CR	0.04	15	80.34
	SICK-E	0.01	9	81.55
	MRPC	0.1	1	74.67
	TREC	0.06	1	88.20
20220505	MR	0.01	15	78.04
	SST2	0.1	5	83.91
	SUBJ	0.03	7	92.8
	MPQA	0.02	3	88.31
	CR	0.02	15	80.72
	SICK-E	0.01	14	81.57
	MRPC	0.01	5	74.55
	TREC	0.1	7	88.0
20220508	MR	0.1	12	77.93
	SST2	0.07	2	84.07
	SUBJ	0.02	5	92.84
	MPQA	0.03	11	88.39
	CR	0.09	2	80.53
	SICK-E	0.03	2	81.71
	MRPC	0.02	2	74.38
	TREC	0.03	13	88.20
20220904	MR	0.02	12	77.99
	SST2	0.07	6	84.13
	SUBJ	0.08	12	92.96
	MPQA	0.1	3	88.4
	CR	0.1	14	80.61
	SICK-E	0.07	22	81.37
	MRPC	0.1	14	74.96
	TREC	0.06	17	88.60

Equations

(1) $P(w \mid c_{t}) \propto \mathrm{Similarity}(c_{t}, v_{w})$

(2) $Pc(w_{i} \mid c_{s}) = \sum_{j=1}^{n} A_{ij} D_{ij}(c_{s})$

(3) Definition of $D_{ij}(c_{s})$ with normalizer $Z_{c}(v_{i})$ [right-hand side missing]

(4) $P(w_{i} \mid c_{s}) = \alpha Pr(w_{i}) + (1-\alpha) \sum_{j=1}^{n} A_{ij} D_{ij}(c_{s})$

(5) $P(s) = \sqrt[n]{\prod_{i=1}^{n} P(w_{i} \mid c_{s})}$

(6) $F_{s} = \log P(s) = \frac{1}{n}\sum_{i=1}^{n} F_{w_{i}}$

(7) $F_{w_{i}} = \log\left[\alpha Pr(w_{i}) + (1-\alpha) \sum_{j=1}^{n} A_{ij} D_{ij}\right]$

(8) $\frac{\partial F_{w_{i}}}{\partial c_{s}}$ , expressed through $\frac{\partial \cos(v_{ij}, c)}{\partial c}$ [right-hand side missing]

(9) $F_{w_{i}}(c_{s}) \approx C + \frac{a}{\pi\left(Pr(w_{i}) + \frac{a}{2}\right)} \sum_{j=1}^{n} A_{ij} v_{ij}$

(10) $\tilde{c}_{s} \propto \frac{1}{n} \sum_{w_{i} \in s} \frac{a}{Pr(w_{i}) + \frac{a}{2}} \sum_{j=1}^{n} A_{ij} v_{ij}$

(11) $v^{\prime}_{i} = WordEmbed(w_{i}) + PosEmbed(i)$

(12) $PosEmbed(i, 2m) = \sin\left(i / 10000^{2m/d_{v}}\right)$ , $PosEmbed(i, 2m+1) = \cos\left(i / 10000^{2m/d_{v}}\right)$

(13) $v_{ij} = \left[v_{i};\ \log_{2}\left(1 + (v^{\prime}_{j} - v^{\prime}_{i})^{2}\right)\right]$

(14) $A_{ij} = \mathrm{softmax}\left(\frac{v^{\prime}_{i} v^{\prime\top}_{j}}{d}\right)$

(15) $X = USV$

(16) $\hat{c}_{s} = \tilde{c}_{s} - \tilde{c}_{s} V_{k}^{T} V_{k}$

Code & Models

Repositories

jacksonwuxs/noppa (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques

Full text

NoPPA: Non-Parametric Pairwise Attention Random Walk Model for Sentence Representation

Xuansheng Wu

University of Georgia

210 S Jackson Street, Athens

Georgia, United States

Zhiyi Zhao

Tufts University

161 College Avenue, Medford

Massachusetts, United States

Ninghao Liu

University of Georgia

210 S Jackson Street, Athens

Georgia, United States

1 Introduction

Precisely representing sentence-level semantic information is a cornerstone of many natural language understanding tasks. In the era of pre-trained language models (Devlin et al., 2018; Floridi and Chiriatti, 2020), researchers have proposed various training strategies (Reimers and Gurevych, 2019; Barkan et al., 2020; Wu et al., 2021; Cheng, 2021; Jiang et al., 2022), post-processing procedures (Li et al., 2020; Huang et al., 2021), and pooling methods (Reimers and Gurevych, 2019; Wang and Kuo, 2020) to generate high-quality sentence embeddings from such large-scale models. However, the large number of parameters inside pre-trained models requires substantial computational resources, which are not accessible in all scenarios.

To develop low-resource models, some researchers take another direction and encode sentence embeddings using simpler language models and pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Wieting et al., 2015a). The most straightforward approach in this direction is a weighted average of word embeddings (Wieting et al., 2015b; Arora et al., 2017; Ethayarajh, 2018), which is essentially a bag-of-words model under the assumption that words are independent. These methods are simple, but their performance suffers seriously from the bag-of-words assumption. Enhancing word embeddings by combining multi-resource word embeddings (Mekala et al., 2016; Rücklé et al., 2018) was a popular strategy to improve bag-of-words models in the early days. Recent studies usually remove the bag-of-words assumption to improve such simple language models, either by capturing time information (Kayal and Tsatsaronis, 2019; Almarwani et al., 2019) or by using semantic subspace analysis (Ionescu and Butnaru, 2019; Wang et al., 2021).

In this paper, we propose the Non-Parametric Pairwise Attention Random Walk Model (NoPPA), which boosts the modeling level of bag-of-words models from span N-grams to pairwise Bi-grams with non-parametric attention. Generally, NoPPA estimates the probability of a word appearing in a given sentence based on the non-contextual probability of the word, the fluency of the word in the current sentence, and the likelihood that the word expresses the intent of the sentence. We further prove that the sentence embedding estimated by NoPPA is a weighted average of pairwise word embeddings. Specifically, we calculate the pairwise word embedding by applying a non-linear activation function to the difference of the word pair. We evaluate our method on eight different text classification tasks. Results show that NoPPA outperforms all weighted-averaging-based methods and most state-of-the-art non-parametric methods, including time-information-infused and latent-space-analysis-based methods. Visual analysis shows that NoPPA dynamically adjusts the contributions of words to the sentence embedding according to the context, while the proposed non-parametric pairwise attention captures common multi-word phrases and causation between words.

We organize this paper as follows. In Section 2, we first review recent studies on sentence embedding, including parametric and non-parametric methods. Then, we introduce our method in Section 3. Next, we evaluate NoPPA on eight tasks and report results in Section 4. A deeper analysis of NoPPA is presented in Section 5. Finally, we summarize our work in Section 6 and discuss potential limitations in Section 7.

2 Related Work

Recent studies in sentence embedding can be divided into two categories according to whether there are trainable parameters.

There is a long history of capturing sentence embeddings with trainable models. Parametric methods usually train their language model on supervised tasks, and the sentence embedding is a by-product. Most early parametric models (Kiros et al., 2015; Conneau et al., 2017; Logeswaran and Lee, 2018; Peters et al., 2018) are powered by recurrent neural networks (such as RNN, LSTM, and GRU). In contrast with the above methods, Sent2Vec (Pagliardini et al., 2017) is an unsupervised method that learns sentence embeddings with n-gram features. After 2017, various BERT-based methods (Cer et al., 2018; Reimers and Gurevych, 2019; Wang and Kuo, 2020; Li et al., 2020; Gao et al., 2021; Su et al., 2021) were designed, empowered by self-supervised learning on large-scale unlabeled corpora.

Non-parametric methods choose a more straightforward way that heavily relies on high-quality pre-trained word vectors (Wieting et al., 2015b; Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2016; Salle et al., 2016). Individually weighted averaging each word embedding is the easiest way (Wieting et al., 2015b; Arora et al., 2017; Ethayarajh, 2018; Yang et al., 2018) based on the bag-of-words model. There were some attempts to remove the assumption of ignoring word orders from the bag-of-words models by capturing time information in the signal domain (Kayal and Tsatsaronis, 2019; Almarwani et al., 2019). Combining multi-resource word embedding (Mekala et al., 2016; Rücklé et al., 2018) and semantic subspace analysis (Wang et al., 2021; Ionescu and Butnaru, 2019) are methods of enhancing the original static word embedding.

Normally, parametric models are expected to be better than non-parametric models. However, we need non-parametric methods in some scenarios where heavy computing requirements cannot be tolerated. Although the performance of recent non-parametric models has gradually improved, their computing complexity has also increased significantly, violating the purpose of designing them.

3 Pairwise Attentive Random Walk Model for Sentence Embedding

The most significant limitation of bag-of-words models is treating each word equally across the whole corpus. This assumption is against human intuition that people change the meaning of a word respecting the surrounding context. Since the self-attention mechanism can capture surrounding word information well, combining the bag-of-words model and the attention mechanism should improve the vanilla bag-of-words model.

In this section, we first describe our pairwise attentive random walk model in Section 3.1. Then we formalize the sentence embedding of the proposed language model in Section 3.2. Our design of non-parametric attention mechanisms to integrate contextual information is discussed in Section 3.3. We cover the method used to remove the error introduced by the Taylor expansion in Section 3.4. Finally, we analyze the time complexity in Section 3.5. We summarize our method in Algorithm 1.

3.1 Pairwise Attentive Random Walk Model

The latent variable generative model Arora et al. (2016) treats the corpus generation as a dynamic process. The process is driven by the random walk of a discourse vector $c_{t}\in\mathbb{R}^{d}$ at the time $t$ , and each word $w$ in the vocabulary has a vector $v_{w}\in\mathbb{R}^{d}$ . Both $c_{t}$ and $v_{w}$ are latent variables. The discourse vector represents the intent of the speaker. Thus, the probability of observing a word $w$ at time $t$ is:

$$P(w \mid c_{t}) \propto \mathrm{Similarity}(c_{t}, v_{w}). \tag{1}$$

The discourse vector $c_{t}$ does a slow random walk during the generation so that a single discourse embedding $c_{s}$ can replace all the $c_{t}$ in the sentence $s=\{w_{1},w_{2},...,w_{n}\}$ where $n$ is sentence length.

In this work, we extend the random walk process from the uni-gram word to bi-gram word pairs because we notice that words could have different meanings in different contexts, and some words always appear together with the others to form phrases. Thus, we assume that each pair of words $w_{i}$ and $w_{j}$ in the vocabulary has an embedding $v_{ij}\in\mathbb{R}^{d}$ and define the contextual probability $Pc$ of observing a word $w_{i}$ in a sentence $s$ as:

$$Pc(w_{i} \mid c_{s}) = \sum_{j=1}^{n} A_{ij} D_{ij}(c_{s}), \tag{2}$$

where

[TABLE]

We first measure the similarity $D_{ij}(c_{s})\in\mathbb{R}$ between the word pair embedding $v_{ij}$ and the discourse embedding $c_{s}$ to calculate the contextual probability $Pc(w_{i}|c_{s})$ . We also evaluate the probability of a word pair $A_{ij}\in[0,1]$ in line with language fluency. Then we have a condition that $\sum_{j=1}^{n}A_{ij}=1,i=1,2,...,n$ by assuming each word can only be triggered by one surrounding word. In this study, we use the angular distance between $v_{ij}$ and $c_{s}$ to measure their similarity. That is $d(v_{ij},c_{s})=1-\frac{\arccos\cos(v_{ij},c_{s})}{\pi}$ .
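To make the angular similarity concrete, the following minimal sketch computes $d(v_{ij},c_{s})=1-\frac{\arccos\cos(v_{ij},c_{s})}{\pi}$ with NumPy. The function name `angular_similarity` and the toy vectors are illustrative assumptions, not part of the released code.

```python
import numpy as np

def angular_similarity(v, c):
    """Angular similarity 1 - arccos(cos(v, c)) / pi, which lies in [0, 1]."""
    cos = np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c))
    cos = np.clip(cos, -1.0, 1.0)  # guard against rounding slightly outside [-1, 1]
    return 1.0 - np.arccos(cos) / np.pi

v = np.array([1.0, 0.0])
same_direction = angular_similarity(v, np.array([2.0, 0.0]))  # parallel vectors
orthogonal = angular_similarity(v, np.array([0.0, 3.0]))      # right angle
opposite = angular_similarity(v, np.array([-1.0, 0.0]))       # antiparallel
print(same_direction, orthogonal, opposite)
```

Parallel vectors score 1, orthogonal vectors 0.5, and opposite vectors 0, so the measure depends only on the angle and not on the vector lengths.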

To be more realistic, we consider the probability that a word appearing with a given discourse embedding $c_{s}$ is affected by the non-contextual/global probability $Pr(w_{i})$ and the contextual/local probability $Pc(w_{i}|c_{s})$ . Thus, we measure the probability of observing a word $w_{i}$ as follows:

$$P(w_{i} \mid c_{s}) = \alpha Pr(w_{i}) + (1-\alpha) \sum_{j=1}^{n} A_{ij} D_{ij}(c_{s}), \tag{4}$$

where $\alpha$ is a scalar and $Pr(w_{i})$ is the uni-gram probability of word $w_{i}$ that appears in the corpus.

With Equation (4), we can define the probability of observing a sentence $s$ normalized with the sentence length $n$ as follows:

$$P(s) = \sqrt[n]{\prod_{i=1}^{n} P(w_{i} \mid c_{s})}. \tag{5}$$

3.2 Sentence Embedding Estimation

We treat the Maximum Log-Likelihood Estimation (MLE) of $c_{s}$ from Equation (5) as the sentence embedding from the model. The log-likelihood of the sentence can be formalized as

$$F_{s} = \log P(s) = \frac{1}{n}\sum_{i=1}^{n} F_{w_{i}}, \tag{6}$$

where

$$F_{w_{i}} = \log\left[\alpha Pr(w_{i}) + (1-\alpha) \sum_{j=1}^{n} A_{ij} D_{ij}\right]. \tag{7}$$

Maximizing $F_{s}$ equals maximizing $F_{w_{i}}$ . We can approximate $F_{w_{i}}$ using a first-degree Taylor expansion to simplify the calculation. We borrow the fundamental assumption proposed by Arora et al. (2016) that the pair of words embedding $v_{ij}$ is roughly uniformly distributed in the latent space so that $Z_{c}$ can be seen as a constant. Thus, the first derivative of $F_{w_{i}}$ is

[TABLE]

Assume that we can find a vector $v_{0}$ that is orthogonal to any $v_{ij}$ with length $\frac{1}{||v_{ij}||}$ . The approximation of $F_{w_{i}}$ around the vector $v_{0}$ is

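$$F_{w_{i}}(c_{s}) \approx C + \frac{a}{\pi\left(Pr(w_{i}) + \frac{a}{2}\right)} \sum_{j=1}^{n} A_{ij} v_{ij}, \tag{9}$$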

where $C$ indicates a constant and $a=\frac{1-\alpha}{Z\alpha}$ .

By applying MLE to estimate $c_{s}$ on Equation (9), we have

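$$\tilde{c}_{s} \propto \frac{1}{n} \sum_{w_{i} \in s} \frac{a}{Pr(w_{i}) + \frac{a}{2}} \sum_{j=1}^{n} A_{ij} v_{ij}. \tag{10}$$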

Equation (10) indicates that the sentence embedding is equivalent to a weighted average of contextual embeddings. Specifically, the contextual embedding is $\sum_{j=1}^{n}A_{ij}v_{ij}$ , while the weight of each contextual embedding is $\frac{a}{Pr(w_{i})+\frac{a}{2}}$ . Compared with the SIF (Arora et al., 2017) and uSIF (Ethayarajh, 2018) models, the primary difference of our model is that it uses a pairwise bi-gram model instead of a uni-gram model. Moreover, we do not introduce a common discourse vector $c_{0}$ as they did.
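The weighted average in Equation (10) can be sketched in a few lines of NumPy. The sketch assumes that the attention matrix `A` and the pairwise embeddings `V` are already available; the function names and the toy inputs are hypothetical and only illustrate the aggregation step.

```python
import numpy as np

def smooth_frequency_weight(p, a):
    """Weight a / (Pr(w_i) + a / 2) applied to each contextual embedding."""
    return a / (p + a / 2.0)

def sentence_embedding(A, V, probs, a):
    """Weighted average of contextual embeddings (defined up to a constant factor).

    A:     (n, n) attention scores, each row sums to 1
    V:     (n, n, D) pairwise embeddings v_ij
    probs: (n,) unigram probabilities Pr(w_i)
    """
    contextual = np.einsum("ij,ijd->id", A, V)    # sum_j A_ij * v_ij
    weights = smooth_frequency_weight(probs, a)   # one weight per word
    return (weights[:, None] * contextual).mean(axis=0)

# Toy input: two words, uniform attention, all-ones pairwise embeddings.
A = np.full((2, 2), 0.5)
V = np.ones((2, 2, 3))
probs = np.array([0.0, 0.05])
c_tilde = sentence_embedding(A, V, probs, a=0.1)
print(c_tilde)
```

In this toy case the two words receive weights 2 and 1, so every coordinate of the result equals their mean, 1.5.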

3.3 Contextual Embedding

The pairwise embedding $v_{ij}$ and the attention score $A_{ij}$ are the keys to designing contextual embedding: $v_{ij}$ represents the meaning of the word pair $w_{i}$ and $w_{j}$ , and $A_{ij}$ measures how likely the word pair is to appear together. Our proposed method makes no assumption about $v_{ij}$ and only one assumption about $A_{ij}$ . Therefore, we have great freedom to define a variety of $v_{ij}$ and $A_{ij}$ .

3.3.1 Positional Word Embedding

To capture sequential information, we add the position embedding to the word embedding directly:

$$v^{\prime}_{i} = WordEmbed(w_{i}) + PosEmbed(i). \tag{11}$$

$PosEmbed(i)$ denotes the position embedding for the $i$ th word in the sentence. We reference the definition of position embedding from the Transformer (Vaswani et al., 2017):

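$$PosEmbed(i, 2m) = \sin\left(i / 10000^{2m/d_{v}}\right), \qquad PosEmbed(i, 2m+1) = \cos\left(i / 10000^{2m/d_{v}}\right), \tag{12}$$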

where $i$ is the word position, $m$ is the dimension being generated, and $d_{v}$ is the number of the entire position embedding dimension.
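As an illustration, the sinusoidal position embedding of the Transformer can be generated as follows. This is a minimal sketch that assumes the standard definition from Vaswani et al. (2017), an even $d_{v}$, and positions counted from 0; the function name `pos_embed` is hypothetical.

```python
import numpy as np

def pos_embed(n, d_v):
    """Sinusoidal position embeddings of shape (n, d_v); d_v is assumed even."""
    i = np.arange(n)[:, None]           # word positions
    m = np.arange(d_v // 2)[None, :]    # index of each sine/cosine pair
    angle = i / np.power(10000.0, 2 * m / d_v)
    pe = np.zeros((n, d_v))
    pe[:, 0::2] = np.sin(angle)         # even dimensions 2m
    pe[:, 1::2] = np.cos(angle)         # odd dimensions 2m + 1
    return pe

pe = pos_embed(n=4, d_v=6)
print(pe.shape)
```

Because the table depends only on the position and the dimension, it can be computed once and added to the word embeddings of any sentence.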

3.3.2 Pairwise Embedding Using Log-Kernel

It is tough to estimate pairwise embedding $v_{ij}$ accurately with a limited corpus because of the sparse latent space. For example, Wiki corpus has almost 180,000 unique tokens, and the total number of pairwise word embedding will be 32.4 billion. This sparse hidden parameter space makes any existing algorithms unable to obtain stable estimations. To avoid this issue, we design our pairwise word embedding as

$$v_{ij} = \left[v_{i};\ \log_{2}\left(1 + (v^{\prime}_{j} - v^{\prime}_{i})^{2}\right)\right], \tag{13}$$

where $v_{i}$ is the initial word embedding, and $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation. We call this element-wise non-linear transform between $v^{\prime}_{j}$ and $v^{\prime}_{i}$ the Log-Kernel.
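The Log-Kernel, $v_{ij}=[v_{i};\log_{2}(1+(v^{\prime}_{j}-v^{\prime}_{i})^{2})]$, can be computed for all pairs at once with broadcasting. The following is a sketch under the assumption that `v` holds the original word embeddings and `v_pos` the positional word embeddings, both of shape `(n, d)`; the function name and the toy values are illustrative.

```python
import numpy as np

def pairwise_embedding(v, v_pos):
    """Return V of shape (n, n, 2d) with V[i, j] = [v_i ; log2(1 + (v'_j - v'_i)^2)]."""
    n, d = v.shape
    diff = v_pos[None, :, :] - v_pos[:, None, :]       # diff[i, j] = v'_j - v'_i
    kernel = np.log2(1.0 + diff ** 2)                  # element-wise Log-Kernel
    base = np.broadcast_to(v[:, None, :], (n, n, d))   # v_i repeated for every j
    return np.concatenate([base, kernel], axis=-1)

v = np.array([[1.0, 2.0], [3.0, 4.0]])
v_pos = np.array([[0.0, 0.0], [1.0, 3.0]])
V = pairwise_embedding(v, v_pos)
print(V.shape)
```

No pairwise parameter is estimated here: each $v_{ij}$ is derived from two word vectors, which avoids the sparse pairwise parameter space described above and doubles the embedding dimension.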

3.3.3 Non-Parametric Pairwise Attention

Pairwise attention evaluates the probability that a pair of words occurs together. The only assumption is $\sum_{j=1}^{n}A_{ij}=1$ . The idea of pairwise attention is similar to the self-attention of Transformers, but our model is untrainable. Two natural non-parametric features are the relative position and the semantic similarity of the two words. Thus, we directly take the dot product of each pair of positional word embeddings to measure the probability of a pair of words. We formalize our non-parametric pairwise attention as:

$$A_{ij} = \mathrm{softmax}\left(\frac{v^{\prime}_{i} v^{\prime\top}_{j}}{d}\right), \tag{14}$$

where $d$ is the dimension of embedding $v^{\prime}_{i}$ .
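A minimal NumPy sketch of this attention, $A_{ij}=\mathrm{softmax}(v^{\prime}_{i}v^{\prime\top}_{j}/d)$, is given below. It assumes the softmax is taken over $j$ so that each row sums to 1, and it scales by $d$ as written here rather than by $\sqrt{d}$ as in Transformers; the function name is hypothetical.

```python
import numpy as np

def pairwise_attention(v_pos):
    """Row-wise softmax of scaled dot products between positional word embeddings."""
    n, d = v_pos.shape
    scores = v_pos @ v_pos.T / d
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

A = pairwise_attention(np.eye(2))
print(A.sum(axis=1))
```

The row normalization enforces the only assumption the model places on $A_{ij}$, and no weight matrix is involved.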

3.4 Noise Removal

We apply a Taylor expansion to simplify the calculation of the log-likelihood $F_{w}$ in Equation (9). However, the Taylor expansion yields only an approximate estimate, and some small error terms are ignored in Equation (10). Thus, the final sentence embedding removes the projections on the singular vectors with the lowest singular values to remove the uncertainty introduced by the Taylor expansion.

We represent each sentence from the dataset with a vector and denote the matrix of sentence embedding as $X\in\mathbf{R}^{l\times d}$ , where $l$ indicates the number of sentences and $d$ is the dimension of sentence vectors. Then, to apply singular value decomposition, we find three matrices $U\in\mathbf{R}^{l\times l}$ , $S\in\mathbf{R}^{l\times d}$ , and $V\in\mathbf{R}^{d\times d}$ that satisfy:

$$X = USV, \tag{15}$$

where $S$ is a diagonal matrix.

We further assume that the last $k\leq d$ singular vectors with the smallest singular values contain the error terms of estimation. So the final estimation of each sentence vector will remove the projection on $V_{k}\in\mathbf{R}^{k\times d}$ as below

$$\hat{c}_{s} = \tilde{c}_{s} - \tilde{c}_{s} V_{k}^{T} V_{k}. \tag{16}$$
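The noise removal step, $\hat{c}_{s}=\tilde{c}_{s}-\tilde{c}_{s}V_{k}^{T}V_{k}$, can be sketched with `numpy.linalg.svd`. For brevity, this sketch assumes the decomposition is computed on the same matrix `X` that is being cleaned; the function name `remove_noise` and the toy matrix are illustrative.

```python
import numpy as np

def remove_noise(X, k):
    """Remove the projections of each row of X on the k right singular
    vectors that have the smallest singular values."""
    if k == 0:
        return X.copy()
    # Singular values are returned in descending order, so the last k rows
    # of Vt belong to the smallest singular values.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[-k:]                     # shape (k, d)
    return X - X @ V_k.T @ V_k

X = np.diag([3.0, 2.0, 1.0])
cleaned = remove_noise(X, k=1)
print(cleaned)
```

In the toy example the direction with singular value 1 is removed and the other two are left untouched.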

3.5 Time Complexity Analysis

We denote the number of words in a sentence as $n$ , the dimension of word embedding as $d$ , and the number of noise singular vectors as $k$ . Since we can pre-calculate $U$ , $S$ , and $V$ for a specific dataset during the training stage, the time complexity of inferring the embedding of one sentence is $O(n^{2}d+k^{2}d)$ . Since sentences are roughly 20 to 30 words long, and a sufficient $k$ is always under 20, NoPPA should have lower time complexity than solutions based on latent space analysis (Yang et al., 2018; Wang et al., 2021).

4 Experiment

We evaluate our method on eight different downstream tasks. We also conduct ablation experiments to discuss the sufficiency of each strategy of our approach. The sensitivity analysis finally discovers the impact of hyper-parameters.

4.1 Benchmarks

We compare not only against other bag-of-words-based methods, including Average Embedding, TF-IDF Weighted Average, and SIF (Arora et al., 2017), but also against improved methods such as DCT (Almarwani et al., 2019), VLAWE (Ionescu and Butnaru, 2019), GEM (Yang et al., 2018), and S3E (Wang et al., 2021). We further compare our method with popular parametric methods with similar sentence embedding dimensions, such as FastSent (Hill et al., 2016), Sent2Vec (Moghadasi and Zhuang, 2020), USE (DAN) (Cer et al., 2018), and SBERT (Reimers and Gurevych, 2019).

4.2 Tasks

We evaluate our method on eight classification tasks. These tasks consist of fine-grained sentiment classification (MR, CR, MPQA, SST-2) (Pang and Lee, 2005; Hu and Liu, 2004; Wiebe et al., 2005; Socher et al., 2013), question type classification (TREC) (Voorhees and Harman, 2003), subjectivity/objectivity classification (SUBJ) (Pang and Lee, 2004), entailment relation classification (SICK-E) (Lai and Hockenmaier, 2014), and paraphrase identification (MRPC) (Dolan et al., 2004). Together, these datasets measure the quality of sentence embeddings.

4.3 Detail Settings

We evaluate our methods with the SentEval toolkit (Conneau and Kiela, 2018), a library for measuring the quality of sentence embeddings. For all tasks, we build a one-hidden-layer MLP with 50 parameters as the classifier. The classifier is optimized with Adam (Kingma and Ba, 2014) with a batch size of 64 and a dropout rate of 0.0. We tune the hyper-parameters $k$ ( $k\in[0,24]$ ) and $a$ ( $a\in\{0.01,0.15\}$ ) with Bayesian optimization over 40 iterations. Our model relies on GloVe (Pennington et al., 2014) as static word embedding and on word frequencies collected from the Wiki corpus. To compare methods fairly, we re-evaluate some benchmarks ourselves with the same experimental settings.

4.4 Main Results

Experimental results on eight different supervised downstream tasks are listed in Table 1. We report both the mean and the standard deviation for the full model NoPPA over five different random seeds in the last row. The models marked with * in the table are evaluated by ourselves with the same classifier setting, and we also apply grid search to find the best hyper-parameters for them. All non-parametric methods reported in Table 1 use GloVe word embedding only. The table also shows ablation study results, which are discussed later. Here, CE denotes averaging the contextual embedding of Section 3.3, NR denotes the noise removal strategy described in Section 3.4, SFW stands for the smooth frequency weight $\frac{a}{Pr(w_{i})+\frac{a}{2}}$ from Equation (10), and CE-avg indicates that SFW is constantly equal to 1, so that we can evaluate the quality of the contextual embedding alone.

NoPPA generates high-quality sentence embedding for downstream tasks.

The performance of our method on all downstream tasks is shown in the last row of Table 1. Our method makes significant progress compared to all weighted-averaging-based methods, including GloVe-avg, SIF, and TFIDF. We also outperform DCT, which is designed to capture sequential information using the discrete cosine transform, on average over the eight datasets. We reach better performance than most latent space analysis methods, including VLAWE and S3E. Although we do not reach the score of GEM, it is worth mentioning that its time complexity is significantly higher than ours since it uses SVD in the inference phase. Compared with parametric models, we do better than most models, except SBERT.

CE, SFW, and NR are sufficient strategies.

Empirically speaking, the architecture of NoPPA can be separated into three components: the contextual embedding of Section 3.3, the smooth frequency weight $\frac{a}{Pr(w_{i})+\frac{a}{2}}$ , and the noise removal strategy of Section 3.4. According to Table 1, CE-avg improves over the baseline model GloVe-avg on six of the eight datasets; it ties on MR (77.4) and is slightly lower on SST2 (81.9 vs. 82.2). Both SFW and NR contribute to improving NoPPA on average.

NoPPA is slowed down by Log-Kernel.

S3E (Wang et al., 2021) is the fastest method among all benchmarks. We compare the inference time of NoPPA and S3E on 19,854 sentences from the SICK-E dataset. In our environment (Intel i9-11900KF @ 3.5GHz), the total inference time of S3E is 3.23 seconds (average over 10 runs: 3.2267, standard error: 0.0249), while that of NoPPA is 7.55 seconds (average over 10 runs: 7.5489, standard error: 0.0738). However, we found that NoPPA without the $\log_{2}$ operation takes only 2.46 seconds (average over 10 runs: 2.4524, standard error: 0.0259), which is 1.32 times faster than S3E. Ignoring implementation differences, this result corroborates our time complexity analysis of NoPPA in Section 3.5. Since the run-time speed of NoPPA is affected by the implementation of the $\log_{2}$ function, we suggest that users who seek higher speed explore other non-linear kernels.
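The reported speed ratio can be reproduced from the 10-run averages given in this paragraph. The short check below only restates those numbers; the variable names are illustrative.

```python
# Average runtimes in seconds over 10 runs, as reported in the paragraph above.
s3e = 3.2267
noppa = 7.5489
noppa_without_log = 2.4524

speedup_without_log = s3e / noppa_without_log  # S3E time relative to NoPPA without log2
slowdown_with_log = noppa / s3e                # full NoPPA time relative to S3E
print(round(speedup_without_log, 2), round(slowdown_with_log, 2))
```

The first ratio rounds to the 1.32 quoted above; the second shows that the full model, including the $\log_{2}$ operation, takes roughly 2.34 times as long as S3E.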

5 Analysis

This section answers two questions:

1. How to choose the best hyper-parameter $a$ for different datasets?

2. Why does the model work?

5.1 Impacts of hyper-parameter $a$

Since we count the word frequency $Pr(w_{i})$ from a large corpus instead of the downstream datasets, the smooth frequency weight $weight(w_{i})=\frac{a}{Pr(w_{i})+\frac{a}{2}}$ is the same across all datasets for a given $a$ . To analyze how the parameter $a$ influences the weights of words, we calculate the average weights of 13 stop words ({of, the, a, in, at, to, with, by, and, are, is, ".", ","}) and 16 meaningful words ({film, man, women, dogs, cats, name, air, phone, special, large, past, emotional, easy, need, found, show}) under different values of $a$ . Then, we plot the average weights of the two groups in Figure 2.

According to Figure 2, the average weight of stop words drops faster than that of meaningful words when we gradually decrease the parameter $a$ . This makes it possible to filter out stop words and enhance the contributions of meaningful words by choosing $a$ appropriately. From Figure 2, we conclude that the best $a$ for distinguishing the two groups of words should lie between $10^{-1}$ and $10^{-2}$ . If $a$ is smaller than $10^{-2}$ , the meaningful words are weakened as well. The experiments support this assertion: NoPPA never reaches its best score with $a$ higher than $0.1$ on any dataset (see Appendix B).
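The effect of $a$ on the smooth frequency weight can be checked numerically. The sketch below uses two hypothetical unigram probabilities, one for a very frequent stop word and one for a rarer content word; these values are assumptions for illustration and are not taken from the Wiki corpus counts.

```python
def smooth_frequency_weight(p, a):
    """Smooth frequency weight a / (Pr(w) + a / 2); it approaches 2 as Pr(w) -> 0."""
    return a / (p + a / 2.0)

# Hypothetical unigram probabilities, chosen only for illustration.
P_STOP_WORD = 0.05      # a very frequent token such as "the"
P_CONTENT_WORD = 1e-5   # a rarer, meaningful token

weights = {
    a: (smooth_frequency_weight(P_STOP_WORD, a),
        smooth_frequency_weight(P_CONTENT_WORD, a))
    for a in (1.0, 1e-1, 1e-2, 1e-3)
}
for a, (w_stop, w_content) in weights.items():
    print(f"a={a:g}  stop={w_stop:.3f}  content={w_content:.3f}")
```

With these assumed probabilities, lowering $a$ from 1 to 0.01 reduces the stop-word weight from about 1.82 to about 0.18, while the content-word weight stays close to its upper bound of 2.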

5.2 What Knowledge Does NoPPA Learn

We try to understand the model from different views. One of the views is a result-oriented approach which means studying the sentence embedding generated by the method. The second view is a process-oriented approach in which we examine the model to understand what it learned.

Understanding Context Topics

We take three sentences ("the girl eats a cake", "the beautiful girl eats a cake", and "the girl eats a delicious cake") as examples and compare the contribution of the same word across different sentences. In detail, we concatenate the original word embedding of each word horizontally so that the size of each word embedding is the same as the size of the sentence embedding. Then we draw the cosine similarity between the concatenated word embedding and the sentence embedding in Figure 1.
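The comparison described here can be sketched as follows, assuming that the sentence embedding dimension is an integer multiple of the word embedding dimension, so that the word vector can be tiled to the same size. The function name `word_contribution` and the toy vectors are hypothetical.

```python
import numpy as np

def word_contribution(word_vec, sent_vec):
    """Cosine similarity between a tiled word embedding and a sentence embedding."""
    reps = sent_vec.shape[0] // word_vec.shape[0]
    tiled = np.tile(word_vec, reps)   # repeat the word vector to match the size
    return tiled @ sent_vec / (np.linalg.norm(tiled) * np.linalg.norm(sent_vec))

sentence = np.array([1.0, 0.0, 1.0, 0.0])
aligned = word_contribution(np.array([1.0, 0.0]), sentence)
unrelated = word_contribution(np.array([0.0, 1.0]), sentence)
print(aligned, unrelated)
```

A word whose tiled vector points in the same direction as the sentence embedding scores 1, and an orthogonal one scores 0.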

The three sentences use the simplest syntax: a subject, a verb, and an object. Everyone can easily predict the topics of these sentences when they read through them. For example, in the second statement, we might infer that the speaker will talk about how beautiful the girl is in the following conversation. People always pay more attention to the words included in their predicted topics. It means that the subject ("girl") should make more contribution than the object ("cake") to the sentence meaning in the second sentence.

We observe these dynamic weights that change with contextual topics in Figure 1. It indicates that our model dynamically assigns different attention to the same word regarding different contexts. Particularly, the subject ("girl") in the first and the second sentence has 47% similarity to the sentences. In contrast, it has only 41% similarity in the third sentence. We can find this phenomenon in analyzing the verb ("eats") and the object ("cake") as well. Surprisingly, although stopwords ("the", "a") show this phenomenon as well, they are consistently assigned with low attention. Thus, we conclude that our method can detect contextual topics and give more attention to topic words.

Detecting Linguistic Phrases and Causation

In linguistics, sentence meaning has hierarchical levels. A robust language model can detect sophisticated semantic relationships. To investigate what kinds of knowledge NoPPA uses, we choose the sentence "David got injured during rough hiking in the mountain, so he is bleeding right now" and draw the heat map of the pairwise weight scores from NoPPA in Figure 3.

The example sentence contains hierarchical semantic relationships. At the lexical level, it has common linguistic phrases such as "got injured", "rough hiking in somewhere", and "right now". The causal relationship between "injured" and "bleeding" is the next level of semantic understanding. The top-level semantic relationship is the coreference between "David" and "he".

At first glance at Figure 3, we can easily see that the model captures all the linguistic phrases mentioned above. Looking at Figure 3 in detail, we notice that, surprisingly, the row of "bleeding" also slightly lights up the column "injured". In fact, "bleeding" assigns a weight of 4.8% to "injured", which is the third-highest weight assigned by "bleeding". (The attention scores of "bleeding" to all words are David=1.21%, got=2.17%, injured=4.8%, during=3.41%, rough=2.51%, hiking=1.81%, in=1.49%, the=1.55%, mountain=1.21%, ","=2.02%, so=2.68%, he=4.59%, is=4.51%, bleeding=55.25%, right=6.47%, now=3.5%; the sum of all attention scores of "bleeding" is weighted by SFW to 99.19%.) The two highest weights are assigned to the word itself and to the adjacent word "right". According to this finding, although the attention score is not as large as expected, we can still conclude that NoPPA can identify long-range causal relationships. However, NoPPA still cannot handle coreference resolution well. We provide more examples in Appendix A.
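The ranking claim can be verified directly from the attention scores listed for "bleeding". The snippet below only re-uses those reported values; the variable names are illustrative.

```python
# Attention scores (in percent) assigned by "bleeding", as listed above.
scores = {
    "David": 1.21, "got": 2.17, "injured": 4.8, "during": 3.41,
    "rough": 2.51, "hiking": 1.81, "in": 1.49, "the": 1.55,
    "mountain": 1.21, ",": 2.02, "so": 2.68, "he": 4.59,
    "is": 4.51, "bleeding": 55.25, "right": 6.47, "now": 3.5,
}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
total = sum(scores.values())
print(top3, round(total, 2))
```

The three largest scores belong to "bleeding", "right", and "injured", in that order, and the listed values sum to 99.18, which agrees with the reported 99.19% up to rounding.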

6 Conclusion

We propose the Non-Parametric Pairwise Attention Random Walk Model (NoPPA) to generate high-quality sentence embeddings with low computational complexity. NoPPA first constructs a contextual embedding (CE) that captures the contextual information of each word using pre-trained static word embeddings, pre-computed static position embeddings, and an element-wise non-linear transform. Then, pre-counted word frequencies are used to assign non-contextual weights (Smooth Frequency Weight, SFW). Next, we compute a weighted average of the contextual embeddings based on the assigned SFW. Finally, the projections on the last few principal components are subtracted to remove the estimation errors (Noise Removal, NR). NoPPA is a non-parametric method, and it runs in only $O(n^{2}d+k^{2}d)$ time during the inference stage. We evaluate NoPPA on eight downstream text classification datasets. According to the results, NoPPA consistently outperforms all bag-of-words-based methods and, on average, does better than non-parametric methods using time information and most latent-space-analysis-based methods, with lower time complexity. Visualization analysis supports that NoPPA can detect context topics, common phrases, and long-range word-word causation.
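The last three stages of this pipeline (SFW, weighted averaging, and NR) can be illustrated with a minimal NumPy sketch. Several things here are assumptions rather than the paper's implementation: the SFW form $a/(a+p(w))$ is borrowed from the SIF baseline of Arora et al. (2017) and is not restated in this section, the contextual embeddings are random toy data standing in for CE, and every function name is hypothetical. Which principal directions to subtract is left to the caller.

```python
import numpy as np

def smooth_frequency_weight(word_prob, a=1e-3):
    """Assumed SIF-style weight a / (a + p(w)): rarer words get larger weights."""
    return a / (a + np.asarray(word_prob, dtype=float))

def weighted_sentence_embedding(contextual_embeddings, weights):
    """Weighted average of per-word embeddings, shape (n, d) -> (d,)."""
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * contextual_embeddings).sum(axis=0) / w.sum()

def principal_directions(sentence_matrix):
    """Unit-norm principal directions as rows, ordered by decreasing singular value."""
    centered = sentence_matrix - sentence_matrix.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt

def remove_projections(sentence_matrix, directions):
    """Subtract each sentence's projection onto the given orthonormal directions."""
    return sentence_matrix - sentence_matrix @ directions.T @ directions

rng = np.random.default_rng(0)
# Toy stand-in for CE: 8 sentences, 4 words each, 6-dimensional embeddings.
contextual = rng.normal(size=(8, 4, 6))
# Toy unigram probabilities for the words of each sentence.
word_probs = rng.uniform(1e-5, 1e-2, size=(8, 4))

sentences = np.stack([
    weighted_sentence_embedding(ce, smooth_frequency_weight(p))
    for ce, p in zip(contextual, word_probs)
])

# The slice mirrors the "last few" wording above; the number of
# components (2) is an arbitrary choice for this toy example.
directions = principal_directions(sentences)[-2:]
cleaned = remove_projections(sentences, directions)
print(cleaned.shape)
```

After `remove_projections`, every sentence embedding is orthogonal to the selected directions, which is the property the NR step relies on regardless of how many components are dropped.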

Limitations

First, because NoPPA is simple, its sentence embedding retains most properties of the word embedding it uses. Thus, if the word embedding is anisotropic (Reimers and Gurevych, 2019), the NoPPA sentence embedding will be anisotropic as well. In other words, NoPPA may not be suitable for information retrieval systems without post-processing such as whitening (Huang et al., 2021).
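As a rough illustration of the post-processing mentioned here, the sketch below applies a standard PCA-whitening transform to a toy set of anisotropic vectors. Whether this matches the exact procedure of Huang et al. (2021) is an assumption; the function names and the toy data are hypothetical.

```python
import numpy as np

def fit_whitening(embeddings):
    """Estimate a mean and a linear map that give the data unit covariance."""
    mean = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mean).T)          # (d, d) sample covariance
    u, s, _ = np.linalg.svd(cov)                 # cov = u @ diag(s) @ u.T
    transform = u @ np.diag(1.0 / np.sqrt(s))    # rescale each principal axis
    return mean, transform

def apply_whitening(embeddings, mean, transform):
    """Center the embeddings and map them through the whitening transform."""
    return (embeddings - mean) @ transform

rng = np.random.default_rng(0)
# Toy anisotropic "sentence embeddings": one axis dominates the variance.
toy = rng.normal(size=(200, 4)) * np.array([10.0, 1.0, 0.5, 0.1]) + 3.0

mean, transform = fit_whitening(toy)
whitened = apply_whitening(toy, mean, transform)

# After whitening, the sample covariance is approximately the identity.
print(np.round(np.cov(whitened.T), 3))
```

In a retrieval setting, the mean and transform would be fitted once on a corpus of sentence embeddings and then applied to both queries and documents before computing cosine similarity.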

Second, evaluating $\log_{2}(\cdot)$, which NoPPA uses as its non-linear function, relies on relatively complex algorithms in numerical computing libraries. Thus, future research can explore other, faster non-linear kernels.

Appendix A More Visualization Examples

We provide two more examples of non-parametric self-attention results to illustrate how NoPPA utilizes common phrases and word-level causality.

In Figure 4, NoPPA identifies the phrases "car crashed into", "front window", and "broken into pieces", which are widely used in daily conversation. NoPPA also detects the causation between the word "crashed" and the word "broken" with a score of 4.65%; the only words with higher weight scores are those in "window was broken into pieces".

In Figure 5, NoPPA identifies the two common phrases "face trouble" and "think it over", as well as the causation between the word "trouble" and the word "think". More precisely, the word "think" assigns a weight of 2.62% to the word "trouble", which is the highest score among the five words in the first part of the sentence.

Appendix B Hyper-Parameters for Each Dataset

Table 2 records the best hyper-parameter setting found by our Bayesian search for each dataset.

Bibliography

Almarwani et al. (2019) Nada Almarwani, Hanan Aldarmaki, and Mona Diab. 2019. Efficient sentence embedding using discrete cosine transform. arXiv preprint arXiv:1909.03104.
Arora et al. (2016) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399.
Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.
Barkan et al. (2020) Oren Barkan, Noam Razin, Itzik Malkiel, Ori Katz, Avi Caciularu, and Noam Koenigstein. 2020. Scalable attentive sentence pair modeling via distilled sentence embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3235–3242.
Cheng (2021) Xingyi Cheng. 2021. Dual-view distilled BERT for sentence embedding. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2151–2155.
Conneau and Kiela (2018) Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449.
Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.