Semantic Hilbert Space for Text Representation Learning

Benyou Wang; Qiuchi Li; Massimo Melucci; Dawei Song

arXiv:1902.09802·cs.CL·February 27, 2019


TL;DR

This paper introduces a novel Semantic Hilbert Space framework using complex-valued vectors for non-linear semantic composition, improving text representation and classification accuracy over traditional linear models.

Contribution

It proposes a new non-linear semantic composition model on a Semantic Hilbert Space with an end-to-end neural network, enhancing text understanding and classification.

Findings

1. Effective on six benchmarking datasets
2. Demonstrates robustness and self-explanation capabilities
3. Outperforms linear models in semantic tasks

Abstract

Capturing the meaning of sentences has long been a challenging task. Current models tend to apply linear combinations of word features to conduct semantic composition for larger-granularity units, e.g. phrases, sentences, and documents. However, the semantic linearity does not always hold in human language. For instance, the meaning of the phrase “ivory tower” cannot be deduced by linearly combining the meanings of “ivory” and “tower”. To address this issue, we propose a new framework that models different levels of semantic units (e.g. sememe, word, sentence, and semantic abstraction) on a single Semantic Hilbert Space, which naturally admits a non-linear semantic composition by means of a complex-valued vector word representation. An end-to-end neural network (https://github.com/wabyking/qnn) is proposed to implement the framework in the text classification task, and…

Tables (5)

Table 1. Dataset statistics. (CV means that 10-fold cross validation is used for testing performance.)

Dataset	Train	Test	Vocab.	Task	Classes
CR	4k	CV	6k	product reviews	2
MPQA	11k	CV	6k	opinion polarity	2
SUBJ	10k	CV	21k	subjectivity	2
MR	11.9k	CV	20k	movie reviews	2
SST	67k	2.2k	18k	movie reviews	2
TREC	5.4k	0.5k	10k	question classification	6

Table 2. Experimental results in percentage (%). The best-performing value (excluding CNN/LSTM) for each dataset is in bold, and † marks a significant improvement over FastText.

Model	CR	MPQA	MR	SST	SUBJ	TREC
Uni-TFIDF	79.2	82.4	73.7	-	90.3	85.0
Word2vec	79.8	88.3	77.7	79.7	90.9	83.6
FastText (Joulin et al., 2016)	78.9	87.4	76.5	78.8	91.6	81.8
Sent2Vec (Pagliardini et al., 2018)	79.1	87.2	76.3	80.2	91.2	85.8
CaptionRep (Hill et al., 2016a)	69.3	70.8	61.9	-	77.4	72.2
DictRep (Hill et al., 2016b)	78.7	87.2	76.7	-	90.7	81.0
Ours: QPDN	81.0^†	87.0	80.1^†	83.9^†	92.7^†	88.2^†
CNN (Kim, 2014)	81.5	89.4	81.1	88.1	93.6	92.4
BiLSTM (Conneau et al., 2017b)	81.3	88.7	77.5	80.7	89.6	85.2

Table 3. Physical meanings and constraints

Components	DNN	QPDN
Sememe	-	basis vector / basis state $\{w \mid w\in\mathcal{C}^{n}, \|w\|_{2}=1\}$, complete and orthogonal
Word	real vector $(-\infty,\infty)$	unit complex vector / superposition state $\{w \mid w\in\mathcal{C}^{n}, \|w\|_{2}=1\}$
Low-level representation	real vector $(-\infty,\infty)$	density matrix / mixed system $\{\rho \mid \rho=\rho^{*}, \mathrm{tr}(\rho)=1\}$
Abstraction	CNN/RNN $(-\infty,\infty)$	unit complex vector / measurement $\{w \mid w\in\mathcal{C}^{n}, \|w\|_{2}=1\}$
High-level representation	real vector $(-\infty,\infty)$	probabilities / measured probability $(0,1)$

Table 4. Ablation test on SST (accuracy; $\Delta$ is the difference from the full QPDN model)

Setting	SST	$\Delta$
FastText (Joulin et al., 2016)	0.7880	-0.0511
FastText (Joulin et al., 2016) with double-dimension real word vectors	0.7883	-0.0508
fixed amplitude part but trainable phase part	0.8199	-0.0192
replace trainable weights with fixed mean weights	0.8303	-0.0088
replace trainable weights with fixed IDF weights	0.8259	-0.0132
replace trainable projectors with fixed orthogonal ones	0.8171	-0.0220
replace projectors with dense layer	0.8221	-0.0170
QPDN	0.8391	-

Table 5. The learned measurements for the MR dataset. The words are selected from the nearest words to each measurement vector in the Semantic Hilbert Space.

Measurement	Selected neighborhood words
1	change, months, upscale, recently, aftermath
2	compelled, promised, conspire, convince, trusting
3	goo, vez, errol, esperanza, ana
4	ice, heal, blessedly, sustains, make
5	continue, warned, preposterousness, adding, falseness

Equations

$$\ket{w}=\sum_{j=1}^{n}r_{j}e^{i\phi_{j}}\ket{e_{j}} \qquad (1)$$

$$\left|r_{j}^{(1)}e^{i\phi_{j}^{(1)}}+r_{j}^{(2)}e^{i\phi_{j}^{(2)}}\right|^{2}=\left|r_{j}^{(1)}\right|^{2}+\left|r_{j}^{(2)}\right|^{2}+2r_{j}^{(1)}r_{j}^{(2)}\cos(\phi_{j}^{(1)}-\phi_{j}^{(2)}) \qquad (2)$$

$$\rho=\sum_{i}p(i)\ket{w_{i}}\bra{w_{i}} \qquad (3)$$

$$p_{i}=\mathrm{tr}(P_{i}\rho) \qquad (4)$$

$$\begin{aligned}r_{j}e^{i\phi_{j}}&=r_{j}^{(1)}e^{i\phi_{j}^{(1)}}+r_{j}^{(2)}e^{i\phi_{j}^{(2)}}\\&=\sqrt{\left|r_{j}^{(1)}\right|^{2}+\left|r_{j}^{(2)}\right|^{2}+2r_{j}^{(1)}r_{j}^{(2)}\cos(\phi_{j}^{(1)}-\phi_{j}^{(2)})}\\&\quad\times e^{i\arctan\left(\frac{r_{j}^{(1)}\sin(\phi_{j}^{(1)})+r_{j}^{(2)}\sin(\phi_{j}^{(2)})}{r_{j}^{(1)}\cos(\phi_{j}^{(1)})+r_{j}^{(2)}\cos(\phi_{j}^{(2)})}\right)}\end{aligned} \qquad (5)$$


Code & Models

Repositories

wabyking/qnn (official)


Full text

Semantic Hilbert Space for Text Representation Learning

Benyou Wang, Qiuchi Li, Massimo Melucci

University of Padua Via Giovanni Gradenigo, 6/B Padua Italy PD 35121

wang,qiuchili,[email protected]

and

Dawei Song

Beijing Institute of Technology Haidian Beijing China 100081

[email protected]

(2019)

Abstract.

Capturing the meaning of sentences has long been a challenging task. Current models tend to apply linear combinations of word features to conduct semantic composition for larger-granularity units, e.g. phrases, sentences, and documents. However, the semantic linearity does not always hold in human language. For instance, the meaning of the phrase “ivory tower” cannot be deduced by linearly combining the meanings of “ivory” and “tower”. To address this issue, we propose a new framework that models different levels of semantic units (e.g. sememe, word, sentence, and semantic abstraction) on a single Semantic Hilbert Space, which naturally admits a non-linear semantic composition by means of a complex-valued vector word representation. An end-to-end neural network (https://github.com/wabyking/qnn) is proposed to implement the framework in the text classification task, and evaluation results on six benchmarking text classification datasets demonstrate the effectiveness, robustness and self-explanation power of the proposed model. Furthermore, intuitive case studies are conducted to help end users understand how the framework works.

Keywords: text understanding, neural network, quantum theory

† Benyou Wang and Qiuchi Li contributed equally and share co-first authorship.

*Massimo Melucci ([email protected]) is the corresponding author.

Published in: Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. Journal year: 2019. Copyright: iw3c2w3. DOI: 10.1145/3308558.3313516. ISBN: 978-1-4503-6674-8/19/05. CCS concepts: Information systems → Document structure; Information systems → Content analysis and feature selection.

1. Introduction

In natural language understanding, it is crucial, yet challenging, to model sentences and capture their meanings. Essentially, most statistical machine learning models (Hill et al., 2016a; Kiros et al., 2015; Arora et al., 2016; Conneau et al., 2017a; Pagliardini et al., 2017) are built within a linear bottom-up framework, where words are the basic features adopting a low-dimensional vector representation, and a sentence is modeled as a linear combination of individual word vectors. Such linear semantic composition is efficient, but does not always hold in human language. For example, the phrase “ivory tower”, which means “a state of privileged seclusion or separation from the facts and practicalities of the real world”, is not a linear combination of the individual meanings of “ivory” and “tower”. Instead, it carries a new meaning. We are therefore motivated to investigate a new language modeling paradigm to account for such intricate non-linear combination of word meanings.

Drawing inspiration from recent findings in the emerging research area of quantum cognition, which suggest that human cognition (Busemeyer and Bruza, 2012; Aerts et al., 2013; Aerts and Sozzo, 2014), especially language understanding (Bruza et al., 2008, 2009; Wang et al., 2016), exhibits certain non-classical phenomena (i.e. quantum-like phenomena), we propose a theoretical framework, named Semantic Hilbert Space, to formulate quantum-like phenomena in language understanding and to model different levels of semantic units in a unified space.

In Semantic Hilbert Space, we assume that words can be modeled as microscopic particles in superposition states over the basic sememes (i.e. minimum semantic units in linguistics), while a combination of word meanings can be viewed as a mixed system of particles. The Semantic Hilbert Space represents different levels of semantic units, ranging from basic sememes to words and sentences, on a unified complex-valued vector space. This is fundamentally different from existing quantum-inspired neural networks for question answering (Zhang et al., 2018a, b), which are based on a real vector space. In addition, we introduce a new semantic abstraction, named Semantic Measurements, which are also embedded in the same vector space and are trainable to extract high-level features from the mixed system.

As shown in Fig. 1, the Semantic Hilbert Space is built on the basis of quantum probability (QP), which is the probability theory for explaining the uncertainty of quantum superposition. As quantum superposition requires the use of the complex field, Semantic Hilbert Space has complex values and operators. In particular, the probability function is implemented by a unique (complex) density operator.

Semantic Hilbert Space adopts a complex-valued vector representation of unit length, where each component adopts an amplitude-phase form $z=re^{i\phi}$ . We hereby hypothesize that the amplitude $r$ and complex phase $\phi$ can be used to encode different levels of semantics such as lexical-level co-occurrence, hidden sentiment polarity or topic-level semantics. When word vectors are combined, even in a simple complex-valued addition form, the resulting expression will entail a non-linear composition of amplitudes and phases, thus indicating a complicated fusion of different levels of semantics. A more detailed explanation is given in Sec. 3. In this way, the complex-valued word embedding is fundamentally different from existing real-valued word embedding. A series of ablation tests indicate that the complex-valued word embedding can increase performance.

The Semantic Hilbert Space is an abstract representation of our approach to modeling language through QP. At the level of implementation, an efficient and effective computational framework is needed to cope with large text collections. To this end, we propose an end-to-end neural network architecture, which provides the means to train the network components. Each component corresponds to a physical meaning in quantum probability with well-defined mathematical constraints. Moreover, each component is easier to understand than the kernels in convolutional neural networks and the cells in recurrent neural networks.

The network proposed in this paper is evaluated on six benchmarking datasets for text classification and achieves a steady increase over existing models. Moreover, it is shown that the proposed network is advantageous due to its high robustness and self-explanation capability.

2. Semantic Hilbert Space

The mathematical foundation of Quantum Theory is established on a Hilbert Space over the complex field. In order to borrow the underlying mathematical formalism of quantum theory for language understanding, it is necessary to build such a Hilbert Space for language representation. In this study, we build a Semantic Hilbert Space $\mathcal{H}$ over the complex field. As is illustrated in Fig. 1, multiple levels of semantic units are modeled on this common Semantic Hilbert Space. In the rest of this section, the semantic units under modeling are introduced separately.

We follow the standard Dirac notation for Quantum Theory. A unit vector $\vec{u}$ and its conjugate transpose are denoted as a ket $\ket{u}$ and a bra $\bra{u}$, respectively. The inner product and outer product of two unit vectors $\vec{u}$ and $\vec{v}$ are denoted as $\braket{u}{v}$ and $\ket{u}\bra{v}$, respectively.

2.1. Sememes

Sememes are the minimal non-separable semantic units of word meanings in language universals (Goddard and Wierzbicka, 1994). For example, the word “blacksmith” is composed of the sememes “human”, “occupation”, “metal” and “industrial”. We assume that the Semantic Hilbert Space $\mathcal{H}$ is spanned by a set of orthogonal basis states $\{\ket{e_{j}}\}_{j=1}^{n}$ corresponding to a finite closed set of sememes $\{e_{j}\}_{j=1}^{n}$. In the quantum language, the sememes are modeled as basis states, which form the basis for representing any quantum state. In Fig. 1, the axes of the Semantic Hilbert Space correspond to the set of sememe states, and semantic units with larger granularity are represented based on quantum probability.

2.2. Words

The meaning of a word is a combination of sememes. We adopt the concept of superposition to formulate this combination. Essentially, a word $w$ is modeled as a quantum particle in superposition state, represented by a unit-length vector in the Semantic Hilbert Space $\mathcal{H}$ , as can be seen in Fig. 1. It can be written as a linear combination of the basis states for sememes:

$$\ket{w}=\sum_{j=1}^{n}r_{j}e^{i\phi_{j}}\ket{e_{j}}, \qquad (1)$$

where the complex-valued weight $r_{j}e^{i\phi_{j}}$ denotes how strongly the meaning of word $w$ is associated with the sememe $e_{j}$. Here $\{r_{j}\}_{j=1}^{n}$ are non-negative real-valued amplitudes satisfying $\sum_{j=1}^{n}r_{j}^{2}=1$ and $\phi_{j}\in[-\pi,\pi]$ are the corresponding complex phases. Each complex weight can equivalently be written in Cartesian form in the complex plane as $re^{i\phi}=r\cos\phi+ir\sin\phi$.

It is worth noting that the complex phases $\{\phi_{j}\}$ are crucial as they implicitly entail the quantum interference between words. Suppose two words $w_{1}$ and $w_{2}$ are associated with weights $r_{j}^{(1)}e^{i\phi_{j}^{(1)}}$ and $r_{j}^{(2)}e^{i\phi_{j}^{(2)}}$ for the sememe $e_{j}$. The two words in combination are therefore at the state $e_{j}$ with a probability of

$$\left|r_{j}^{(1)}e^{i\phi_{j}^{(1)}}+r_{j}^{(2)}e^{i\phi_{j}^{(2)}}\right|^{2}=\left|r_{j}^{(1)}\right|^{2}+\left|r_{j}^{(2)}\right|^{2}+2r_{j}^{(1)}r_{j}^{(2)}\cos(\phi_{j}^{(1)}-\phi_{j}^{(2)}), \qquad (2)$$

where the term $2r_{j}^{(1)}r_{j}^{(2)}\cos(\phi_{j}^{(1)}-\phi_{j}^{(2)})$ reflects the interference between the two words, whereas the classical case corresponds to the particular case $\phi_{j}^{(1)}=\phi_{j}^{(2)}=0$, in which the right-hand side reduces to $(r_{j}^{(1)}+r_{j}^{(2)})^{2}$.
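The effect of the phase difference on the interference term can be checked numerically. The following is a minimal NumPy sketch, not code from the paper's repository; the function names and the amplitude values 0.6 and 0.3 are illustrative assumptions.

```python
import numpy as np

def combined_probability(r1, phi1, r2, phi2):
    """Squared modulus of the sum of two complex weights r * exp(i * phi)."""
    z = r1 * np.exp(1j * phi1) + r2 * np.exp(1j * phi2)
    return np.abs(z) ** 2

def interference_expansion(r1, phi1, r2, phi2):
    """Same quantity written as two squared amplitudes plus the interference term."""
    return r1**2 + r2**2 + 2 * r1 * r2 * np.cos(phi1 - phi2)

r1, r2 = 0.6, 0.3  # illustrative amplitudes, not taken from the paper

same_phase = combined_probability(r1, 0.0, r2, 0.0)        # (r1 + r2)^2 = 0.81
opposite = combined_probability(r1, 0.0, r2, np.pi)        # (r1 - r2)^2 = 0.09
quadrature = combined_probability(r1, 0.0, r2, np.pi / 2)  # r1^2 + r2^2 = 0.45
```

With equal phases the interference is fully constructive, with a phase difference of $\pi$ it is fully destructive, and with a phase difference of $\pi/2$ the interference term vanishes.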

2.3. Semantic Compositions

As is illustrated in Fig. 1, we view a word composition (e.g. a sentence) as a bag of words (Harris, 1954), each of which is modeled as a particle in superposition state on the Semantic Hilbert Space $\mathcal{H}$. To obtain the semantic composition of words, we leverage the concept of quantum mixture and formulate the word composition as a mixed system composed of the word superposition states. The system is in a mixed state represented by an $n$-by-$n$ density matrix $\rho$ on $\mathcal{H}$, which is positive semi-definite with trace 1. It is computed as follows:

$$\rho=\sum_{i}p(i)\ket{w_{i}}\bra{w_{i}}, \qquad (3)$$

where $\ket{w_{i}}$ denotes the superposition state of the $i$ -th word and $p(i)$ is the classical probability of the state $\ket{w_{i}}$ with $\sum_{i}p(i)=1$ . It determines the contribution of the word $w_{i}$ to the overall semantics.

The complex-valued density matrix $\rho$ can be seen as a non-classical distribution of sememes in $\mathcal{H}$. Its diagonal elements are real and form a classical distribution of sememes, while its complex-valued off-diagonal entries encode the interplay between sememes, which in turn gives rise to the interference between words. A density matrix assigns a probability value to any state on $\mathcal{H}$ such that the values for any set of orthogonal states sum up to 1 (Gleason, 1957). Hence it is visualized as an ellipsoid in Fig. 1, assigning a quantum probability to a unit vector with the intersection length.
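The properties of the density matrix stated above (unit trace, positive semi-definiteness, a real diagonal that forms a classical distribution) can be illustrated with a small NumPy sketch. The helper names, the random word states and the mixture weights are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_word_state(n, rng):
    """A random unit-length complex vector in amplitude-phase form (toy stand-in for a word)."""
    r = rng.random(n)
    phi = rng.uniform(-np.pi, np.pi, n)
    w = r * np.exp(1j * phi)
    return w / np.linalg.norm(w)

def density_matrix(states, probs):
    """rho = sum_i p(i) |w_i><w_i|, with |w><w| built as outer(w, conj(w))."""
    return sum(p * np.outer(w, w.conj()) for w, p in zip(states, probs))

n = 4                                   # number of sememes (toy value)
states = [random_word_state(n, rng) for _ in range(3)]
probs = np.array([0.5, 0.3, 0.2])       # classical mixture weights, summing to 1
rho = density_matrix(states, probs)

trace = np.trace(rho).real              # should be 1
diagonal = np.diag(rho).real            # classical distribution over sememes
eigenvalues = np.linalg.eigvalsh(rho)   # all >= 0 for a positive semi-definite matrix
```

The off-diagonal entries of `rho` are in general complex, which is where the interplay between sememes is stored.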

2.4. Semantic Measurements

As a non-classical probability distribution, a sentence density matrix carries rich information and in particular it contains all the information about a quantum system. In order to extract the relevant information to a concrete task from the semantic composition, we build a set of measurements and compute the probability that the mixed system falls onto each of the measurements as a high-level abstraction of the semantic composition.

Suppose our proposed semantic measurements are associated with a set of measurement projectors $\{P_{i}\}_{i=1}^{k}$. According to Born’s rule (Born, 1926), applying the measurement projector $P_{i}$ onto the sentence density matrix $\rho$ yields the following result:

$$p_{i}=\mathrm{tr}(P_{i}\rho). \qquad (4)$$

Here, we only consider pure states as measurement states, i.e. $P_{i}=\ket{v_{i}}\bra{v_{i}}$. Moreover, we ignore the constraints on the measurement states $\{\ket{v_{i}}\}_{i=1}^{k}$ (i.e. orthogonality or completeness), but keep them trainable, so that the most suitable measurements can be determined automatically by the data in a concrete task, such as classification or regression. In this way, the trainable semantic measurements can be understood as an approach similar to supervised dimensionality reduction (Fisher, 1936), but in a quantum probability framework with complex values.
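For a rank-one projector $P=\ket{v}\bra{v}$, the trace in Born's rule reduces to the quadratic form $\bra{v}\rho\ket{v}$. The sketch below illustrates this on a hand-built two-dimensional mixed state; the function name `measure` and the example states are assumptions chosen for illustration.

```python
import numpy as np

def measure(rho, v):
    """Born's rule for P = |v><v|: tr(P rho) = <v|rho|v> (np.vdot conjugates its first argument)."""
    v = v / np.linalg.norm(v)
    return float(np.real(np.vdot(v, rho @ v)))

e1 = np.array([1.0, 0.0], dtype=complex)
e2 = np.array([0.0, 1.0], dtype=complex)
plus = (e1 + e2) / np.sqrt(2)

# Equal mixture of the basis state |e1> and the superposition state |+>.
rho = 0.5 * np.outer(e1, e1.conj()) + 0.5 * np.outer(plus, plus.conj())

p_e1 = measure(rho, e1)      # 0.75
p_e2 = measure(rho, e2)      # 0.25
p_plus = measure(rho, plus)  # 0.75
# Explicit trace form, for comparison with the quadratic form above.
p_plus_trace = float(np.real(np.trace(np.outer(plus, plus.conj()) @ rho)))
```

The probabilities for the orthonormal pair `e1`, `e2` sum to 1, whereas `e1` and `plus` are not orthogonal, so their probabilities need not sum to 1. This matches the setting in which the measurement states are left unconstrained.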

3. Quantum Probability Driven Network

In order to implement the proposed framework, we further propose an end-to-end neural network based on quantum probability. Fig. 2 shows the architecture of the proposed Quantum Probability Driven Network (QPDN). The embedding layer, which is composed of a unit complex-valued embedding and a term-weight lookup table, captures the basic lexical features. The mixture layer is designed to combine the low-level bag-of-word features with an additive complex-valued outer product operation. The measurement layer adopts a set of trainable semantic measurements to extract the higher-level features for the final linear classifier. In the following we will introduce the architecture layer by layer.

3.1. Embedding Layer

The parameters of the embedding layer are $\{R,\Phi,\Pi\}$, denoting the amplitude embedding, the phase embedding, and the term-weight lookup table, respectively. Eq. 1 expresses a quantum representation as a unit-length, complex-valued vector representation for a word $w$, i.e. $\ket{w}=[r_{1}e^{i\phi_{1}},r_{2}e^{i\phi_{2}},\ldots,r_{n}e^{i\phi_{n}}]^{T}$. The term-weight lookup table is used to weight words for semantic combinations, which will be described in the next subsection. During training, word embeddings need to be normalized to unit length after each batch.
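A minimal NumPy sketch of the amplitude-phase lookup and the unit-length renormalization described above is given below. The array names follow the symbols $R$ and $\Phi$, while the toy sizes, the random initialization and the helper names are assumptions and do not reproduce the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 5, 4  # toy sizes, chosen only for illustration

# Amplitude embedding R and phase embedding Phi, one row per word.
R = rng.random((vocab_size, n))
Phi = rng.uniform(-np.pi, np.pi, (vocab_size, n))

def normalize_amplitudes(R):
    """Rescale each row so that sum_j r_j^2 = 1, giving unit-length word states."""
    return R / np.linalg.norm(R, axis=1, keepdims=True)

def word_state(word_id, R, Phi):
    """|w> = [r_1 e^{i phi_1}, ..., r_n e^{i phi_n}]^T for one word."""
    return R[word_id] * np.exp(1j * Phi[word_id])

R = normalize_amplitudes(R)  # in training this would be repeated after each batch
w = word_state(2, R, Phi)
```

Because the phases only rotate each component on the unit circle, normalizing the amplitudes alone is enough to keep the complex vector at unit length.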

This representation allows for a non-linear composition of amplitudes and phases in its mathematical form. Suppose two words $w_{1}$ and $w_{2}$ are of weights $r_{j}^{(1)}e^{i\phi_{j}^{(1)}}$ and $r_{j}^{(2)}e^{i\phi_{j}^{(2)}}$ for the $j^{th}$ dimension (corresponding to the $j^{th}$ sememe). The combination of $w_{1}$ and $w_{2}$ gives rise to a weight $r_{j}e^{i\phi_{j}}$ for the $j^{th}$ dimension computed as

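$$\begin{aligned}r_{j}e^{i\phi_{j}}&=r_{j}^{(1)}e^{i\phi_{j}^{(1)}}+r_{j}^{(2)}e^{i\phi_{j}^{(2)}}\\&=\sqrt{\left|r_{j}^{(1)}\right|^{2}+\left|r_{j}^{(2)}\right|^{2}+2r_{j}^{(1)}r_{j}^{(2)}\cos(\phi_{j}^{(1)}-\phi_{j}^{(2)})}\\&\quad\times e^{i\arctan\left(\frac{r_{j}^{(1)}\sin(\phi_{j}^{(1)})+r_{j}^{(2)}\sin(\phi_{j}^{(2)})}{r_{j}^{(1)}\cos(\phi_{j}^{(1)})+r_{j}^{(2)}\cos(\phi_{j}^{(2)})}\right)}, \qquad (5)\end{aligned}$$

where both $r_{j}$ and $\phi_{j}$ are non-linear combinations of $r_{j}^{(1)}$, $r_{j}^{(2)}$, $\phi_{j}^{(1)}$ and $\phi_{j}^{(2)}$. If the amplitudes and phases are associated with different levels of information, the amplitude-phase representation then naturally gives rise to a non-linear fusion of information.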
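The polar form of the combined weight follows from the Cartesian form $re^{i\phi}=r\cos\phi+ir\sin\phi$. The derivation below is a short worked sketch added for clarity, using only the symbols already defined in this section.

```latex
\begin{align*}
r_{j}e^{i\phi_{j}}
  &= \big(r_{j}^{(1)}\cos\phi_{j}^{(1)} + r_{j}^{(2)}\cos\phi_{j}^{(2)}\big)
   + i\,\big(r_{j}^{(1)}\sin\phi_{j}^{(1)} + r_{j}^{(2)}\sin\phi_{j}^{(2)}\big), \\
r_{j}^{2}
  &= \big(r_{j}^{(1)}\big)^{2} + \big(r_{j}^{(2)}\big)^{2}
   + 2 r_{j}^{(1)} r_{j}^{(2)}
     \big(\cos\phi_{j}^{(1)}\cos\phi_{j}^{(2)} + \sin\phi_{j}^{(1)}\sin\phi_{j}^{(2)}\big) \\
  &= \big(r_{j}^{(1)}\big)^{2} + \big(r_{j}^{(2)}\big)^{2}
   + 2 r_{j}^{(1)} r_{j}^{(2)} \cos\big(\phi_{j}^{(1)} - \phi_{j}^{(2)}\big), \\
\tan\phi_{j}
  &= \frac{r_{j}^{(1)}\sin\phi_{j}^{(1)} + r_{j}^{(2)}\sin\phi_{j}^{(2)}}
          {r_{j}^{(1)}\cos\phi_{j}^{(1)} + r_{j}^{(2)}\cos\phi_{j}^{(2)}}.
\end{align*}
```

Note that the arctangent of this ratio recovers $\phi_{j}$ directly only when the real part of the sum is positive; otherwise the phase differs from it by $\pi$, which a two-argument arctangent would resolve.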

3.2. Mixture Layer

A sentence is modeled as a density matrix, constructed as in Sec. 2.3. Instead of using uniform weights in Eq. 3, a word-sensitive weight is used for each word, as is common in IR, e.g. inverse document frequency (IDF) as a word-dependent weight in the TF-IDF scheme (Sparck Jones, 1972).

In order to guarantee the unit trace of the density matrix, the word weights taken from the lookup table for a sentence are normalized to probability values through a softmax operation: $p(i)=e^{\pi(w_{i})}/\sum_{j=1}^{m}e^{\pi(w_{j})}$, where $m$ is the number of words in the sentence. Compared to the IDF weight, the normalized weight for a specific word in our approach is not static but updated adaptively in the training phase. Even in the inference/test phase, the actual term weight, i.e. $p(i)$, is not static either, but depends heavily on the neighboring context words through the non-linear softmax function.
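The context dependence of the normalized term weight can be seen in a small sketch. The lookup-table values and the example words below are hypothetical and serve only to show how the same word receives a different weight in a different sentence.

```python
import numpy as np

def term_weights(pi_values):
    """Softmax over the lookup-table scores of the words in one sentence."""
    shifted = pi_values - np.max(pi_values)  # shift for numerical stability only
    e = np.exp(shifted)
    return e / e.sum()

# Hypothetical lookup-table scores pi(w); in the model these are trainable.
pi_table = {"not": 1.5, "a": -1.0, "good": 2.0, "movie": 0.5}

short_sentence = ["good", "movie"]
long_sentence = ["not", "a", "good", "movie"]

p_short = term_weights(np.array([pi_table[w] for w in short_sentence]))
p_long = term_weights(np.array([pi_table[w] for w in long_sentence]))

weight_good_short = p_short[short_sentence.index("good")]
weight_good_long = p_long[long_sentence.index("good")]
```

The score of "good" in the table is the same in both cases, but its normalized weight is smaller in the longer sentence because the softmax shares the probability mass among all words present.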

3.3. Measurement Layer

The measurement layer adopts a set of rank-one measurement projectors $\{\ket{v_{i}}\bra{v_{i}}\}_{i=1}^{k}$, where $\ket{v_{i}}\bra{v_{i}}$ is the outer product of the corresponding state $\ket{v_{i}}$ in Semantic Hilbert Space with itself. Each measurement yields one probability for its measurement state, $q_{j}=\mathrm{tr}(\rho\ket{v_{j}}\bra{v_{j}})$, and together these form a vector $\vec{q}=[q_{1},q_{2},\ldots,q_{k}]$. Similarly to the word vectors, the states $\ket{v_{i}}$ are represented as unit states and normalized after several batches.

3.4. Dense Layer

The vector $\vec{q}$ from the measurement layer consists of $k$ non-negative scalar numbers and is used to infer the label for a given sentence. A dense layer with softmax activation is adopted after the measurement layer to obtain a classification probability distribution, i.e. $\widehat{\vec{y}}=\mbox{softmax}(\vec{q}\cdot W)$. The loss is the cross-entropy between $\widehat{\vec{y}}$ and the one-hot label $\vec{y}$.
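The four layers can be chained into a single forward pass. The following NumPy sketch follows the layer descriptions in this section under stated assumptions: parameters are randomly initialized, the sizes are toy values, and training (gradients, renormalization schedule, regularization) is omitted. It is not the implementation in the paper's repository.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n, k, num_classes = 6, 4, 3, 2  # toy sizes for illustration

def unit_rows(A):
    """Normalize every row to unit L2 norm (works for real and complex arrays)."""
    return A / np.linalg.norm(A, axis=1, keepdims=True)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Embedding layer parameters {R, Phi, Pi}.
R = unit_rows(rng.random((vocab_size, n)))
Phi = rng.uniform(-np.pi, np.pi, (vocab_size, n))
Pi = rng.normal(size=vocab_size)
# Measurement states |v_i> (unit complex vectors) and dense-layer weights W.
V = unit_rows(rng.normal(size=(k, n)) + 1j * rng.normal(size=(k, n)))
W = rng.normal(size=(k, num_classes))

def forward(word_ids):
    # Embedding layer: complex word states, shape (m, n).
    states = R[word_ids] * np.exp(1j * Phi[word_ids])
    # Mixture layer: softmax term weights and rho = sum_i p(i) |w_i><w_i|.
    p = softmax(Pi[word_ids])
    rho = np.einsum("i,ij,ik->jk", p, states, states.conj())
    # Measurement layer: q_i = <v_i| rho |v_i>.
    q = np.real(np.einsum("ij,jk,ik->i", V.conj(), rho, V))
    # Dense layer: class distribution softmax(q W).
    y_hat = softmax(q @ W)
    return rho, q, y_hat

rho, q, y_hat = forward(np.array([0, 3, 5]))
```

Each entry of `q` lies between 0 and 1 because `rho` has unit trace and every measurement state has unit length, so the dense layer receives bounded inputs regardless of sentence length.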

4. Experiments

Our model was evaluated on six datasets for text classification: CR customer review (Hu and Liu, 2014), MPQA opinion polarity (Wiebe et al., 2005), SUBJ sentence subjectivity (Pang and Lee, 2005), MR movie review (Pang and Lee, 2005), SST binary sentiment classification (Socher et al., 2013), and TREC question classification (Li and Roth, 2002). Their statistics are shown in Tab. 1.

We compared the proposed QPDN with various models, including Uni-TFIDF, Word2vec, FastText (Joulin et al., 2016) and Sent2Vec (Pagliardini et al., 2018) as unsupervised representation learning baselines, CaptionRep (Hill et al., 2016a) and DictRep (Hill et al., 2016b) as supervised representation learning baselines, as well as CNN (Kim, 2014) and BiLSTM (Conneau et al., 2017b) for advanced deep neural networks. In Tab. 2, we reported the classification accuracy values of these models from the original papers.

We used GloVe word vectors (Pennington et al., 2014) with 50, 100, 200 and 300 dimensions, respectively. The amplitude embedding values are initialized by L2-norm, while the phases in the complex-valued embedding are randomly initialized between $-\pi$ and $\pi$. We searched for the best performance in a parameter pool, which contains a learning rate in $\{1\text{E-}3,1\text{E-}4,1\text{E-}5,1\text{E-}6\}$, an L2-regularization ratio in $\{1\text{E-}5,1\text{E-}6,1\text{E-}7,1\text{E-}8\}$, a batch size in $\{8,16,32,64,128\}$, and the number of measurements in $\{5,10,20,50,100,200\}$.

The main parameters in our model are $R$ and $\Phi$. Since both of them are $n\times|V|$ in shape, the number of parameters is roughly two times that of FastText (Mikolov et al., 2013). For the other parameters, $\Pi$ is $|V|\times 1$, $\{\ket{v_{i}}\}_{i=1}^{k}$ is $k\times 2n$, while $W$ is $k\times|L|$, with $L$ being the label set. Apart from word embeddings, the model is limited in scale, and hence robust, with $k\times 2n+n\times|V|+k\times|L|$ parameters.
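To give a sense of scale, the shapes listed above can be turned into per-component parameter counts. In the sketch below the vocabulary size and number of classes are the TREC values from Tab. 1, while $n=50$ and $k=10$ are illustrative picks, not the configuration reported as best; the function name is hypothetical.

```python
def qpdn_parameter_counts(vocab_size, n, k, num_labels):
    """Parameter counts per component, using the shapes stated in the text."""
    return {
        "R": n * vocab_size,         # amplitude embedding, n x |V|
        "Phi": n * vocab_size,       # phase embedding, n x |V|
        "Pi": vocab_size,            # term-weight lookup table, |V| x 1
        "measurements": k * 2 * n,   # k complex states, stored as 2n reals each
        "W": k * num_labels,         # dense layer, k x |L|
    }

counts = qpdn_parameter_counts(vocab_size=10_000, n=50, k=10, num_labels=6)
total = sum(counts.values())                              # 1,011,060
embedding_share = (counts["R"] + counts["Phi"]) / total   # about 0.989
```

Under these assumptions almost all parameters sit in the two embedding tables, and the measurement and dense layers together add only about a thousand.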

The results in Tab. 2 demonstrate the effectiveness of our model, with improved classification accuracies over some strong supervised and unsupervised representation baselines on all datasets except MPQA. In comparison with more advanced models including BiLSTM and CNN, our model generally performed better than BiLSTM, with higher accuracy on the multi-class classification dataset (TREC) and three binary text classification datasets (MR, SST & SUBJ). However, it under-performed CNN on all 6 datasets, with a difference of over 2% on 3 of them (MPQA, SST & TREC), probably because it uses fewer parameters and simpler structures.

We argue that QPDN achieved a good balance between effectiveness and efficiency, since it outperforms BiLSTM on four of the six datasets (Tab. 2).

5. Discussions

This section discusses the power of self-explanation and conducts an ablation test to examine the usefulness of important components of the network, especially the complex-valued word embedding.

Self-explanation Components

As is shown in Tab. 3, all components in our model have a clear physical meaning corresponding to quantum probability, whereas a classical Deep Neural Network (DNN) cannot readily explain the role each component plays in the network. Essentially, we constructed a bottom-up framework to represent each level of semantic units on a uniform Semantic Hilbert Space, from the minimum semantic unit, i.e. sememe, to the sentence representation. The framework was operationalized through superposition, mixture and semantic measurements.

Ablation Test

An ablation test was conducted to examine how each component influences the final performance of QPDN. In particular, a double-length real word embedding network was implemented to examine the use of complex-valued word embedding, while mean weights and IDF weights were compared with our proposed trainable weights. A set of non-trainable orthogonal projectors and a dense layer on top of the sentence density matrix were implemented to analyze the effect of trainable semantic measurements.

Due to limited space, we only report the ablation test result for SST, which is the largest and hence the most representative dataset. We used 100-dimensional real-valued word vectors and 50-dimensional complex-valued vectors for the models in the ablation test. All models under ablation were comparable in terms of time cost. Tab. 4 shows that each component plays an important role in the QPDN model. In particular, replacing the complex embedding with a double-dimension real word embedding led to a drop of about 5 percentage points in accuracy (0.7883 versus 0.8391 in Tab. 4), which indicates that the benefit of the complex-valued word embedding does not merely come from doubling the number of parameters.

The comparison with IDF and mean weights showed that the data-driven scheme gave rise to high-quality word weights. The comparison with non-trainable projectors and directly applying a dense layer on the density matrix showed that trainable measurements bring benefits to the network.

Discriminative Semantic Directions

In order to better understand the well-trained measurement projectors, we obtained the top 10 nearest words in the complex-valued vector space for each trained measurement state (like $\ket{v_{i}}$ ), using KD tree (Bentley, 1975). Due to limited space, we selected five measurements from the trained model for the MR dataset, and selected words from the top 10 nearest words to each measurement. As can be seen in Tab. 5, the first measurement was roughly about time, the second one was related to verb words which mainly mean ‘motivating others’. The third measurement grouped uncommon non-English words together. The last two measurements also grouped words sharing similar meanings. It is therefore interesting to see that relevant words can somehow be grouped together into certain topics during the training process, which may be discriminative for the given task.
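The neighbor lookup for a measurement state can be sketched as follows. This example uses a brute-force distance computation in NumPy instead of the KD tree mentioned above; a KD tree over the stacked real and imaginary parts would return the same ranking, since the Euclidean distance between complex vectors equals the distance between their stacked real representations. The vocabulary, embeddings and measurement vector are placeholders, not values from the trained model.

```python
import numpy as np

def nearest_words(v, embeddings, vocab, top=3):
    """Rank words by Euclidean distance between complex vectors (brute force)."""
    dists = np.linalg.norm(embeddings - v, axis=1)
    order = np.argsort(dists)[:top]
    return [vocab[i] for i in order]

vocab = ["alpha", "beta", "gamma", "delta"]  # placeholder words, not from the paper
embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0j],
    [1.0, 1.0j],
    [-1.0, 0.0],
], dtype=complex)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-length states

v = np.array([0.9, 0.1j])                    # placeholder measurement state
v = v / np.linalg.norm(v)

neighbours = nearest_words(v, embeddings, vocab, top=2)  # ['alpha', 'gamma']
```

Note that "delta" is the farthest word here even though it differs from "alpha" only by a global sign, so a plain Euclidean lookup is sensitive to global phase.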

6. Conclusions

In order to better model the non-linearity of word semantic composition, we have developed a quantum-inspired framework that models different granularities of semantic units on the same Semantic Hilbert Space, and implemented this framework as an end-to-end text classification network. The network showed promising performance on six benchmarking text datasets, in terms of effectiveness, robustness and self-explanation ability. Moreover, the complex-valued word embedding, which inherently achieves a non-linear combination of word meanings, brought benefits to the classification accuracy in a comprehensive ablation study.

This work is among the first steps towards applying the quantum probabilistic framework to text modeling. We believe it is a promising direction. In the future, we would like to further extend this work by considering deeper and more complicated structures, such as attention or memory mechanisms in language, in order to investigate related quantum-like phenomena on textual data and to provide more intuitive insights. Additionally, a Semantic Hilbert Space in a tensor space is also worth exploring, as in Zhang et al. (2018b), which may provide more interesting insights for current communities.

ACKNOWLEDGEMENT

This work is supported by the Quantum Access and Retrieval Theory (QUARTZ) project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321, and by the National Key Research and Development Program of China (grant No. 2018YFC0831700), Natural Science Foundation of China (grant No. U1636203), and Major Project of Zhejiang Lab (grant No. 2019DH0ZX01).

Bibliography (33)

Aerts et al. (2013) Diederik Aerts, Liane Gabora, and Sandro Sozzo. 2013. Concepts and Their Dynamics: A Quantum-Theoretic Modeling of Human Thought. Topics in Cognitive Science (Sept. 2013). https://doi.org/10.1111/tops.12042 arXiv:1206.1069.
Aerts and Sozzo (2014) Diederik Aerts and Sandro Sozzo. 2014. Quantum Entanglement in Concept Combinations. International Journal of Theoretical Physics 53, 10 (Oct. 2014), 3587–3603. https://doi.org/10.1007/s10773-013-1946-z
Arora et al. (2016) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. (Nov. 2016). https://openreview.net/forum?id=SyK00v5xx
Bentley (1975) Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509–517.
Born (1926) Max Born. 1926. Zur Quantenmechanik der Stoßvorgänge. Zeitschrift für Physik 37, 12 (Dec. 1926), 863–867. https://doi.org/10.1007/BF01397477
Bruza et al. (2009) Peter Bruza, Kirsty Kitto, Douglas Nelson, and Cathy McEvoy. 2009. Is there something quantum-like about the human mental lexicon? Journal of Mathematical Psychology 53, 5 (2009), 362–377.
Bruza et al. (2008) Peter D Bruza, Kirsty Kitto, Douglas McEvoy, and Cathy McEvoy. 2008. Entangling words and meaning. (2008).