TL;DR
This paper introduces a novel Semantic Hilbert Space framework using complex-valued vectors for non-linear semantic composition, improving text representation and classification accuracy over traditional linear models.
Contribution
It proposes a new non-linear semantic composition model on a Semantic Hilbert Space with an end-to-end neural network, enhancing text understanding and classification.
Findings
Effective on six benchmarking datasets
Demonstrates robustness and self-explanation capabilities
Outperforms linear models in semantic tasks
Table 1. Statistics of the six datasets (CV: evaluated by cross-validation, no fixed test split).
| Dataset | Train | Test | Vocab. | Task | Classes |
| --- | --- | --- | --- | --- | --- |
| CR | 4k | CV | 6k | product reviews | 2 |
| MPQA | 11k | CV | 6k | opinion polarity | 2 |
| SUBJ | 10k | CV | 21k | subjectivity | 2 |
| MR | 11.9k | CV | 20k | movie reviews | 2 |
| SST | 67k | 2.2k | 18k | movie reviews | 2 |
| TREC | 5.4k | 0.5k | 10k | question classification | 6 |
Table 2. Classification accuracy (%) of QPDN and the baseline models ("-" marks a result that is not reported).
| Model | CR | MPQA | MR | SST | SUBJ | TREC |
| --- | --- | --- | --- | --- | --- | --- |
| Uni-TFIDF | 79.2 | 82.4 | 73.7 | - | 90.3 | 85.0 |
| Word2vec | 79.8 | 88.3 | 77.7 | 79.7 | 90.9 | 83.6 |
| FastText (Joulin et al., 2016) | 78.9 | 87.4 | 76.5 | 78.8 | 91.6 | 81.8 |
| Sent2Vec (Pagliardini et al., 2018) | 79.1 | 87.2 | 76.3 | 80.2 | 91.2 | 85.8 |
| CaptionRep (Hill et al., 2016a) | 69.3 | 70.8 | 61.9 | - | 77.4 | 72.2 |
| DictRep (Hill et al., 2016b) | 78.7 | 87.2 | 76.7 | - | 90.7 | 81.0 |
| Ours: QPDN | 81.0† | 87.0 | 80.1† | 83.9† | 92.7† | 88.2† |
| CNN (Kim, 2014) | 81.5 | 89.4 | 81.1 | 88.1 | 93.6 | 92.4 |
| BiLSTM (Conneau et al., 2017b) | 81.3 | 88.7 | 77.5 | 80.7 | 89.6 | 85.2 |
Table 3. Physical meaning of each component in a classical DNN and in QPDN.
| Components | DNN | QPDN |
| --- | --- | --- |
| Sememe | - | basis vector / basis state (complete and orthogonal) |
| Word | real vector | unit complex vector / superposition state |
| Low-level representation | real vector | density matrix / mixed system |
| Abstraction | CNN/RNN | unit complex vector / measurement |
| High-level representation | real vector | probabilities / measured probability |
Table 4. Ablation test on SST. The last column is the accuracy difference from the full QPDN.
| Setting | SST accuracy | Δ vs. QPDN |
| --- | --- | --- |
| FastText (Joulin et al., 2016) | 0.7880 | -0.0511 |
| FastText (Joulin et al., 2016) with double-dimension real word vectors | 0.7883 | -0.0508 |
| fixed amplitude part but trainable phase part | 0.8199 | -0.0192 |
| replace trainable weights with fixed mean weights | 0.8303 | -0.0088 |
| replace trainable weights with fixed IDF weights | 0.8259 | -0.0132 |
| non-trainable projectors with fixed orthogonal ones | 0.8171 | -0.0220 |
| replace projectors with dense layer | 0.8221 | -0.0170 |
| QPDN | 0.8391 | - |
Table 5. Selected neighborhood words of five trained measurements (MR dataset).
| Measurement | Selected neighborhood words |
| --- | --- |
| 1 | change, months, upscale, recently, aftermath |
| 2 | compelled, promised, conspire, convince, trusting |
| 3 | goo, vez, errol, esperanza, ana |
| 4 | ice, heal, blessedly, sustains, make |
| 5 | continue, warned, preposterousness, adding, falseness |
Semantic Hilbert Space for Text Representation Learning
Benyou Wang, Qiuchi Li, Massimo Melucci
University of Padua Via Giovanni Gradenigo, 6/B Padua Italy PD 35121
wang,qiuchili,[email protected]
and
Dawei Song
Beijing Institute of Technology Haidian Beijing China 100081
(2019)
Abstract.
Capturing the meaning of sentences has long been a challenging task. Current models tend to apply linear combinations of word features to conduct semantic composition for bigger-granularity units e.g. phrases, sentences, and documents. However, the semantic linearity does not always hold in human language. For instance, the meaning of the phrase “ivory tower” cannot be deduced by linearly combining the meanings of “ivory” and “tower”. To address this issue, we propose a new framework that models different levels of semantic units (e.g. sememe, word, sentence, and semantic abstraction) on a single Semantic Hilbert Space, which naturally admits a non-linear semantic composition by means of a complex-valued vector word representation. An end-to-end neural network (https://github.com/wabyking/qnn) is proposed to implement the framework in the text classification task, and evaluation results on six benchmarking text classification datasets demonstrate the effectiveness, robustness and self-explanation power of the proposed model. Furthermore, intuitive case studies are conducted to help end users to understand how the framework works.
Keywords: text understanding, neural network, quantum theory
† Benyou Wang and Qiuchi Li contribute equally and share the co-first authorship.
*Massimo Melucci ([email protected]) is the corresponding author.
Published in: Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. DOI: 10.1145/3308558.3313516. ISBN: 978-1-4503-6674-8/19/05. CCS concepts: Information systems → Document structure; Information systems → Content analysis and feature selection.
1. Introduction
In natural language understanding, it is crucial, yet challenging, to model sentences and capture their meanings. Essentially, most statistical machine learning models (Hill et al., 2016a; Kiros et al., 2015; Arora et al., 2016; Conneau et al., 2017a; Pagliardini et al., 2017) are built within a linear bottom-up framework, where words are the basic features adopting a low-dimensional vector representation, and a sentence is modeled as a linear combination of individual word vectors. Such linear semantic composition is efficient, but does not always hold in human language. For example, the phrase “ivory tower”, which means “a state of privileged seclusion or separation from the facts and practicalities of the real world”, is not a linear combination of the individual meanings of “ivory” and “tower”. Instead, it carries a new meaning. We are therefore motivated to investigate a new language modeling paradigm to account for such intricate non-linear combination of word meanings.
Drawing inspiration from recent findings in the emerging research area of quantum cognition, which suggest that human cognition (Busemeyer and Bruza, 2012; Aerts et al., 2013; Aerts and Sozzo, 2014), and especially language understanding (Bruza et al., 2008, 2009; Wang et al., 2016), exhibits certain non-classical (i.e. quantum-like) phenomena, we propose a theoretical framework, named Semantic Hilbert Space, to formulate quantum-like phenomena in language understanding and to model different levels of semantic units in a unified space.
In Semantic Hilbert Space, we assume that words can be modeled as microscopic particles in superposition states over the basic sememes (i.e. minimum semantic units in linguistics), while a combination of word meanings can be viewed as a mixed system of particles. The Semantic Hilbert Space represents different levels of semantic units, ranging from basic sememes to words and sentences, on a unified complex-valued vector space. This is fundamentally different from existing quantum-inspired neural networks for question answering (Zhang et al., 2018a, b), which are based on a real vector space. In addition, we introduce a new semantic abstraction, named Semantic Measurements, which are also embedded in the same vector space and are trainable to extract high-level features from the mixed system.
As shown in Fig. 1, the Semantic Hilbert Space is built on the basis of quantum probability (QP), which is the probability theory for explaining the uncertainty of quantum superposition. As quantum superposition requires the use of the complex field, Semantic Hilbert Space has complex values and operators. In particular, the probability function is implemented by a unique (complex) density operator.
Semantic Hilbert Space adopts a complex-valued vector representation of unit length, where each component adopts an amplitude-phase form $z = re^{i\phi}$. We hereby hypothesize that the amplitude and complex phase can be used to encode different levels of semantics such as lexical-level co-occurrence, hidden sentiment polarity or topic-level semantics. When word vectors are combined, even in a simple complex-valued addition form, the resulting expression will entail a non-linear composition of amplitudes and phases, thus indicating a complicated fusion of different levels of semantics. A more detailed explanation is given in Sec. 3. In this way, the complex-valued word embedding is fundamentally different from existing real-valued word embeddings. A series of ablation tests indicates that the complex-valued word embedding can increase performance.
The Semantic Hilbert Space is an abstract representation of our approach to modeling language through QP. At the level of implementation, an efficient and effective computational framework is needed to cope with large text collections. To do so, we propose an end-to-end neural network architecture, which provides means for training the network components. Each component corresponds to a physical meaning of quantum probability with well-defined mathematical constraints. Moreover, each component is easier to understand than the kernels in convolutional neural networks and the cells in recurrent neural networks.
The network proposed in this paper is evaluated on six benchmarking datasets for text classification and achieves a steady increase over existing models. Moreover, it is shown that the proposed network is advantageous due to its high robustness and self-explanation capability.
2. Semantic Hilbert Space
The mathematical foundation of Quantum Theory is established on a Hilbert Space over the complex field. In order to borrow the underlying mathematical formalism of quantum theory for language understanding, it is necessary to build such a Hilbert Space for language representation. In this study, we build a Semantic Hilbert Space over the complex field. As is illustrated in Fig. 1, multiple levels of semantic units are modeled on this common Semantic Hilbert Space. In the rest of this section, the semantic units under modeling are introduced separately.
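We follow the standard Dirac notation for quantum theory. A unit vector $\vec{\mu}$ and its conjugate transpose $\vec{\mu}^{\dagger}$ are denoted as a ket $|\mu\rangle$ and a bra $\langle\mu|$, respectively. The inner product and outer product of two unit vectors $\vec{u}$ and $\vec{v}$ are denoted as $\langle u|v\rangle$ and $|u\rangle\langle v|$, respectively.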
2.1. Sememes
Sememes are the minimal non-separable semantic units of word meanings in language universals (Goddard and Wierzbicka, 1994). For example, the word “blacksmith” is composed of sememes “human”, “occupation”, “metal” and “industrial”. We assume that the Semantic Hilbert Space $\mathcal{H}$ is spanned by a set of orthogonal basis states $\{|e_j\rangle\}_{j=1}^{n}$ corresponding to a finite closed set of sememes $\{e_j\}_{j=1}^{n}$. In the quantum language, the set of sememes is modeled as basis states, which form the basis for representing any quantum state. In Fig. 1, the axes of the Semantic Hilbert Space correspond to the set of sememe states, and semantic units with larger granularity are represented based on quantum probability.
2.2. Words
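The meaning of a word is a combination of sememes. We adopt the concept of superposition to formulate this combination. Essentially, a word $w$ is modeled as a quantum particle in superposition state, represented by a unit-length vector $|w\rangle$ in the Semantic Hilbert Space $\mathcal{H}$, as can be seen in Fig. 1. It can be written as a linear combination of the basis states for sememes:
$$|w\rangle = \sum_{j=1}^{n} r_j e^{i\phi_j} |e_j\rangle \tag{1}$$
where the complex-valued weight $r_j e^{i\phi_j}$ denotes how much the meaning of word $w$ is associated with the sememe $e_j$. Here $\{r_j\}_{j=1}^{n}$ are non-negative real-valued amplitudes satisfying $\sum_{j=1}^{n} r_j^2 = 1$ and $\{\phi_j\}_{j=1}^{n}$ are the corresponding complex phases. The complex weight can also be written in rectangular form on the complex plane as $r_j e^{i\phi_j} = r_j\cos\phi_j + i\, r_j\sin\phi_j$.
It is worth noting that the complex phases $\{\phi_j\}$ are crucial as they implicitly entail the quantum interference between words. Suppose two words $w_1$ and $w_2$ are associated with weights $r_j^{(1)} e^{i\phi_j^{(1)}}$ and $r_j^{(2)} e^{i\phi_j^{(2)}}$ for the sememe $e_j$. The two words in combination are therefore at the state $e_j$ with a probability of
$$\left| r_j^{(1)} e^{i\phi_j^{(1)}} + r_j^{(2)} e^{i\phi_j^{(2)}} \right|^2 = |r_j^{(1)}|^2 + |r_j^{(2)}|^2 + 2 r_j^{(1)} r_j^{(2)} \cos\left(\phi_j^{(1)} - \phi_j^{(2)}\right) \tag{2}$$
where the term $2 r_j^{(1)} r_j^{(2)} \cos(\phi_j^{(1)} - \phi_j^{(2)})$ reflects the interference between the two words, whereas the classical case corresponds to the particular case $\phi_j^{(1)} = \phi_j^{(2)} = 0$.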
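The interference effect can be checked numerically. The sketch below assumes each weight has the amplitude-phase form r·e^{iφ} and compares the squared modulus of the sum of two weights with its expansion into the two squared amplitudes plus the cross term 2·r1·r2·cos(φ1 − φ2). The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import cmath
import math

def combined_probability(r1, phi1, r2, phi2):
    """Squared modulus of the sum of two complex weights r * e^{i*phi}."""
    z = cmath.rect(r1, phi1) + cmath.rect(r2, phi2)
    return abs(z) ** 2

def interference_term(r1, phi1, r2, phi2):
    """Cross term 2*r1*r2*cos(phi1 - phi2) from expanding the squared modulus."""
    return 2.0 * r1 * r2 * math.cos(phi1 - phi2)

# Hypothetical weights of two words on the same sememe.
r1, phi1 = 0.6, 0.3
r2, phi2 = 0.5, 2.0

direct = combined_probability(r1, phi1, r2, phi2)
expanded = r1 ** 2 + r2 ** 2 + interference_term(r1, phi1, r2, phi2)

# Opposite phases give fully destructive interference for equal amplitudes.
destructive = combined_probability(0.5, 0.0, 0.5, math.pi)
```

With equal phases the cross term is as large as it can be (constructive interference), and with a phase difference of π it is as negative as it can be, which is what lets the phases change how two word meanings combine on a sememe.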
2.3. Semantic Compositions
As is illustrated in Fig. 1, we view a word composition (e.g. a sentence) as a bag of words (Harris, 1954), each of which is modeled as a particle in superposition state on the Semantic Hilbert Space $\mathcal{H}$. To obtain the semantic composition of words, we leverage the concept of quantum mixture and formulate the word composition as a mixed system composed of the word superposition states. The system is in a mixed state represented by an $n$-by-$n$ density matrix $\rho$ on $\mathcal{H}$, which is positive semi-definite with trace 1. It is computed as follows:
$$\rho = \sum_{i} p(i)\, |w_i\rangle\langle w_i| \tag{3}$$
where $|w_i\rangle$ denotes the superposition state of the $i$-th word and $p(i)$ is the classical probability of the state $|w_i\rangle$ with $\sum_i p(i) = 1$. It determines the contribution of the word $w_i$ to the overall semantics.
The complex-valued density matrix $\rho$ can be seen as a non-classical distribution of sememes in $\mathcal{H}$. Its diagonal elements are real and form a classical distribution of sememes, while its complex-valued off-diagonal entries encode the interplay between sememes, which in turn gives rise to the interference between words. A density matrix assigns a probability value to any state on $\mathcal{H}$ such that the values for any set of orthogonal states sum up to 1 (Gleason, 1957). Hence it is visualized as an ellipsoid in Fig. 1, assigning a quantum probability to a unit vector with the intersection length.
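To make the mixture concrete, the following sketch builds a density matrix as a weighted sum of outer products of unit word states and checks the properties stated above (trace 1, Hermitian, real non-negative diagonal). It is a minimal pure-Python illustration: the helper names `ket` and `density_matrix`, the three-sememe toy states and the mixture weights are all assumptions, not values from the paper.

```python
import cmath
import math

def ket(amplitudes, phases):
    """Unit-length complex vector with components r_j * e^{i*phi_j}."""
    norm = math.sqrt(sum(r * r for r in amplitudes))
    return [cmath.rect(r / norm, phi) for r, phi in zip(amplitudes, phases)]

def density_matrix(kets, probs):
    """Weighted sum of outer products |w_i><w_i| with weights p_i summing to 1."""
    n = len(kets[0])
    rho = [[0j] * n for _ in range(n)]
    for w, p in zip(kets, probs):
        for a in range(n):
            for b in range(n):
                rho[a][b] += p * w[a] * w[b].conjugate()
    return rho

# Two hypothetical word states over three sememes, mixed with weights 0.25 and 0.75.
words = [ket([1.0, 2.0, 2.0], [0.0, 0.5, -1.0]),
         ket([3.0, 0.0, 4.0], [1.2, 0.0, 0.7])]
rho = density_matrix(words, [0.25, 0.75])

trace = sum(rho[j][j] for j in range(3))
diagonal = [rho[j][j].real for j in range(3)]  # classical distribution over sememes
```

The diagonal recovers an ordinary probability distribution over sememes, while the off-diagonal entries keep the phase information that a real-valued weighted average of word vectors would discard.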
2.4. Semantic Measurements
As a non-classical probability distribution, a sentence density matrix carries rich information and in particular it contains all the information about a quantum system. In order to extract the information relevant to a concrete task from the semantic composition, we build a set of measurements and compute the probability that the mixed system falls onto each of the measurements as a high-level abstraction of the semantic composition.
Suppose our proposed semantic measurements are associated with a set of measurement projectors $\{P_i\}_{i=1}^{k}$. According to Born's rule (Born, 1926), applying the measurement projector $P_i$ onto the sentence density matrix $\rho$ yields the following result:
$$p_i = \mathrm{tr}(P_i \rho) \tag{4}$$
Here, we only consider pure states as measurement states, i.e. $P_i = |v_i\rangle\langle v_i|$. Moreover, we ignore the constraints on the measurement states (i.e. orthogonality or completeness), but keep them trainable, so that the most suitable measurements can be determined automatically by the data in a concrete task, such as classification or regression. In this way, the trainable semantic measurements can be understood as an approach similar to supervised dimensionality reduction (Fisher, 1936), but in a quantum probability framework with complex values.
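For a pure measurement state v, the trace of the projector |v⟩⟨v| applied to a density matrix reduces to the quadratic form ⟨v|ρ|v⟩. The sketch below evaluates it on a hand-written 2-by-2 density matrix; the function name `measure` and all numeric values are illustrative assumptions.

```python
import math

def measure(rho, v):
    """Born-rule probability tr(|v><v| rho) = <v|rho|v> for a unit vector v."""
    n = len(v)
    value = sum(v[a].conjugate() * rho[a][b] * v[b]
                for a in range(n) for b in range(n))
    return value.real

s = 1.0 / math.sqrt(2.0)

# Density matrix of the pure state (1, i) / sqrt(2), written out by hand.
rho = [[0.5, -0.5j],
       [0.5j, 0.5]]

p_same = measure(rho, [s, s * 1j])    # measuring along the state itself
p_orth = measure(rho, [s, -s * 1j])   # measuring along the orthogonal state
p_axis = measure(rho, [1.0, 0.0])     # measuring along the first sememe axis
```

Here `p_same` and `p_orth` belong to an orthonormal pair of measurement states, so they sum to 1. Trainable measurements that are not constrained to be orthogonal or complete, as in the text, lose that guarantee, but each value remains a probability in [0, 1].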
3. Quantum Probability Driven Network
In order to implement the proposed framework, we further propose an end-to-end neural network based on quantum probability. Fig. 2 shows the architecture of the proposed Quantum Probability Driven Network (QPDN). The embedding layer, which is composed of a unit complex-valued embedding and a term-weight lookup table, captures the basic lexical features. The mixture layer is designed to combine the low-level bag-of-word features with an additive complex-valued outer product operation. The measurement layer adopts a set of trainable semantic measurements to extract the higher-level features for the final linear classifier. In the following we will introduce the architecture layer by layer.
3.1. Embedding Layer
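The parameters of the embedding layer are $\{R, \Phi, \Pi\}$, respectively denoting the amplitude embedding, the phase embedding, and the term-weight lookup table. Eq. 1 expresses a quantum representation as a unit-length, complex-valued vector representation for a word $w$, i.e. $|w\rangle = [r_1 e^{i\phi_1}, r_2 e^{i\phi_2}, \dots, r_n e^{i\phi_n}]^T$. The term-weight lookup table is used to weight words for semantic combinations, which will be described in the next subsection. During training, word embeddings need to be normalized to unit length after each batch.
This representation allows for a non-linear composition of amplitudes and phases in its mathematical form. Suppose two words $w_1$ and $w_2$ have weights $r_j^{(1)} e^{i\phi_j^{(1)}}$ and $r_j^{(2)} e^{i\phi_j^{(2)}}$ for the $j$-th dimension (corresponding to the $j$-th sememe). The combination of $w_1$ and $w_2$ gives rise to a weight $r_j e^{i\phi_j}$ for the $j$-th dimension computed as
$$r_j e^{i\phi_j} = r_j^{(1)} e^{i\phi_j^{(1)}} + r_j^{(2)} e^{i\phi_j^{(2)}} = \sqrt{|r_j^{(1)}|^2 + |r_j^{(2)}|^2 + 2 r_j^{(1)} r_j^{(2)} \cos\left(\phi_j^{(1)} - \phi_j^{(2)}\right)} \; e^{\, i \arctan\left( \frac{r_j^{(1)} \sin\phi_j^{(1)} + r_j^{(2)} \sin\phi_j^{(2)}}{r_j^{(1)} \cos\phi_j^{(1)} + r_j^{(2)} \cos\phi_j^{(2)}} \right)} \tag{5}$$
where both $r_j$ and $\phi_j$ are non-linear combinations of $r_j^{(1)}$, $r_j^{(2)}$, $\phi_j^{(1)}$ and $\phi_j^{(2)}$. If the amplitudes and phases are associated with different levels of information, the amplitude-phase representation then naturally gives rise to a non-linear fusion of information.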
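The non-linearity can be seen by writing out the amplitude and phase of the sum of two complex weights. The sketch below computes them in closed form and compares the result with Python's own polar conversion. It uses `math.atan2` instead of a plain arctangent so that the phase lands in the correct quadrant; that choice, the function name `compose` and the sample values are assumptions made for illustration.

```python
import cmath
import math

def compose(r1, phi1, r2, phi2):
    """Amplitude and phase of r1*e^{i*phi1} + r2*e^{i*phi2} in closed form."""
    r = math.sqrt(r1 ** 2 + r2 ** 2 + 2.0 * r1 * r2 * math.cos(phi1 - phi2))
    phi = math.atan2(r1 * math.sin(phi1) + r2 * math.sin(phi2),
                     r1 * math.cos(phi1) + r2 * math.cos(phi2))
    return r, phi

# Hypothetical weights of two words on the same dimension.
r1, phi1 = 0.6, 0.3
r2, phi2 = 0.5, 2.0

r, phi = compose(r1, phi1, r2, phi2)

# Reference values from adding the complex numbers directly.
r_ref, phi_ref = cmath.polar(cmath.rect(r1, phi1) + cmath.rect(r2, phi2))
```

The resulting amplitude depends on both input phases and the resulting phase depends on both input amplitudes, so neither quantity is a linear function of its counterparts in the two words.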
3.2. Mixture Layer
A sentence is modeled as a density matrix, which is constructed as in Sec. 2.3. Instead of the uniform weights in Eq. 3, a word-sensitive weight is used for each word, as is common in IR, e.g. inverse document frequency (IDF) as a word-dependent weight in the TF-IDF scheme (Sparck Jones, 1972).
In order to guarantee the unit trace of the density matrix, the word weights taken from the lookup table for a sentence are normalized to probability values through a softmax operation: $p(i) = \frac{e^{\pi(w_i)}}{\sum_{j=1}^{m} e^{\pi(w_j)}}$, where $\pi(w_i)$ denotes the term weight of word $w_i$ from the lookup table and $m$ is the number of words in the sentence. Compared to the IDF weight, the normalized weight for a specific word in our approach is not static but updated adaptively in the training phase. Even in the inference/test phase, the real term weight, i.e. $p(i)$, is also not static, but highly depends on the neighboring context words through the non-linear softmax function.
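The context dependence of the normalized weights can be illustrated with a small sketch. It assumes a lookup table that stores one trainable scalar per word and applies a softmax over the words of each sentence; the table contents and the two example sentences are made up for illustration.

```python
import math

def softmax(scores):
    """Normalize scores to a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical term-weight lookup table (one scalar per word).
term_weight = {"ivory": 1.2, "tower": 0.9, "the": -0.5}

p_short = softmax([term_weight[w] for w in ["ivory", "tower"]])
p_long = softmax([term_weight[w] for w in ["the", "ivory", "tower"]])

ivory_in_short = p_short[0]
ivory_in_long = p_long[1]
```

The stored scalar for "ivory" is the same in both sentences, yet its mixture weight shrinks when another word joins the sentence, which is the sense in which the effective term weight is not static at inference time.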
3.3. Measurement Layer
The measurement layer adopts a set of rank-one measurement projectors $\{|v_i\rangle\langle v_i|\}_{i=1}^{k}$, where $|v_i\rangle\langle v_i|$ is the outer product of its corresponding state $|v_i\rangle$ in the Semantic Hilbert Space $\mathcal{H}$. After each measurement, we obtain one probability $q_j$ for each measurement state, i.e. $q_j = \mathrm{tr}(\rho\, |v_j\rangle\langle v_j|)$. Finally, we obtain a vector $\vec{q} = [q_1, \dots, q_k]$. Similarly to the word vectors, the measurement states are represented as unit states and normalized after several batches.
3.4. Dense Layer
The vector $\vec{q}$ in the measurement layer consists of $k$ positive scalar numbers and it is used to infer the label for a given sentence. A dense layer with softmax activation is adopted after the measurement layer to get a classification probability distribution, i.e. $\hat{y} = \mathrm{softmax}(\vec{q} \cdot W)$. The loss is designed as a cross-entropy loss between $\hat{y}$ and the one-hot label $y$.
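The four layers of this section can be strung together in a short forward pass. The following is a minimal pure-Python sketch, not the authors' implementation (their code is linked in the abstract). All function names, the two-word toy vocabulary and the parameter values are hypothetical, the dense layer is assumed to have no bias, and no training step or re-normalization schedule is shown.

```python
import cmath
import math

def unit_ket(amplitudes, phases):
    """Embedding layer: unit-length complex vector with entries r_j * e^{i*phi_j}."""
    norm = math.sqrt(sum(r * r for r in amplitudes))
    return [cmath.rect(r / norm, phi) for r, phi in zip(amplitudes, phases)]

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def density_matrix(kets, probs):
    """Mixture layer: weighted sum of outer products of the word states."""
    n = len(kets[0])
    return [[sum(p * w[a] * w[b].conjugate() for w, p in zip(kets, probs))
             for b in range(n)] for a in range(n)]

def measure(rho, v):
    """Measurement layer: <v|rho|v>, the trace of rho times the projector on v."""
    n = len(v)
    return sum(v[a].conjugate() * rho[a][b] * v[b]
               for a in range(n) for b in range(n)).real

def forward(sentence, amp, phase, term_weight, measurements, dense_w):
    kets = [unit_ket(amp[w], phase[w]) for w in sentence]
    probs = softmax([term_weight[w] for w in sentence])
    rho = density_matrix(kets, probs)
    q = [measure(rho, v) for v in measurements]
    n_labels = len(dense_w[0])
    logits = [sum(q[j] * dense_w[j][c] for j in range(len(q)))
              for c in range(n_labels)]
    return q, softmax(logits)  # dense layer with softmax activation

def cross_entropy(y_hat, label):
    """Cross-entropy against a one-hot label given as a class index."""
    return -math.log(y_hat[label])

# Toy, hand-picked parameters: 2 sememes, 2 measurements, 2 labels.
amp = {"ivory": [1.0, 2.0], "tower": [2.0, 1.0]}
phase = {"ivory": [0.0, 1.0], "tower": [0.5, -0.5]}
term_weight = {"ivory": 0.3, "tower": -0.2}
measurements = [[1.0 + 0j, 0j], [0j, 1.0 + 0j]]  # orthonormal here, so q sums to 1
dense_w = [[1.0, -1.0], [-0.5, 0.5]]

q, y_hat = forward(["ivory", "tower"], amp, phase, term_weight, measurements, dense_w)
loss = cross_entropy(y_hat, 0)
```

Building the density matrix explicitly mirrors the layer structure. For long sentences or large dimensions one would normally rely on an array library instead of nested Python loops.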
4. Experiments
Our model was evaluated on six datasets for text classification: CR customer review (Hu and Liu, 2014), MPQA opinion polarity (Wiebe et al., 2005), SUBJ sentence subjectivity (Pang and Lee, 2005), MR movie review (Pang and Lee, 2005), SST binary sentiment classification (Socher et al., 2013), and TREC question classification (Li and Roth, 2002). Their statistics are shown in Tab. 1.
We compared the proposed QPDN with various models, including Uni-TFIDF, Word2vec, FastText (Joulin et al., 2016) and Sent2Vec (Pagliardini et al., 2018) as unsupervised representation learning baselines, CaptionRep (Hill et al., 2016a) and DictRep (Hill et al., 2016b) as supervised representation learning baselines, as well as CNN (Kim, 2014) and BiLSTM (Conneau et al., 2017b) for advanced deep neural networks. In Tab. 2, we reported the classification accuracy values of these models from the original papers.
We used GloVe word vectors (Pennington et al., 2014) with 50, 100, 200 and 300 dimensions, respectively. The amplitude embedding values are initialized by L2-norm, while the phases in the complex-valued embedding are randomly initialized. We searched for the best performance in a parameter pool covering the learning rate, the L2-regularization ratio, the batch size, and the number of measurements.
The main parameters in our model are $R$ and $\Phi$. Since both of them are of shape $|V| \times n$, the number of parameters is roughly two times that of FastText (Mikolov et al., 2013). For the other parameters, $\Pi$ is $|V| \times 1$, $\{|v_i\rangle\}_{i=1}^{k}$ is $k \times 2n$, while $W$ is $k \times |L|$ with $L$ being the label set. Apart from the word embeddings, the model is compact, with only $k \times 2n + k \times |L|$ parameters.
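A rough parameter count makes the comparison with a real-valued embedding model tangible. The sketch below rests on assumed shapes: each of the two embedding tables has one n-dimensional real row per vocabulary word, the term-weight table has one scalar per word, each of the k measurement states is an n-dimensional complex vector stored as 2n reals, and the dense layer is a bias-free k-by-|L| matrix. The vocabulary size follows the SST row of Tab. 1 and the dimension follows the ablation setting, while k = 10 is an arbitrary illustrative choice.

```python
def qpdn_parameter_counts(vocab_size, dim, n_measurements, n_labels):
    """Parameter counts per component under the shape assumptions stated above."""
    return {
        "embedding": 2 * vocab_size * dim,          # amplitude table + phase table
        "term_weights": vocab_size,                 # one scalar weight per word
        "measurements": n_measurements * 2 * dim,   # real and imaginary parts
        "dense": n_measurements * n_labels,         # bias-free dense layer (assumption)
    }

counts = qpdn_parameter_counts(vocab_size=18_000, dim=50, n_measurements=10, n_labels=2)

real_embedding = 18_000 * 50  # single real-valued table of the same dimension
ratio = counts["embedding"] / real_embedding
beyond_embeddings = counts["measurements"] + counts["dense"]
```

Under these assumptions almost all parameters sit in the two embedding tables, and the measurement and dense layers together add only on the order of a thousand values.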
The results in Tab. 2 demonstrate the effectiveness of our model, which improves classification accuracy over strong supervised and unsupervised representation learning baselines on every dataset except MPQA. In comparison with more advanced models including BiLSTM and CNN, our model generally performed better than BiLSTM, with higher accuracy on the multi-class classification dataset (TREC) and on three binary text classification datasets (MR, SST and SUBJ). However, it under-performed CNN on all six datasets, with a difference of over 2% on three of them (MPQA, SST and TREC), probably because it uses fewer parameters and a simpler structure.
We argue that QPDN achieves a good balance between effectiveness and efficiency, given that it outperforms BiLSTM on four of the six datasets despite its simpler structure.
5. Discussions
This section discusses the power of self-explanation and conducts an ablation test to examine the usefulness of important components of the network, especially the complex-valued word embedding.
Self-explanation Components
As is shown in Tab. 3, all components in our model have a clear physical meaning corresponding to quantum probability, whereas a classical Deep Neural Network (DNN) cannot well explain the role each component plays in the network. Essentially, we constructed a bottom-up framework to represent each level of semantic units on a uniform Semantic Hilbert Space, from the minimum semantic unit, i.e. sememe, to the sentence representation. The framework was operationalized through superposition, mixture and semantic measurements.
Ablation Test
An ablation test was conducted to examine how each component influences the final performance of QPDN. In particular, a double-length real word embedding network was implemented to examine the use of complex-valued word embedding, while mean weights and IDF weights were compared with our proposed trainable weights. A set of non-trainable orthogonal projectors and a dense layer on top of the sentence density matrix were implemented to analyze the effect of trainable semantic measurements.
Due to limited space, we only report the ablation test result for SST, which is the largest and hence the most representative dataset. We used 100-dimensional real-valued word vectors and 50-dimensional complex-valued vectors for the models in the ablation test. All models under ablation were comparable in terms of time cost. Tab. 4 shows that each component plays an important role in the QPDN model. In particular, replacing the complex embedding with a double-dimension real word embedding led to a drop of about 5 percentage points in accuracy (from 0.8391 to 0.7883), which indicates that the benefit of the complex-valued word embedding does not merely come from doubling the number of parameters.
The comparison with IDF and mean weights showed that the data-driven scheme gave rise to high-quality word weights. The comparison with non-trainable projectors and directly applying a dense layer on the density matrix showed that trainable measurements bring benefits to the network.
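The differences reported in the ablation table can be re-derived from the accuracy column. The short sketch below does this arithmetic with the accuracies copied from the table; the short dictionary keys are labels introduced here for brevity.

```python
qpdn_accuracy = 0.8391

# SST accuracies of the ablated settings, copied from the ablation table.
ablations = {
    "fasttext": 0.7880,
    "fasttext_double_dim": 0.7883,
    "fixed_amplitude_trainable_phase": 0.8199,
    "fixed_mean_weights": 0.8303,
    "fixed_idf_weights": 0.8259,
    "fixed_orthogonal_projectors": 0.8171,
    "dense_layer_instead_of_projectors": 0.8221,
}

# Difference of each setting from the full model, rounded to four decimals.
deltas = {name: round(acc - qpdn_accuracy, 4) for name, acc in ablations.items()}

# Setting whose removal hurts least (smallest absolute difference).
smallest_drop = max(deltas, key=deltas.get)
```

Among the QPDN variants, swapping the trainable term weights for fixed mean weights costs the least accuracy, while fixing the projectors to orthogonal ones costs the most.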
Discriminative Semantic Directions
In order to better understand the well-trained measurement projectors, we obtained the top 10 nearest words in the complex-valued vector space for each trained measurement state (like $|v_i\rangle$), using a KD tree (Bentley, 1975). Due to limited space, we selected five measurements from the trained model for the MR dataset, and selected words from the top 10 nearest words to each measurement. As can be seen in Tab. 5, the first measurement was roughly about time, while the second one was related to verbs that mainly mean ‘motivating others’. The third measurement grouped uncommon non-English words together. The last two measurements also grouped words sharing similar meanings. It is therefore interesting to see that relevant words can somehow be grouped together into certain topics during the training process, which may be discriminative for the given task.
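The neighborhood lookup can be sketched as follows. The paper uses a KD tree for the search. This illustration swaps in a brute-force scan, which returns the same neighbors on a small vocabulary, and measures closeness by the Euclidean distance between complex vectors. The three placeholder words and their embeddings are invented and do not come from the trained model.

```python
import math

def complex_distance(u, v):
    """Euclidean distance between two complex vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))

def nearest_words(state, embeddings, top_k):
    """Words sorted by distance to a measurement state (brute force, not a KD tree)."""
    ranked = sorted(embeddings, key=lambda w: complex_distance(state, embeddings[w]))
    return ranked[:top_k]

# Hypothetical unit-length complex word embeddings over two sememes.
embeddings = {
    "word_a": [0.6 + 0.0j, 0.0 + 0.8j],
    "word_b": [0.8 + 0.0j, 0.0 + 0.6j],
    "word_c": [0.0 - 1.0j, 0.0 + 0.0j],
}

# Hypothetical trained measurement state (unit length: 0.28^2 + 0.96^2 = 1).
measurement_state = [0.28 + 0.0j, 0.0 + 0.96j]

neighbors = nearest_words(measurement_state, embeddings, top_k=2)
```

For a real vocabulary, one common way to reuse an off-the-shelf KD tree is to concatenate the real and imaginary parts of each vector, since that preserves the same Euclidean distance.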
6. Conclusions
In order to better model the non-linearity of word semantic composition, we have developed a quantum-inspired framework that models different granularities of semantic units on the same Semantic Hilbert Space, and implemented this framework in an end-to-end text classification network. The network showed promising performance on six benchmarking text datasets, in terms of effectiveness, robustness and self-explanation ability. Moreover, the complex-valued word embedding, which inherently achieves a non-linear combination of word meanings, brought benefits to the classification accuracy in a comprehensive ablation study.
This work is among the first steps to apply the quantum probabilistic framework to text modeling. We believe it is a promising direction. In the future, we would like to further extend this work by considering deeper and more complicated structures such as attention or memory mechanisms in language, in order to investigate related quantum-like phenomena on textual data and to provide more intuitive insights. Additionally, a Semantic Hilbert Space built on a tensor space, as in (Zhang et al., 2018b), is also worth exploring and may provide more interesting insights for current communities.
ACKNOWLEDGEMENT
This work is supported by the Quantum Access and Retrieval Theory (QUARTZ) project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321, and by the National Key Research and Development Program of China (grant No. 2018YFC0831700), Natural Science Foundation of China (grant No. U1636203), and Major Project of Zhejiang Lab (grant No. 2019DH0ZX01).
REFERENCES
- Aerts et al. (2013) Diederik Aerts, Liane Gabora, and Sandro Sozzo. 2013. Concepts and Their Dynamics: A Quantum-Theoretic Modeling of Human Thought. Topics in Cognitive Science (Sept. 2013). https://doi.org/10.1111/tops.12042 arXiv: 1206.1069.
- Aerts and Sozzo (2014) Diederik Aerts and Sandro Sozzo. 2014. Quantum Entanglement in Concept Combinations. International Journal of Theoretical Physics 53, 10 (Oct. 2014), 3587–3603. https://doi.org/10.1007/s10773-013-1946-z
- Arora et al. (2016) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. (Nov. 2016). https://openreview.net/forum?id=SyK00v5xx
- Bentley (1975) Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509–517.
- Born (1926) Max Born. 1926. Zur Quantenmechanik der Stoßvorgänge. Zeitschrift für Physik 37, 12 (Dec. 1926), 863–867. https://doi.org/10.1007/BF01397477
- Bruza et al. (2009) Peter Bruza, Kirsty Kitto, Douglas Nelson, and Cathy McEvoy. 2009. Is there something quantum-like about the human mental lexicon? Journal of Mathematical Psychology 53, 5 (2009), 362–377.
- Bruza et al. (2008) Peter D. Bruza, Kirsty Kitto, Douglas Nelson, and Cathy McEvoy. 2008. Entangling words and meaning. (2008).
