TL;DR
This paper introduces a generative latent variable model for learning multilingual word representations from dictionaries, enabling robust alignment across languages and performing well despite noisy data.
Contribution
It presents a novel offline approach using a generative model to align multilingual embeddings, improving robustness and alignment quality.
Findings
Achieves competitive results on multilingual tasks
Robust to noise in embedding space
Effective for distributed representations from noisy corpora
Abstract
In this work we approach the task of learning multilingual word representations in an offline manner by fitting a generative latent variable model to a multilingual dictionary. We model equivalent words in different languages as different views of the same word generated by a common latent variable representing their latent lexical meaning. We explore the task of alignment by querying the fitted model for multilingual embeddings achieving competitive results across a variety of tasks. The proposed model is robust to noise in the embedding space making it a suitable method for distributed representations learned from noisy corpora.
Table 1: Word translation results.

| Method | en-es | es-en | en-fr | fr-en | en-de | de-en | en-ru | ru-en | en-zh | zh-en |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised | | | | | | | | | | |
| SVD | 77.4 | 77.3 | 74.9 | 76.1 | 68.4 | 67.7 | 47.0 | 58.2 | 27.3* | 09.3* |
| IBFA | 79.5 | 81.5 | 77.3 | 79.5 | 70.7 | 72.1 | 46.7 | 61.3 | 42.9 | 36.9 |
| SVD+CSLS | 81.4 | 82.9 | 81.1 | 82.4 | 73.5 | 72.4 | 51.7 | 63.7 | 32.5* | 25.1* |
| IBFA+CSLS | 81.7 | 84.1 | 81.9 | 83.4 | 74.1 | 75.7 | 50.5 | 66.3 | 48.4 | 41.7 |
| Semi-supervised | | | | | | | | | | |
| SVD | 65.9 | 74.1 | 71.0 | 72.7 | 60.3 | 65.3 | 11.4 | 37.7 | 06.8 | 00.8 |
| IBFA | 76.1 | 80.1 | 77.1 | 78.9 | 66.8 | 71.8 | 23.1 | 39.9 | 17.1 | 24.0 |
| AdvR | 79.1 | 78.1 | 78.1 | 78.2 | 71.3 | 69.6 | 37.3 | 54.3 | 30.9 | 21.9 |
| SVD+CSLS | 73.0 | 80.7 | 75.7 | 79.6 | 65.3 | 70.8 | 20.9 | 41.5 | 10.5 | 01.7 |
| IBFA+CSLS | 76.5 | 83.7 | 78.6 | 82.3 | 68.7 | 73.7 | 25.3 | 46.3 | 22.1 | 27.2 |
| AdvR+CSLS | 81.7 | 83.3 | 82.3 | 82.1 | 74.0 | 72.2 | 44.0 | 59.1 | 32.5 | 31.4 |

Table 2: Comparison with SVD and CCA on English-Italian word translation, using expert dictionaries and pseudo-dictionaries.

| Method | English to Italian | | | Italian to English | | | English to Italian | | | Italian to English | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | @1 | @5 | @10 | @1 | @5 | @10 | @1 | @5 | @10 | @1 | @5 | @10 |
| Mikolov et al. | 33.8 | 48.3 | 53.9 | 24.9 | 41.0 | 47.4 | 1.0 | 2.8 | 3.9 | 2.5 | 6.4 | 9.1 |
| CCA (Sklearn) | 36.1 | 52.7 | 58.1 | 31.0 | 49.9 | 57.0 | 29.1 | 46.4 | 53.0 | 27.0 | 47.0 | 52.3 |
| CCA | 30.9 | 48.1 | 52.7 | 27.7 | 45.5 | 51.0 | 26.5 | 42.5 | 48.1 | 22.8 | 40.1 | 45.5 |
| SVD | 36.9 | 52.7 | 57.9 | 32.2 | 49.6 | 55.7 | 27.1 | 43.4 | 49.3 | 26.2 | 42.1 | 49.0 |
| IBFA (Ours) | 39.3 | 55.3 | 60.1 | 34.7 | 53.5 | 59.4 | 34.7 | 52.6 | 58.3 | 33.7 | 53.3 | 59.2 |

Table 3: Word similarity results.

| Embeddings | WS | WS-SIM | WS-REL | RG-65 | MC-30 | MT-287 | MT-771 | MEN-TR |
|---|---|---|---|---|---|---|---|---|
| English | 73.7 | 78.1 | 68.2 | 79.7 | 81.2 | 67.9 | 66.9 | 76.4 |
| IBFA en-de | 74.4 | 79.4 | 68.3 | 81.4 | 84.2 | 67.2 | 69.4 | 77.8 |
| IBFA en-fr | 72.4 | 77.8 | 65.8 | 80.5 | 83.0 | 68.2 | 69.6 | 77.6 |
| IBFA en-es | 73.6 | 78.5 | 67.0 | 79.0 | 83.0 | 68.2 | 69.4 | 77.3 |
| CCA en-de | 71.7 | 76.4 | 64.0 | 76.7 | 82.4 | 63.0 | 64.7 | 75.3 |
| CCA en-fr | 70.9 | 76.4 | 63.3 | 76.5 | 81.4 | 63.4 | 65.4 | 74.9 |
| CCA en-es | 70.8 | 76.3 | 63.1 | 76.4 | 81.2 | 63.0 | 65.1 | 74.7 |

Table 4: Monolingual sentence similarity (STS) results.

| Embeddings | STS12 | STS13* | STS14 | STS15 | STS16 |
|---|---|---|---|---|---|
| English | 58.1 | 69.2 | 66.7 | 72.6 | 70.6 |
| IBFA en-de | 58.1 | 70.2 | 66.8 | 73.0 | 71.6 |
| IBFA en-fr | 58.0 | 70.0 | 66.7 | 72.8 | 71.4 |
| IBFA en-es | 57.9 | 69.7 | 66.6 | 72.9 | 71.7 |
| CCA en-de | 56.7 | 67.5 | 65.7 | 73.1 | 70.5 |
| CCA en-fr | 56.7 | 67.9 | 65.9 | 72.8 | 70.8 |
| CCA en-es | 56.6 | 67.8 | 65.9 | 72.9 | 70.8 |

Table 5: English-Italian sentence retrieval results on Europarl.

| Method | English to Italian | | | Italian to English | | |
|---|---|---|---|---|---|---|
| | @1 | @5 | @10 | @1 | @5 | @10 |
| Mikolov et al. | 10.5 | 18.7 | 22.8 | 12.0 | 22.1 | 26.7 |
| Dinu et al. | 45.3 | 72.4 | 80.7 | 48.9 | 71.3 | 78.3 |
| Smith et al. | 54.6 | 72.7 | 78.2 | 42.9 | 62.2 | 69.2 |
| SVD | 40.5 | 52.6 | 56.9 | 51.2 | 63.7 | 67.9 |
| IBFA (Ours) | 62.7 | 74.2 | 77.9 | 64.1 | 75.2 | 79.5 |
| SVD + CSLS | 64.0 | 75.8 | 78.5 | 67.9 | 79.4 | 82.8 |
| AdvR + CSLS | 66.2 | 80.4 | 83.4 | 58.7 | 76.5 | 80.9 |
| IBFA + CSLS | 68.8 | 80.7 | 83.5 | 70.2 | 80.8 | 84.8 |

Table 6: Word translation results when aligning three languages.

| Method | en-it | it-en | en-fr | fr-en | it-fr | fr-it |
|---|---|---|---|---|---|---|
| SVD | 71.0 | 72.4 | 74.9 | 76.1 | 78.3 | 72.9 |
| MBFA | 71.9 | 73.4 | 76.7 | 78.1 | 82.6 | 77.5 |
| SVD+CSLS | 76.2 | 77.9 | 81.1 | 82.4 | 84.5 | 79.8 |
| MBFA+CSLS | 77.4 | 77.7 | 81.9 | 82.1 | 86.8 | 81.9 |

Table 7: Word pairs generated from a trained English-Spanish IBFA model.

| en | es | es→en |
|---|---|---|
| particular | efectivamente | effectively |
| correspondingly | esto | this |
| silly | irónicamente | ironic |
| frightening | brutalidad | brutality |
| manipulations | intencionadamente | intentionally |
| ignore | contraproducente | counter-productive |
| fundamentally | entendido | understood |
| embarrassed | enojado | angry |
| terrified | casualidad | coincidence |
| hypocritical | obviamente | obviously |
| wondered | incómodo | uncomfortable |
| oftentimes | apostar | betting |
| unwittingly | traicionar | betray |
| mishap | irónicamente | ironically |
| veritable | empero | however |
| overpowered | deshacerse | fall apart |
| crazed | divertidos | merry |
| frightening | ironía | irony |
| dreadful | desesperación | despair |
| instituting | restablecimiento | recover |
| unrealistic | cuestionamiento | questioning |
| regrettable | erróneos | mistaken |
| irresponsible | preocupaciones | concerns |
| obsession | irremediablemente | hopelessly |
| embodied | voluntad | will |
| misguided | esconder | conceal |
| perspective | contestación | answer |
| reactionary | conservadurismo | conservatism |

Table 8: MBFA results after 1,000 and 20,000 iterations of EM.

| Method | EN-IT | IT-EN | EN-FR | FR-EN | IT-FR | FR-IT |
|---|---|---|---|---|---|---|
| MBFA-1K | 71.9 | 73.3 | 76.7 | 78.2 | 82.4 | 77.5 |
| MBFA-20K | 71.9 | 73.4 | 76.7 | 78.1 | 82.6 | 77.5 |
| MBFA-1K+CSLS | 77.5 | 77.6 | 81.9 | 82.0 | 86.8 | 82.1 |
| MBFA-20K+CSLS | 77.4 | 77.7 | 81.9 | 82.1 | 86.8 | 81.9 |
Multilingual Factor Analysis
Francisco Vargas, Kamen Brestnichki, Alex Papadopoulos-Korfiatis
Nils Hammerla
Babylon Health
{firstname.lastname, alex.papadopoulos}@babylonhealth.com
1 Introduction
Popular approaches for multilingual alignment of word embeddings build on the observation of Mikolov et al. (2013a) that continuous word embedding spaces (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017; Joulin et al., 2017) exhibit similar structures across languages. This observation has led to multiple successful methods in which a direct linear mapping between the two spaces is learned through a least-squares-based objective (Mikolov et al., 2013a; Smith et al., 2017; Xing et al., 2015) using a paired bilingual dictionary.
An alternative set of approaches based on Canonical Correlation Analysis (CCA; Knapp, 1978) seeks to project monolingual embeddings into a shared multilingual space (Faruqui and Dyer, 2014b; Lu et al., 2015). Both of these methods aim to exploit the correlations between the monolingual vector spaces when projecting into the aligned multilingual space. The multilingual embeddings of Faruqui and Dyer (2014b) and Lu et al. (2015) are shown to improve on word-level semantic tasks, which supports the authors’ claim that multilingual information enhances semantic spaces.
In this paper we present a new non-iterative method based on variants of factor analysis (Browne, 1979; McDonald, 1970; Browne, 1980) for aligning monolingual representations into a multilingual space. Our generative model assumes that a single word translation pair is generated by an embedding representing the lexical meaning of the underlying concept. We achieve competitive results across a wide range of tasks compared to state-of-the-art methods, and we conjecture that our multilingual latent variable model has sound generative properties that match those of psycholinguistic theories of the bilingual mind (Weinreich, 1953). Furthermore, we show how our model extends to more than two languages within the generative framework, something that previous alignment models are not naturally suited to; they instead resort to combining bilingual models with a pivot, as in Ammar et al. (2016).
Additionally, the general benefit of the probabilistic setup, as discussed in Tipping and Bishop (1999), is that it offers the potential to extend the scope of conventional alignment methods to model and exploit linguistic structure more accurately. An example of such a benefit could be modelling how corresponding word translations can be generated by more than just a single latent concept. This assumption can be encoded by a mixture of Factor Analysers (Ghahramani et al., 1996) to model word polysemy in a similar fashion to Athiwaratkun and Wilson (2017), where mixtures of Gaussians are used to reflect the different meanings of a word.
The main contribution of this work is the application of a well-studied graphical model to a novel domain, outperforming previous approaches on word and sentence-level translation retrieval tasks. We put the model through a battery of tests, showing it aligns embeddings across languages well, while retaining performance on monolingual word-level and sentence-level tasks. Finally, we apply a natural extension of this model to more languages in order to align three languages into a single common space.
2 Background
Previous work on embedding alignment has assumed that alignment is a directed procedure, i.e. that we want to align French embeddings to English embeddings. However, another approach would be to align both to a common latent space that is not necessarily the same as either of the original spaces. This motivates applying a well-studied latent variable model to this problem.
2.1 Factor Analysis
Factor analysis (Spearman, 1904; Thurstone, 1931) is a technique originally developed in psychology to study the correlation of latent factors with observed measurements. Formally:
[TABLE]
To learn the parameters of the model, we maximise the marginal likelihood of the observations with respect to those parameters. The resulting maximum likelihood estimates can be used to obtain a latent representation for a given observation, namely the posterior mean of the latent variable. Such projections have been found to be generalisations of principal component analysis (Pearson, 1901), as studied in Tipping and Bishop (1999).
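As a minimal sketch of the projection step, the snippet below samples from a linear-Gaussian factor model and computes the posterior mean of the latent variable for each observation. The names `W`, `mu` and `psi_diag` (loading matrix, mean, and diagonal of the noise covariance) and both function names are notation assumed here for illustration; they are not taken from the text above.

```python
import numpy as np

def sample_factor_model(W, mu, psi_diag, n, rng):
    """Draw n rows x = W z + mu + eps, with z ~ N(0, I) and eps ~ N(0, diag(psi_diag))."""
    d, k = W.shape
    z = rng.standard_normal((n, k))
    eps = rng.standard_normal((n, d)) * np.sqrt(psi_diag)
    return z @ W.T + mu + eps, z

def posterior_mean(x, W, mu, psi_diag):
    """E[z | x] = (I + W^T Psi^{-1} W)^{-1} W^T Psi^{-1} (x - mu), applied to each row of x."""
    k = W.shape[1]
    Wt_Pinv = W.T / psi_diag                 # W^T Psi^{-1}, shape (k, d)
    M = np.eye(k) + Wt_Pinv @ W              # posterior precision of z
    return np.linalg.solve(M, Wt_Pinv @ (x - mu).T).T

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))
mu = rng.standard_normal(5)
psi_diag = np.full(5, 0.1)
x, z = sample_factor_model(W, mu, psi_diag, n=1000, rng=rng)
z_hat = posterior_mean(x, W, mu, psi_diag)   # one latent vector per observation
```

The same posterior mean can be written as `W.T @ inv(W @ W.T + Psi) @ (x - mu)`; the form used above only inverts a `k × k` matrix, which is cheaper when the latent dimension is small.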
2.2 Inter-Battery Factor Analysis
Inter-Battery Factor Analysis (IBFA) (Tucker, 1958; Browne, 1979) is an extension of factor analysis that adapts it to two sets of variables (i.e. embeddings of two languages). In this setting it is assumed that pairs of observations are generated by a shared latent variable:
[TABLE]
As in traditional factor analysis, we seek to estimate the parameters that maximise the marginal likelihood
[TABLE]
where the joint marginal is a Gaussian with the form
[TABLE]
and $A \succ 0$ means that the matrix $A$ is positive definite.
Maximising the likelihood as in Equation 2 will find the optimal parameters for the generative process described in Figure 1, where one latent variable is responsible for generating each pair of observations. This makes it a suitable objective for aligning the vector spaces of the two languages in the latent space. In contrast to the discriminative directed methods of Mikolov et al. (2013a), Smith et al. (2017) and Xing et al. (2015), IBFA has the capacity to model noise.
We can re-interpret the logarithm of Equation 2 (as shown in Appendix D) as
[TABLE]
The exact expression for each term is given in the same appendix. This interpretation shows that, for each pair of points, the objective is to minimise the reconstruction errors of both views given the projection of one view into the latent space. By utilising the symmetry of Equation 2, we can show that the converse is true as well: maximising the joint probability also minimises the reconstruction loss given the latent projection of the other view. Thus, this forces the latent embeddings of the two views to be close in the latent space. This provides intuition as to why embedding into this common latent space is a good alignment procedure.
Browne (1979) and Bach and Jordan (2005) show that the maximum likelihood estimates of the parameters can be attained in closed form
[TABLE]
where
[TABLE]
The projections into the latent space from each view are given by (as proved in Appendix B)
[TABLE]
Evaluated at the MLE, Bach and Jordan (2005) show that Equation 4 can be reduced to
[TABLE]
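The closed-form fit and the resulting projections can be sketched as follows. This is a hedged illustration that follows the maximum likelihood solution of Bach and Jordan (2005) for a two-view linear-Gaussian model, with the rotation chosen so that both views share the factor `P^{1/2}`. All names (`fit_ibfa`, `project`, `Ux`, `P_half`, ...) are assumptions made here, and the synthetic data stands in for paired dictionary embeddings.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs / np.sqrt(vals)) @ vecs.T

def fit_ibfa(X, Y, k):
    """Closed-form ML fit for paired rows X (n, dx) and Y (n, dy) with k latent dimensions."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    # SVD of the whitened cross-covariance gives the canonical correlations p.
    Vx, p, Vyt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    Ux, Uy = Sxx_is @ Vx[:, :k], Syy_is @ Vyt[:k].T    # canonical directions
    P_half = np.sqrt(p[:k])
    Wx, Wy = (Sxx @ Ux) * P_half, (Syy @ Uy) * P_half  # loadings S U P^{1/2}
    Psi_x, Psi_y = Sxx - Wx @ Wx.T, Syy - Wy @ Wy.T    # residual noise covariances
    return dict(mu_x=mu_x, mu_y=mu_y, Ux=Ux, Uy=Uy, P_half=P_half,
                Wx=Wx, Wy=Wy, Psi_x=Psi_x, Psi_y=Psi_y)

def project(V, mu, U, P_half):
    """Posterior mean of the latent at the MLE: P^{1/2} U^T (v - mu) for each row v."""
    return ((V - mu) @ U) * P_half

# Synthetic stand-in for a bilingual dictionary: two noisy views of a shared latent.
rng = np.random.default_rng(0)
z = rng.standard_normal((2000, 2))
X = z @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((2000, 6))
Y = z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((2000, 5))
m = fit_ibfa(X, Y, k=2)
Zx = project(X, m["mu_x"], m["Ux"], m["P_half"])
Zy = project(Y, m["mu_y"], m["Uy"], m["P_half"])
```

Once fitted, words outside the dictionary are aligned by passing their embeddings through `project` with the parameters of their own language; retrieval then happens between `Zx` and `Zy`.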
2.2.1 Multiple-Battery Factor Analysis
Multiple-Battery Factor Analysis (MBFA) (McDonald, 1970; Browne, 1980) is a natural extension of IBFA that models more than two views of observables (i.e. multiple languages), as shown in Figure 2.
Formally, for a set of views, we can write the model as
[TABLE]
As in IBFA, the projections to the latent space are given by Equation 4, and the marginal has a very similar form
[TABLE]
Unlike IBFA, a closed form solution for maximising the marginal likelihood of MBFA is unknown. Because of this, we have to resort to iterative approaches as in Browne (1980) such as the natural extension of the EM algorithm proposed by Bach and Jordan (2005). Defining
[TABLE]
the EM updates are given by
[TABLE]
where the updates are expressed in terms of the sample covariance matrix of the concatenated views (derivation provided in Appendix E). Browne (1980) shows that, under suitable conditions, the MLE of the parameters of MBFA is uniquely identifiable (up to a rotation that does not affect the method’s performance). We observed this in an empirical study: the solutions we converge to are always a rotation away from each other, irrespective of the parameters’ initialisation. This strongly suggests that any optimum is a global optimum, and thus we restrict ourselves to reporting only the results we observed when fitting from a single initialisation. The chosen initialisation point is provided as Equation (3.25) of Browne (1980).
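A minimal sketch of EM for a multi-view factor model with block-diagonal noise covariance is given below. It uses the standard factor analysis updates written in terms of the sample covariance `S`, with the noise update restricted to the per-view diagonal blocks. The function names, the random initialisation (the text above uses the initialisation of Browne (1980) instead) and the synthetic three-view data are all assumptions made for illustration.

```python
import numpy as np

def block_diag_mask(dims):
    """Boolean mask selecting the per-view diagonal blocks of a (D, D) matrix."""
    D = sum(dims)
    mask = np.zeros((D, D), dtype=bool)
    start = 0
    for d in dims:
        mask[start:start + d, start:start + d] = True
        start += d
    return mask

def neg_log_lik(S, W, Psi, n):
    """Negative Gaussian log-likelihood of centred data with sample covariance S."""
    Sigma = W @ W.T + Psi
    _, logdet = np.linalg.slogdet(Sigma)
    D = S.shape[0]
    return 0.5 * n * (D * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(Sigma, S)))

def em_mbfa(views, k, n_iter=200, seed=0):
    """EM for a shared-latent model over several views with block-diagonal noise."""
    X = np.hstack(views)
    n = X.shape[0]
    Xc = X - X.mean(0)
    S = Xc.T @ Xc / n
    mask = block_diag_mask([v.shape[1] for v in views])
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((S.shape[0], k))    # random start (assumption)
    Psi = np.where(mask, S, 0.0)
    history = []
    for _ in range(n_iter):
        # E-step: B maps a centred observation to the posterior mean of z.
        B = np.linalg.solve(W @ W.T + Psi, W).T       # shape (k, D)
        M = np.eye(k) - B @ W                         # posterior covariance of z
        # M-step: update the loadings, then the block-diagonal noise covariance.
        W = S @ B.T @ np.linalg.inv(M + B @ S @ B.T)
        Psi = np.where(mask, S - W @ B @ S, 0.0)
        history.append(neg_log_lik(S, W, Psi, n))
    return W, Psi, history

rng = np.random.default_rng(1)
z = rng.standard_normal((500, 2))
views = [z @ rng.standard_normal((2, d)) + 0.3 * rng.standard_normal((500, d))
         for d in (4, 3, 5)]
W, Psi, history = em_mbfa(views, k=2, n_iter=100)
```

Tracking `history` gives a learning curve of the kind discussed in Appendix G; each EM iteration should not increase the negative log-likelihood.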
3 Multilingual Factor Analysis
We coin the term Multilingual Factor Analysis for the application of methods based on IBFA and MBFA to model the generation of multilingual tuples from a shared latent space. We motivate our generative process with the compound model for language association presented by Weinreich (1953). In this model a lexical meaning entity (a concept) is responsible for associating the corresponding words in the two different languages.
We note that the structure in Figure 3 is very similar to our graphical model for IBFA specified in Figure 1. We can interpret our latent variable as the latent lexical concept responsible for associating (generating) the multilingual language pairs. Most theories that explain the interconnections between languages in the bilingual mind assume that “while phonological and morphosyntactic forms differ across languages, meanings and/or concepts are largely, if not completely, shared” (Pavlenko, 2009). This shows that our generative modelling is supported by established models of language interconnectedness in the bilingual mind.
Intuitively, our approach can be summarised as transforming monolingual representations by mapping them to a concept space in which lexical meaning across languages is aligned and then performing retrieval, translation and similarity-based tasks in that aligned concept space.
3.1 Comparison to Direct Methods
Methods that learn a direct linear transformation from the source space to the target space, such as Mikolov et al. (2013a); Artetxe et al. (2016); Smith et al. (2017); Lample et al. (2018), could also be interpreted as maximising the conditional likelihood
[TABLE]
As shown in Appendix F, the maximum likelihood estimate of the transformation does not depend on the noise term. In addition, even if one were to fit the noise term, it is not clear how to utilise it to make predictions, as the conditional expectation
[TABLE]
does not depend on the noise parameters. As this method is therefore not robust to noise, previous work has used extensive regularisation (i.e. by making the transformation orthogonal) to avoid overfitting.
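The independence from the noise term can be checked numerically. The sketch below assumes a zero-mean directed model in which each target vector is a linear map of its source vector plus Gaussian noise with covariance `Psi`; the names `W_ols`, `Psi` and `grad_wrt_W` are introduced here for illustration. It fits the map by ordinary least squares and evaluates the likelihood gradient at that solution under two different noise covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 500, 4, 3
X = rng.standard_normal((n, dx))
Y = X @ rng.standard_normal((dx, dy)) + 0.5 * rng.standard_normal((n, dy))

# Least-squares map y ~ W x, with W of shape (dy, dx).
W_ols = np.linalg.lstsq(X, Y, rcond=None)[0].T

def grad_wrt_W(W, Psi):
    """Gradient of sum_k log N(y_k; W x_k, Psi) with respect to W."""
    residual = Y - X @ W.T                        # shape (n, dy)
    return np.linalg.solve(Psi, residual.T @ X)   # Psi^{-1} sum_k (y_k - W x_k) x_k^T

A = rng.standard_normal((dy, dy))
noise_covariances = [np.eye(dy), A @ A.T + np.eye(dy)]
grads = [np.abs(grad_wrt_W(W_ols, Psi)).max() for Psi in noise_covariances]
```

Both entries of `grads` vanish up to floating-point error, so the same `W_ols` is a stationary point of the likelihood whichever noise covariance is assumed.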
3.2 Relation to CCA
CCA is a popular method used for multilingual alignment which is very closely related to IBFA, as detailed in Bach and Jordan (2005). Barber (2012) shows that CCA can be recovered as a limiting case of IBFA with the noise covariance constrained to be diagonal, in the limit where the noise variance tends to zero. CCA therefore assumes that the emissions from the latent space to the observables are deterministic. This is a strong and unrealistic assumption given that word embeddings are learned from noisy corpora and stochastic learning algorithms.
4 Experiments
In this section, we empirically demonstrate the effectiveness of our generative approach on several benchmarks, and compare it with state-of-the-art methods. We first present cross-lingual (word-translation) evaluation tasks to evaluate the quality of our multi-lingual word embeddings. As a follow-up to the word retrieval task we also run experiments on cross-lingual sentence retrieval tasks. We further demonstrate the quality of our multi-lingual word embeddings on monolingual word- and sentence-level similarity tasks from Faruqui and Dyer (2014b), which we believe provides empirical evidence that the aligned embeddings preserve and even potentially enhance their monolingual quality.
4.1 Word Translation
This task is concerned with the problem of retrieving the translation of a given set of source words. We reproduce results in the same environment as Lample et al. (2018) for a fair comparison (github.com/Babylonpartners/MultilingualFactorAnalysis, based on github.com/facebookresearch/MUSE). We perform an ablation study to assess the effectiveness of our method in the Italian to English (it-en) setting of Smith et al. (2017) and Dinu et al. (2014). In these experiments we are interested in studying the effectiveness of our method compared to that of the Procrustes-based fitting used in Smith et al. (2017) without any post-processing steps to address the hubness problem (Dinu et al., 2014). In Table 1 we observe that our model is competitive with the results in Lample et al. (2018) and outperforms them in most cases. Given an expert dictionary, our method performs the best out of all compared methods on all tasks, except English to Russian (en-ru) translation, where it remains competitive. Surprisingly, in the semi-supervised setting, IBFA narrows the gap to the method proposed in Lample et al. (2018) on languages where the dictionary of identical tokens across languages (i.e. the pseudo-dictionary from Smith et al. (2017)) is richer. However, even though it significantly outperforms SVD using the pseudo-dictionary, it cannot match the performance of the adversarial approach for more distant languages like English and Chinese (en-zh).
4.1.1 Detailed Comparison to Basic SVD
We present a more detailed comparison to the SVD method described in Smith et al. (2017). We focus on methods in their base form, that is, without post-processing techniques such as cross-domain similarity local scaling (CSLS; Lample et al., 2018) or inverted softmax (ISF; Smith et al., 2017). Note that Smith et al. (2017) used the implementation of CCA in scikit-learn (a commonly used Python library for scientific computing; Pedregosa et al., 2011), which uses an iterative estimation of partial least squares. This does not give the same results as the standard CCA procedure. In Table 2 we reproduce the results from Smith et al. (2017) using the dictionaries and embeddings provided by Dinu et al. (2014) (https://zenodo.org/record/2654864), and we compare our method (IBFA) using both the expert dictionaries from Dinu et al. (2014) and the pseudo-dictionaries as constructed in Smith et al. (2017). We significantly outperform both SVD and CCA, especially when using the pseudo-dictionaries.
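For readers unfamiliar with the retrieval criteria, the following is a small sketch of CSLS scoring and precision@k over already-aligned embeddings, following the definition in Lample et al. (2018): twice the cosine similarity minus the mean similarity of each vector to its `k` nearest neighbours in the other language. The function names and the toy identity-matrix embeddings are assumptions for illustration; plain nearest neighbour retrieval corresponds to ranking by the cosine matrix alone.

```python
import numpy as np

def normalise(E):
    """Scale each row to unit length."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def csls_scores(src, tgt, k=10):
    """CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y) for all source/target pairs."""
    cos = normalise(src) @ normalise(tgt).T              # shape (n_src, n_tgt)
    k_t = min(k, cos.shape[1])
    k_s = min(k, cos.shape[0])
    # Mean similarity of each source vector to its k nearest targets.
    r_t = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # Mean similarity of each target vector to its k nearest sources.
    r_s = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    return 2 * cos - r_t[:, None] - r_s[None, :]

def precision_at_k(scores, gold, k=1):
    """Fraction of queries whose gold target index is among the k best-scoring targets."""
    top = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, top)]))

# Toy check: five orthogonal "words" that are perfectly aligned across languages.
E = np.eye(5)
scores = csls_scores(E, E, k=2)
p_at_1 = precision_at_k(scores, gold=np.arange(5), k=1)
```

The penalty terms lower the score of "hub" vectors that are close to many points, which is the hubness problem that CSLS is designed to mitigate.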
4.2 Word Similarity Tasks
This task assesses the monolingual quality of word embeddings. In this experiment, we fit both considered methods (CCA and IBFA) on the entire available dictionary of around 100k word pairs. We compare to CCA as used in Faruqui and Dyer (2014b) and standard monolingual word embeddings on the available tasks from Faruqui and Dyer (2014b). We evaluate our multilingual embeddings on the following tasks: WS353 (Finkelstein et al., 2002); WS-SIM and WS-REL (Agirre et al., 2009); RG65 (Rubenstein and Goodenough, 1965); MC-30 (Miller and Charles, 1991); MT-287 (Radinsky et al., 2011); MT-771 (Halawi et al., 2012); and MEN-TR (Bruni et al., 2012). These tasks consist of English word pairs that have been assigned ground truth similarity scores by humans. We use the test suite provided by Faruqui and Dyer (2014a) (https://github.com/mfaruqui/eval-word-vectors) to evaluate our multilingual embeddings on these datasets. This test suite calculates the similarity of words through cosine similarity in their representation spaces and then reports the Spearman correlation with the ground truth similarity scores provided by humans.
As shown in Table 3, we observe a performance gain over CCA and monolingual word embeddings suggesting that we not only preserve the monolingual quality of the embeddings but also enhance it.
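The evaluation protocol described above (cosine similarity per word pair, then Spearman correlation against human scores) can be sketched in a few lines. This is not the referenced test suite; the function names, the choice to skip out-of-vocabulary pairs, and the toy vectors and scores are assumptions made for illustration.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def evaluate_similarity(embeddings, pairs):
    """embeddings: dict word -> vector; pairs: list of (word1, word2, human_score)."""
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:   # skip out-of-vocabulary pairs
            u, v = embeddings[w1], embeddings[w2]
            model.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
            human.append(score)
    return spearman(model, human)

# Toy vectors and made-up human scores, for illustration only.
toy = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.9, 0.1]),
       "car": np.array([0.0, 1.0]), "bus": np.array([0.2, 0.8])}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.0), ("cat", "car", 1.0), ("dog", "bus", 3.0)]
rho = evaluate_similarity(toy, pairs)
```

Because Spearman correlation depends only on rank order, any monotone rescaling of the cosine similarities leaves the score unchanged.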
4.3 Monolingual Sentence Similarity Tasks
Semantic Textual Similarity (STS) is a standard benchmark used to assess sentence similarity metrics (Agirre et al., 2012, 2013, 2014, 2015, 2016). In this work, we use it to show that our alignment procedure does not degrade the quality of the embeddings at the sentence level. For both IBFA and CCA, we align English and one other language (from French, Spanish, German) using the entire dictionaries (of about 100k word pairs each) provided by Lample et al. (2018). We then use the procedure defined in Arora et al. (2016) to create sentence embeddings and use cosine similarity to output sentence similarity using those embeddings. The method’s performance on each set of embeddings is assessed using Spearman correlation to human-produced expert similarity scores. As evidenced by the results shown in Table 4, IBFA remains competitive using any of the three languages considered, while CCA shows a performance decrease.
4.4 Crosslingual Sentence Similarity Tasks
Europarl (Koehn, 2005) is a parallel corpus of sentences taken from the proceedings of the European parliament. In this set of experiments, we focus on its English-Italian (en-it) sub-corpus, in order to compare to previous methods. We report results under the framework of Lample et al. (2018). That is, we form sentence embeddings using the average of the tf-idf weighted word embeddings in the bag-of-words representation of the sentence. Performance is averaged over 2,000 randomly chosen source sentence queries and 200k target sentences for each language pair. Note that this is a different set up to the one presented in Smith et al. (2017), in which an unweighted average is used. The results are reported in Table 5. As we can see, IBFA outperforms all prior methods both using nearest neighbour retrieval, where it has a gain of 20 percent absolute on SVD, as well as using the CSLS retrieval metric.
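One plausible reading of the sentence embedding described above is an idf-weighted average over the tokens of a sentence, where repeated tokens contribute once per occurrence. The sketch below implements that reading; the exact weighting used in the evaluation framework may differ, and the function names and toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np

def idf_weights(corpus):
    """idf(w) = log(N / df(w)) over a list of tokenised sentences."""
    n_docs = len(corpus)
    df = Counter(w for sent in corpus for w in set(sent))
    return {w: math.log(n_docs / c) for w, c in df.items()}

def sentence_embedding(tokens, vectors, idf):
    """idf-weighted average of the vectors of the in-vocabulary tokens."""
    known = [w for w in tokens if w in vectors and w in idf]
    if not known:
        # No usable token: fall back to the zero vector.
        return np.zeros(len(next(iter(vectors.values()))))
    return np.mean([idf[w] * vectors[w] for w in known], axis=0)

# Toy corpus and vectors, for illustration only.
corpus = [["the", "cat"], ["the", "dog"]]
vectors = {"the": np.array([1.0, 1.0]), "cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
idf = idf_weights(corpus)
emb = sentence_embedding(["the", "cat"], vectors, idf)
```

In this toy corpus "the" occurs in every sentence, so its idf is zero and it contributes nothing to the sentence vector, which is the intended effect of the weighting.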
4.5 Alignment of three languages
In an ideal scenario, when we have many languages, we would not want to train a transformation between each pair, as the number of matrices to store would grow quadratically with the number of languages. One way to overcome this problem is by aligning all embeddings to a common space. In this exploratory experiment, we constrain ourselves to aligning three languages at the same time, but the same methodology could be applied to an arbitrary number of languages. MBFA, the extension of IBFA described in Section 2.2.1, naturally lends itself to this task. What is needed for training this method is a dictionary of word triples across the three languages considered. We construct such a dictionary by taking the intersection of all 6 pairs of bilingual dictionaries for the three languages provided by Lample et al. (2018). We then train MBFA for 20,000 iterations of EM (a brief analysis of convergence is provided in Appendix G). Alternatively, with direct methods like Smith et al. (2017); Lample et al. (2018) one could align all languages to English and treat that as the common space.
We compare both approaches and present their results in Table 6. As we can see, both methods experience a decrease in overall performance when compared to models fitted on just a pair of languages; however, MBFA performs better overall. That is, the direct approaches preserve their performance on translation to and from English, but translation from French to Italian decreases significantly. Meanwhile, MBFA suffers a decrease in each pair of languages, but it remains competitive with the direct methods on English translation. It is worth noting that as the number of aligned languages increases, the number of language pairs that include English grows only linearly, whereas the number of pairs in which English does not participate grows quadratically. This suggests that MBFA may generalise past three simultaneously aligned languages better than the direct methods.
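The construction of the triple dictionary can be sketched as follows. The interpretation used here, that a triple is kept only if all six directed bilingual dictionaries contain the corresponding pair, is an assumption about what "intersection" means in practice; the function name, the data layout and the toy entries are likewise illustrative.

```python
def build_triples(dicts, langs=("en", "fr", "it")):
    """dicts maps (src_lang, tgt_lang) -> set of (src_word, tgt_word), for all 6 directions."""
    a, b, c = langs
    triples = set()
    for wa, wb in dicts[(a, b)]:
        # Candidate third-language words are the translations of wa into language c.
        for wc in {t for s, t in dicts[(a, c)] if s == wa}:
            candidate = {a: wa, b: wb, c: wc}
            # Keep the triple only if every directed pair is attested.
            if all((candidate[s], candidate[t]) in dicts[(s, t)] for (s, t) in dicts):
                triples.add((wa, wb, wc))
    return triples

# Toy dictionaries, for illustration only.
toy_dicts = {
    ("en", "fr"): {("cat", "chat"), ("dog", "chien")},
    ("fr", "en"): {("chat", "cat"), ("chien", "dog")},
    ("en", "it"): {("cat", "gatto"), ("dog", "cane")},
    ("it", "en"): {("gatto", "cat"), ("cane", "dog")},
    ("fr", "it"): {("chat", "gatto")},
    ("it", "fr"): {("gatto", "chat")},
}
triples = build_triples(toy_dicts)
```

In the toy example the triple for "dog" is dropped because the French-Italian dictionaries do not contain the pair, which illustrates why the triple dictionary is smaller than any of the bilingual ones.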
4.6 Generating Random Word Pairs
We explore the generative process of IBFA by synthesising word pairs from noise, using a trained English-Spanish IBFA model. We follow the generative process specified in Equation 1 to generate 2,000 word vector pairs and then we find the nearest neighbour vector in each vocabulary and display the corresponding words. We then rank these 2,000 pairs according to their joint probability under the model and present the top 28 samples in Table 7. Note that whilst the sampled pairs are not exact translations, they have closely related meanings. The examples we found interesting are dreadful and despair; frightening and brutality; crazed and merry; unrealistic and questioning; misguided and conceal; reactionary and conservatism.
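The sampling procedure can be sketched as ancestral sampling from a fitted two-view model, followed by a nearest-neighbour lookup and a ranking by joint log-probability. The parameter names (`Wx`, `Wy`, `mu_x`, `mu_y`, `Psi_x`, `Psi_y`), the function names and the random toy parameters are assumptions made here; whether the ranking is applied to the sampled vectors or to their nearest neighbours is not stated above, so the sketch scores the sampled vectors.

```python
import numpy as np

def sample_pairs(Wx, Wy, mu_x, mu_y, Psi_x, Psi_y, n, rng):
    """Draw z ~ N(0, I), then x ~ N(Wx z + mu_x, Psi_x) and y ~ N(Wy z + mu_y, Psi_y)."""
    k = Wx.shape[1]
    z = rng.standard_normal((n, k))
    x = z @ Wx.T + rng.multivariate_normal(mu_x, Psi_x, size=n)
    y = z @ Wy.T + rng.multivariate_normal(mu_y, Psi_y, size=n)
    return x, y

def nearest_words(samples, vocab_vectors, vocab_words):
    """Map each sampled vector to the vocabulary word with the highest cosine similarity."""
    a = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    b = vocab_vectors / np.linalg.norm(vocab_vectors, axis=1, keepdims=True)
    return [vocab_words[i] for i in (a @ b.T).argmax(axis=1)]

def joint_log_prob(x, y, Wx, Wy, mu_x, mu_y, Psi_x, Psi_y):
    """Row-wise log N([x; y]; [mu_x; mu_y], W W^T + blockdiag(Psi_x, Psi_y))."""
    W = np.vstack([Wx, Wy])
    dx, dy = len(mu_x), len(mu_y)
    Sigma = W @ W.T
    Sigma[:dx, :dx] += Psi_x
    Sigma[dx:, dx:] += Psi_y
    r = np.hstack([x - mu_x, y - mu_y])
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ij,ij->i", r, np.linalg.solve(Sigma, r.T).T)
    return -0.5 * (maha + logdet + (dx + dy) * np.log(2 * np.pi))

# Toy parameters standing in for a trained model.
rng = np.random.default_rng(0)
Wx, Wy = rng.standard_normal((4, 2)), rng.standard_normal((3, 2))
mu_x, mu_y = np.zeros(4), np.zeros(3)
Psi_x, Psi_y = 0.1 * np.eye(4), 0.1 * np.eye(3)
x, y = sample_pairs(Wx, Wy, mu_x, mu_y, Psi_x, Psi_y, n=5, rng=rng)
scores = joint_log_prob(x, y, Wx, Wy, mu_x, mu_y, Psi_x, Psi_y)
ranking = np.argsort(-scores)   # most probable sampled pairs first
```

With real embeddings, `nearest_words` would be called once per language with that language's vocabulary matrix, and the top entries of `ranking` would select the pairs to display.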
5 Conclusion
We have introduced a cross-lingual embedding alignment procedure based on a probabilistic latent variable model that increases performance across various tasks compared to previous methods, using both nearest neighbour retrieval and the CSLS criterion. We have shown that the resulting embeddings in this aligned space preserve their quality by presenting results on tasks that assess word and sentence-level monolingual similarity correlation with human scores. The resulting embeddings also significantly increase the precision of sentence retrieval in multilingual settings. Finally, the preliminary results we have shown on aligning more than two languages at the same time provide an exciting path for future research.
Appendix A Joint Distribution
We show the form of the joint distribution for 2 views. Concatenating our data and parameters as below, we can use Equation (3) of Ghahramani et al. (1996) to write
[TABLE]
It is clear that this generalises to any number of views of any dimension, as the concatenation operation does not make any assumptions.
Appendix B Projections to Latent Space
We can query the joint Gaussian in Equation 18 using the rules in Sections 8.1.2 and 8.1.3 of Petersen et al. (2008), and we get
[TABLE]
Appendix C Derivation for the Marginal Likelihood
We want to compute the marginal likelihood so that we can then learn the parameters by maximising it, as is done in factor analysis.
From the joint distribution, again using the rules in Section 8.1.2 of Petersen et al. (2008), we get
[TABLE]
For the case of two views, the joint probability can be factored as
[TABLE]
where
[TABLE]
Appendix D Scaled Reconstruction Errors
[TABLE]
Setting , we can re-parametrise as
[TABLE]
Appendix E Expectation Maximisation for MBFA
Define
[TABLE]
[TABLE]
Hence
[TABLE]
This follows the same form as regular factor analysis, but with a block-diagonal constraint on . Thus by Equations (5) and (6) of Ghahramani et al. (1996), we apply EM as follows.
E-Step: Compute the posterior moments of the latent variable given the current parameters.
[TABLE]
where
[TABLE]
Equation 26 is obtained by applying the Woodbury identity, and Equation 27 by applying the closely related push-through identity, as found in Section 3.2 of Petersen et al. (2008).
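Both identities named above can be verified numerically. The sketch below states them in notation assumed here (`W` for the loading matrix, `Psi` for a symmetric positive definite noise covariance, `M` for the inverse of `I + W^T Psi^{-1} W`); which of the two corresponds to which numbered equation is not asserted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2
W = rng.standard_normal((d, k))
A = rng.standard_normal((d, d))
Psi = A @ A.T + np.eye(d)            # any symmetric positive definite matrix
Psi_inv = np.linalg.inv(Psi)
M = np.linalg.inv(np.eye(k) + W.T @ Psi_inv @ W)

# Woodbury: (Psi + W W^T)^{-1} = Psi^{-1} - Psi^{-1} W M W^T Psi^{-1}
lhs_woodbury = np.linalg.inv(Psi + W @ W.T)
rhs_woodbury = Psi_inv - Psi_inv @ W @ M @ W.T @ Psi_inv

# Push-through: W^T (Psi + W W^T)^{-1} = M W^T Psi^{-1}
lhs_push = W.T @ lhs_woodbury
rhs_push = M @ W.T @ Psi_inv
```

The practical benefit is that the right-hand sides only require inverting a `k × k` matrix (plus `Psi`, which is block diagonal in this model) rather than the full `d × d` covariance.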
M-Step: Update the parameters.
Define
[TABLE]
By first observing
[TABLE]
update the parameters as follows.
[TABLE]
Imposing the block diagonal constraint,
[TABLE]
where .
Appendix F Independence to Noise in Direct Methods
We are maximising the following quantity with respect to the transformation matrix
[TABLE]
Then the partial derivative is proportional to
[TABLE]
The maximum likelihood is achieved when
[TABLE]
and since the inverse noise covariance is itself invertible (its inverse being the noise covariance), this means that
[TABLE]
It is clear from here that the MLE of the transformation does not depend on the noise covariance; thus we can conclude that adding a noise parameter to this directed linear model has no effect on its predictions.
Appendix G Learning curve of EM
Figure 4 shows the negative log-likelihood of the three language model over the first 5,000 iterations. The precision of the learned model is very close when evaluated at iteration 1,000 and at iteration 20,000 as seen in Table 8. This suggests that the model need not be trained to full convergence to work well.
References
- Agirre et al. (2009) Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. Association for Computational Linguistics.
- Agirre et al. (2015) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263.
- Agirre et al. (2014) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91.
- Agirre et al. (2016) Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511.
- Agirre et al. (2013) Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, volume 1, pages 32–43.
- Agirre et al. (2012) Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393. Association for Computational Linguistics.
- Ammar et al. (2016) Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
- Arora et al. (2016) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations, 2017.
