Multilingual Factor Analysis

Francisco Vargas, Kamen Brestnichki, Alex Papadopoulos-Korfiatis, and Nils Hammerla

arXiv:1905.05547·cs.LG·October 25, 2019

TL;DR

This paper introduces a generative latent variable model for learning multilingual word representations from dictionaries, enabling robust alignment across languages and performing well despite noisy data.

Contribution

It presents a novel offline approach using a generative model to align multilingual embeddings, improving robustness and alignment quality.

Findings

01. Achieves competitive results on multilingual tasks

02. Robust to noise in the embedding space

03. Effective for distributed representations learned from noisy corpora

Abstract

In this work we approach the task of learning multilingual word representations in an offline manner by fitting a generative latent variable model to a multilingual dictionary. We model equivalent words in different languages as different views of the same word generated by a common latent variable representing their latent lexical meaning. We explore the task of alignment by querying the fitted model for multilingual embeddings achieving competitive results across a variety of tasks. The proposed model is robust to noise in the embedding space making it a suitable method for distributed representations learned from noisy corpora.

Tables

Table 1: Precision @1 for cross-lingual word similarity tasks. Rows labelled AdvR are copies of the Adversarial - Refine rows in Lample et al. (2018). Results marked with a * differ from the ones shown in Lample et al. (2018) due to pre-processing done on their part. SVD and IBFA in the semi-supervised setting use the pseudo-dictionary, while AdvR uses frequency information. CSLS is the post-processing technique proposed in Lample et al. (2018).

Method	en-es	es-en	en-fr	fr-en	en-de	de-en	en-ru	ru-en	en-zh	zh-en
Supervised
SVD	77.4	77.3	74.9	76.1	68.4	67.7	47.0	58.2	27.3*	09.3*
IBFA	79.5	81.5	77.3	79.5	70.7	72.1	46.7	61.3	42.9	36.9
SVD+CSLS	81.4	82.9	81.1	82.4	73.5	72.4	51.7	63.7	32.5*	25.1*
IBFA+CSLS	81.7	84.1	81.9	83.4	74.1	75.7	50.5	66.3	48.4	41.7
Semi-supervised
SVD	65.9	74.1	71.0	72.7	60.3	65.3	11.4	37.7	06.8	00.8
IBFA	76.1	80.1	77.1	78.9	66.8	71.8	23.1	39.9	17.1	24.0
AdvR	79.1	78.1	78.1	78.2	71.3	69.6	37.3	54.3	30.9	21.9
SVD+CSLS	73.0	80.7	75.7	79.6	65.3	70.8	20.9	41.5	10.5	01.7
IBFA+CSLS	76.5	83.7	78.6	82.3	68.7	73.7	25.3	46.3	22.1	27.2
AdvR+CSLS	81.7	83.3	82.3	82.1	74.0	72.2	44.0	59.1	32.5	31.4

Table 2: Comparisons without post-processing of methods. Results reproduced from Smith et al. (2017) for fair comparison. Left: Comparisons using the same expert dictionary as Smith et al. (2017). Right: Comparisons using the pseudo-dictionary from Smith et al. (2017).

	Expert dictionary, English to Italian			Expert dictionary, Italian to English			Pseudo-dictionary, English to Italian			Pseudo-dictionary, Italian to English
Method	@1	@5	@10	@1	@5	@10	@1	@5	@10	@1	@5	@10
Mikolov et al.	33.8	48.3	53.9	24.9	41.0	47.4	1.0	2.8	3.9	2.5	6.4	9.1
CCA (Sklearn)	36.1	52.7	58.1	31.0	49.9	57.0	29.1	46.4	53.0	27.0	47.0	52.3
CCA	30.9	48.1	52.7	27.7	45.5	51.0	26.5	42.5	48.1	22.8	40.1	45.5
SVD	36.9	52.7	57.9	32.2	49.6	55.7	27.1	43.4	49.3	26.2	42.1	49.0
IBFA (Ours)	39.3	55.3	60.1	34.7	53.5	59.4	34.7	52.6	58.3	33.7	53.3	59.2

Table 3: Spearman correlation for English word similarity tasks. The first row represents monolingual FastText vectors (Joulin et al., 2017) in English; the rest are bilingual embeddings.

Embeddings	WS	WS-SIM	WS-REL	RG-65	MC-30	MT-287	MT-771	MEN-TR
English	73.7	78.1	68.2	79.7	81.2	67.9	66.9	76.4
IBFA en-de	74.4	79.4	68.3	81.4	84.2	67.2	69.4	77.8
IBFA en-fr	72.4	77.8	65.8	80.5	83.0	68.2	69.6	77.6
IBFA en-es	73.6	78.5	67.0	79.0	83.0	68.2	69.4	77.3
CCA en-de	71.7	76.4	64.0	76.7	82.4	63.0	64.7	75.3
CCA en-fr	70.9	76.4	63.3	76.5	81.4	63.4	65.4	74.9
CCA en-es	70.8	76.3	63.1	76.4	81.2	63.0	65.1	74.7

Table 4: Spearman correlation for Semantic Textual Similarity (STS) tasks in English. All results use the sentence embeddings described in Arora et al. (2016). The first row represents monolingual FastText vectors (Joulin et al., 2017) in English; the rest are bilingual embeddings. *STS13 excludes the proprietary SMT dataset.

Embeddings	STS12	STS13*	STS14	STS15	STS16
English	58.1	69.2	66.7	72.6	70.6
IBFA en-de	58.1	70.2	66.8	73.0	71.6
IBFA en-fr	58.0	70.0	66.7	72.8	71.4
IBFA en-es	57.9	69.7	66.6	72.9	71.7
CCA en-de	56.7	67.5	65.7	73.1	70.5
CCA en-fr	56.7	67.9	65.9	72.8	70.8
CCA en-es	56.6	67.8	65.9	72.9	70.8

Table 5: Sentence translation precisions @1, @5, @10 on 2,000 English-Italian pairs sampled from a set of 200k sentences from Europarl (Koehn, 2005) on Dinu embeddings. AdvR is copied from Adversarial - Refined in Lample et al. (2018). Rows marked with ✓ are copied from Smith et al. (2017).

	English to Italian			Italian to English
Method	@1	@5	@10	@1	@5	@10
Mikolov et al. ✓	10.5	18.7	22.8	12.0	22.1	26.7
Dinu et al. ✓	45.3	72.4	80.7	48.9	71.3	78.3
Smith et al. ✓	54.6	72.7	78.2	42.9	62.2	69.2
SVD	40.5	52.6	56.9	51.2	63.7	67.9
IBFA (Ours)	62.7	74.2	77.9	64.1	75.2	79.5
SVD + CSLS	64.0	75.8	78.5	67.9	79.4	82.8
AdvR + CSLS	66.2	80.4	83.4	58.7	76.5	80.9
IBFA + CSLS	68.8	80.7	83.5	70.2	80.8	84.8

Table 6: Precision @1 when aligning English, French and Italian embeddings to a common space. For SVD, this common space is English, while for MBFA it is the latent space.

Method	en-it	it-en	en-fr	fr-en	it-fr	fr-it
SVD	71.0	72.4	74.9	76.1	78.3	72.9
MBFA	71.9	73.4	76.7	78.1	82.6	77.5
SVD+CSLS	76.2	77.9	81.1	82.4	84.5	79.8
MBFA+CSLS	77.4	77.7	81.9	82.1	86.8	81.9

Table 7: Random pairs sampled from the model; the top 28 ranked by confidence are shown. Proper nouns and acronyms (names and surnames) were removed from the list. The third column gives a correct translation from Spanish to English.

en	es	es $\to$ en
particular	efectivamente	effectively
correspondingly	esto	this
silly	irónicamente	ironic
frightening	brutalidad	brutality
manipulations	intencionadamente	intentionally
ignore	contraproducente	counter-productive
fundamentally	entendido	understood
embarrassed	enojado	angry
terrified	casualidad	coincidence
hypocritical	obviamente	obviously
wondered	incómodo	uncomfortable
oftentimes	apostar	betting
unwittingly	traicionar	betray
mishap	irónicamente	ironically
veritable	empero	however
overpowered	deshacerse	fall apart
crazed	divertidos	merry
frightening	ironía	irony
dreadful	desesperación	despair
instituting	restablecimiento	recover
unrealistic	cuestionamiento	questioning
regrettable	erróneos	mistaken
irresponsible	preocupaciones	concerns
obsession	irremediablemente	hopelessly
embodied	voluntad	will
misguided	esconder	conceal
perspective	contestación	answer
reactionary	conservadurismo	conservatism

Table 8: Precision @1 for MBFA fitted for 1K iterations and MBFA fitted for 20K iterations.

Method	EN-IT	IT-EN	EN-FR	FR-EN	IT-FR	FR-IT
MBFA-1K	71.9	73.3	76.7	78.2	82.4	77.5
MBFA-20K	71.9	73.4	76.7	78.1	82.6	77.5
MBFA-1K+CSLS	77.5	77.6	81.9	82.0	86.8	82.1
MBFA-20K+CSLS	77.4	77.7	81.9	82.1	86.8	81.9

Equations

$$\operatorname*{arg\,max}_{\{\bm{\Psi}_{i},{\bm{W}}_{i}\}}\;\prod_{k}p\big({\bm{x}}^{(k)},{\bm{y}}^{(k)}\,\big|\,\{\bm{\Psi}_{i},{\bm{W}}_{i}\}_{i}\big),\quad\text{subject to }\bm{\Psi}_{i}\succ\bm{0},\;({\bm{W}}_{i}^{\top}{\bm{W}}_{i})\succcurlyeq\bm{0},$$

$$\bm{\Sigma}_{ij}={\bm{W}}_{i}{\bm{W}}_{j}^{\top}+\delta_{ij}\bm{\Psi}_{i},$$

$$\sum_{k}\log p({\bm{x}}^{(k)},{\bm{y}}^{(k)}|\bm{\theta})=C+\sum_{k}\left(\mathcal{L}_{k}^{y|x}+\mathcal{L}_{k}^{x}\right),$$

$$\mathcal{L}_{k}^{y|x}=-\frac{1}{2}\left\|\tilde{{\bm{y}}}^{(k)}-{\bm{W}}_{y}\mathbb{E}_{p({\bm{z}}|{\bm{x}}^{(k)})}[{\bm{z}}]\right\|_{\bm{\Sigma}_{y|x}}^{2},$$

$$\mathcal{L}_{k}^{x}=-\frac{1}{2}\left\|\tilde{{\bm{x}}}^{(k)}-{\bm{W}}_{x}\mathbb{E}_{p({\bm{z}}|{\bm{x}}^{(k)})}[{\bm{z}}]\right\|_{\bm{\Psi}_{x}\bm{\Sigma}_{x}^{-1}\bm{\Psi}_{x}}^{2},$$

$$C=-\frac{N}{2}\left(\log|2\pi\bm{\Sigma}_{y|x}|+\log|2\pi\bm{\Sigma}_{x}|\right).$$

$$\mathbb{E}_{p({\bm{z}}|{\bm{x}})}[{\bm{z}}]={\bm{P}}^{1/2}{\bm{U}}_{x}^{\top}({\bm{x}}-\bm{\mu}_{x}).$$

$$\mathcal{N}\left(\begin{bmatrix}{\bm{x}}_{1}\\ \vdots\\ {\bm{x}}_{v}\end{bmatrix};\begin{bmatrix}\bm{\mu}_{1}\\ \vdots\\ \bm{\mu}_{v}\end{bmatrix},\begin{bmatrix}{\bm{W}}_{1}{\bm{W}}_{1}^{\top}+\bm{\Psi}_{1}&\dots&{\bm{W}}_{1}{\bm{W}}_{v}^{\top}\\ \vdots&\ddots&\vdots\\ {\bm{W}}_{v}{\bm{W}}_{1}^{\top}&\dots&{\bm{W}}_{v}{\bm{W}}_{v}^{\top}+\bm{\Psi}_{v}\end{bmatrix}\right).$$

$$\prod_{k}p({\bm{y}}^{(k)}|{\bm{x}}^{(k)})=\prod_{k}\mathcal{N}({\bm{y}}^{(k)};{\bm{W}}{\bm{x}}^{(k)}+\bm{\mu},\bm{\Psi}).$$

$$\mathbb{E}_{p({\bm{y}}|{\bm{x}}^{(k)})}[{\bm{y}}]={\bm{W}}{\bm{x}}^{(k)}+\bm{\mu},$$

$$=\mathcal{N}\left(\begin{bmatrix}\bm{x}\\ \bm{y}\end{bmatrix};\begin{bmatrix}\bm{\mu}_{x}\\ \bm{\mu}_{y}\end{bmatrix},\bm{W}\bm{W}^{T}+\bm{\Psi}\right)$$

Code & Models

Repositories

Babylonpartners/MultilingualFactorAnalysis
PyTorch · Official

Full text

Multilingual Factor Analysis

Francisco Vargas, Kamen Brestnichki, Alex Papadopoulos-Korfiatis

Nils Hammerla

Babylon Health

{firstname.lastname, alex.papadopoulos}@babylonhealth.com

1 Introduction

Popular approaches to the multilingual alignment of word embeddings build on the observation of Mikolov et al. (2013a) that continuous word embedding spaces (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017; Joulin et al., 2017) exhibit similar structures across languages. This observation has led to multiple successful methods in which a direct linear mapping between the two spaces is learned through a least-squares objective (Mikolov et al., 2013a; Smith et al., 2017; Xing et al., 2015) using a paired bilingual dictionary.

An alternative set of approaches based on Canonical Correlation Analysis (CCA) (Knapp, 1978) seeks to project monolingual embeddings into a shared multilingual space (Faruqui and Dyer, 2014b; Lu et al., 2015). Both of these methods aim to exploit the correlations between the monolingual vector spaces when projecting into the aligned multilingual space. The multilingual embeddings of Faruqui and Dyer (2014b) and Lu et al. (2015) are shown to improve on word-level semantic tasks, which supports the authors’ claim that multilingual information enhances semantic spaces.

In this paper we present a new non-iterative method based on variants of factor analysis (Browne, 1979; McDonald, 1970; Browne, 1980) for aligning monolingual representations into a multilingual space. Our generative model assumes that a single word translation pair is generated by an embedding representing the lexical meaning of the underlying concept. We achieve competitive results across a wide range of tasks compared to state-of-the-art methods, and we conjecture that our multilingual latent variable model has sound generative properties that match those of psycholinguistic theories of the bilingual mind (Weinreich, 1953). Furthermore, we show how our model extends to more than two languages within the generative framework. Previous alignment models are not naturally suited to this and instead resort to combining bilingual models with a pivot, as in Ammar et al. (2016).

Additionally, the general benefit of the probabilistic setup, as discussed in Tipping and Bishop (1999), is that it offers the potential to extend the scope of conventional alignment methods to model and exploit linguistic structure more accurately. An example of such a benefit could be modelling how corresponding word translations can be generated by more than just a single latent concept. This assumption can be encoded by a mixture of factor analysers (Ghahramani et al., 1996) to model word polysemy in a similar fashion to Athiwaratkun and Wilson (2017), where mixtures of Gaussians are used to reflect the different meanings of a word.

The main contribution of this work is the application of a well-studied graphical model to a novel domain, outperforming previous approaches on word and sentence-level translation retrieval tasks. We put the model through a battery of tests, showing it aligns embeddings across languages well, while retaining performance on monolingual word-level and sentence-level tasks. Finally, we apply a natural extension of this model to more languages in order to align three languages into a single common space.

2 Background

Previous work on the topic of embedding alignment has assumed that alignment is a directed procedure — i.e. we want to align French to English embeddings. However, another approach would be to align both to a common latent space that is not necessarily the same as either of the original spaces. This motivates applying a well-studied latent variable model to this problem.

2.1 Factor Analysis

Factor analysis (Spearman, 1904; Thurstone, 1931) is a technique originally developed in psychology to study how latent factors ${\bm{z}}\in\mathbb{R}^{k}$ explain the correlations among observed measurements ${\bm{x}}\in\mathbb{R}^{d}$ . Formally:

[TABLE]

In order to learn the parameters ${\bm{W}},\bm{\Psi}$ of the model we maximise the marginal likelihood $p({\bm{x}}|{\bm{W}},\bm{\Psi})$ with respect to ${\bm{W}},\bm{\Psi}$ . The maximum likelihood estimates of these parameters can then be used to obtain the latent representation $\mathbb{E}_{p({\bm{z}}|{\bm{x}})}[{\bm{z}}]$ of a given observation. Such projections have been shown to generalise principal component analysis (Pearson, 1901), as studied in Tipping and Bishop (1999).
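The latent projection can be illustrated with a minimal NumPy sketch. It assumes the standard factor analysis formulation (${\bm{z}}\sim\mathcal{N}(\bm{0},\mathbb{I})$, ${\bm{x}}|{\bm{z}}\sim\mathcal{N}({\bm{W}}{\bm{z}}+\bm{\mu},\bm{\Psi})$ with diagonal $\bm{\Psi}$), for which the posterior mean is ${\bm{W}}^{\top}({\bm{W}}{\bm{W}}^{\top}+\bm{\Psi})^{-1}({\bm{x}}-\bm{\mu})$. The parameter values, dimensions and the function name `posterior_mean` are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 6, 2, 5                            # observed dim, latent dim, samples (illustrative)

# Hypothetical model parameters; in practice these come from maximum likelihood.
W = rng.normal(size=(d, k))                  # factor loadings
mu = rng.normal(size=d)                      # mean of the observations
Psi = np.diag(rng.uniform(0.1, 0.5, size=d)) # diagonal noise covariance

# Generative process: z ~ N(0, I), x = W z + mu + noise.
Z = rng.normal(size=(n, k))
X = Z @ W.T + mu + rng.multivariate_normal(np.zeros(d), Psi, size=n)

def posterior_mean(X, W, mu, Psi):
    """E[z | x] = W^T (W W^T + Psi)^{-1} (x - mu), computed for every row of X."""
    Sigma = W @ W.T + Psi                    # marginal covariance of x
    return np.linalg.solve(Sigma, (X - mu).T).T @ W

Z_hat = posterior_mean(X, W, mu, Psi)
print(Z_hat.shape)
```

Each row of `Z_hat` is the latent representation of the corresponding observation; the noise covariance shrinks the projection relative to a plain least-squares inversion of ${\bm{W}}$.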

2.2 Inter-Battery Factor Analysis

Inter-Battery Factor Analysis (IBFA) (Tucker, 1958; Browne, 1979) is an extension of factor analysis that adapts it to two sets of variables ${\bm{x}}\in\mathbb{R}^{d},{\bm{y}}\in\mathbb{R}^{d^{\prime}}$ (e.g. embeddings of two languages). In this setting it is assumed that pairs of observations are generated by a shared latent variable ${\bm{z}}$ :

[TABLE]

As in traditional factor analysis, we seek to estimate the parameters that maximise the marginal likelihood

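$$\operatorname*{arg\,max}_{\{\bm{\Psi}_{i},{\bm{W}}_{i}\}}\;\prod_{k}p\big({\bm{x}}^{(k)},{\bm{y}}^{(k)}\,\big|\,\{\bm{\Psi}_{i},{\bm{W}}_{i}\}_{i}\big),\quad\text{subject to }\bm{\Psi}_{i}\succ\bm{0},\;({\bm{W}}_{i}^{\top}{\bm{W}}_{i})\succcurlyeq\bm{0},\tag{2}$$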

where the joint marginal $p({\bm{x}}_{k},{\bm{y}}_{k}|\{\bm{\Psi}_{i},{\bm{W}}_{i}\}_{i})$ is a Gaussian with the form

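$$\mathcal{N}\left(\begin{bmatrix}{\bm{x}}\\ {\bm{y}}\end{bmatrix};\begin{bmatrix}\bm{\mu}_{x}\\ \bm{\mu}_{y}\end{bmatrix},\begin{bmatrix}\bm{\Sigma}_{xx}&\bm{\Sigma}_{xy}\\ \bm{\Sigma}_{yx}&\bm{\Sigma}_{yy}\end{bmatrix}\right),\qquad\bm{\Sigma}_{ij}={\bm{W}}_{i}{\bm{W}}_{j}^{\top}+\delta_{ij}\bm{\Psi}_{i},$$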

and $\bm{\Psi}\succ\bm{0}$ means $\bm{\Psi}$ is positive definite.

Maximising the likelihood as in Equation 2 will find the optimal parameters for the generative process described in Figure 1, where one latent ${\bm{z}}$ is responsible for generating a pair ${\bm{x}},{\bm{y}}$ . This makes it a suitable objective for aligning the vector spaces of ${\bm{x}},\>{\bm{y}}$ in the latent space. In contrast to the discriminative directed methods of Mikolov et al. (2013a), Smith et al. (2017) and Xing et al. (2015), IBFA has the capacity to model noise.

We can re-interpret the logarithm of Equation 2 (as shown in Appendix D) as

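$$\sum_{k}\log p({\bm{x}}^{(k)},{\bm{y}}^{(k)}|\bm{\theta})=C+\sum_{k}\left(\mathcal{L}_{k}^{y|x}+\mathcal{L}_{k}^{x}\right),$$

$$\mathcal{L}_{k}^{y|x}=-\frac{1}{2}\left\|\tilde{{\bm{y}}}^{(k)}-{\bm{W}}_{y}\mathbb{E}_{p({\bm{z}}|{\bm{x}}^{(k)})}[{\bm{z}}]\right\|_{\bm{\Sigma}_{y|x}}^{2},$$

$$\mathcal{L}_{k}^{x}=-\frac{1}{2}\left\|\tilde{{\bm{x}}}^{(k)}-{\bm{W}}_{x}\mathbb{E}_{p({\bm{z}}|{\bm{x}}^{(k)})}[{\bm{z}}]\right\|_{\bm{\Psi}_{x}\bm{\Sigma}_{x}^{-1}\bm{\Psi}_{x}}^{2},$$

$$C=-\frac{N}{2}\left(\log|2\pi\bm{\Sigma}_{y|x}|+\log|2\pi\bm{\Sigma}_{x}|\right).$$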

The exact expression for $\bm{\Sigma}_{{\bm{y}}|{\bm{x}}}$ is given in the same appendix. This interpretation shows that for each pair of points, the objective is to minimise the reconstruction errors of ${\bm{x}}$ and ${\bm{y}}$ , given a projection into the latent space $\mathbb{E}_{p({\bm{z}}|{\bm{x}}_{k})}[{\bm{z}}]$ . By utilising the symmetry of Equation 2, we can show the converse is true as well — maximising the joint probability also minimises the reconstruction loss given the latent projections $\mathbb{E}_{p({\bm{z}}|{\bm{y}}_{k})}[{\bm{z}}]$ . Thus, this forces the latent embeddings of ${\bm{x}}_{k}$ and ${\bm{y}}_{k}$ to be close in the latent space. This provides intuition as to why embedding into this common latent space is a good alignment procedure.

In (Browne, 1979; Bach and Jordan, 2005) it is shown that the maximum likelihood estimates for $\{\bm{\Psi}_{i},{\bm{W}}_{i}\}$ can be attained in closed form

[TABLE]

where

[TABLE]

The projections into the latent space from ${\bm{x}}$ are given by (as proved in Appendix B)

[TABLE]

Evaluated at the MLE, Bach and Jordan (2005) show that Equation 4 can be reduced to

$$\mathbb{E}_{p({\bm{z}}|{\bm{x}})}[{\bm{z}}]={\bm{P}}^{1/2}{\bm{U}}_{x}^{\top}({\bm{x}}-\bm{\mu}_{x}).$$
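The closed-form fit and the reduced projection can be sketched in NumPy as follows. The sketch assumes the CCA-based solution of Bach and Jordan (2005): with ${\bm{V}}_{x}{\bm{P}}{\bm{V}}_{y}^{\top}$ the SVD of ${\bm{S}}_{xx}^{-1/2}{\bm{S}}_{xy}{\bm{S}}_{yy}^{-1/2}$ and ${\bm{U}}_{i}={\bm{S}}_{ii}^{-1/2}{\bm{V}}_{i}$, the latent projection is ${\bm{P}}^{1/2}{\bm{U}}_{x}^{\top}({\bm{x}}-\bm{\mu}_{x})$. The function names, the synthetic data and the choice of truncating to the top $k$ components are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs / np.sqrt(vals)) @ vecs.T

def fit_ibfa_projections(X, Y, k):
    """Return functions that map rows of X and Y to their latent posterior means."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    S_xx, S_yy, S_xy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Sx_is, Sy_is = inv_sqrt(S_xx), inv_sqrt(S_yy)
    # SVD of the whitened cross-covariance: V_x P V_y^T.
    V_x, p, V_y_T = np.linalg.svd(Sx_is @ S_xy @ Sy_is)
    U_x = Sx_is @ V_x[:, :k]                 # canonical directions for x
    U_y = Sy_is @ V_y_T.T[:, :k]             # canonical directions for y
    P_half = np.sqrt(p[:k])                  # square roots of canonical correlations

    def proj_x(A):
        return ((A - mu_x) @ U_x) * P_half   # P^{1/2} U_x^T (x - mu_x), row-wise

    def proj_y(B):
        return ((B - mu_y) @ U_y) * P_half

    return proj_x, proj_y

# Synthetic "dictionary": paired vectors generated from a shared latent variable.
rng = np.random.default_rng(0)
n, k = 2000, 3
Z = rng.normal(size=(n, k))
X = Z @ rng.normal(size=(k, 10)) + 0.1 * rng.normal(size=(n, 10))
Y = Z @ rng.normal(size=(k, 8)) + 0.1 * rng.normal(size=(n, 8))

proj_x, proj_y = fit_ibfa_projections(X, Y, k)
Zx, Zy = proj_x(X), proj_y(Y)
print(Zx.shape, Zy.shape)
```

Paired rows of `Zx` and `Zy` land close to each other in the latent space, which is what makes retrieval in that space meaningful.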

2.2.1 Multiple-Battery Factor Analysis

Multiple-Battery Factor Analysis (MBFA) (McDonald, 1970; Browne, 1980) is a natural extension of IBFA that models more than two views of observables (i.e. multiple languages), as shown in Figure 2.

Formally, for a set of views $\{{\bm{x}}_{1},...,{\bm{x}}_{v}\}$ , we can write the model as

[TABLE]

Similar to IBFA the projections to the latent space are given by Equation 4, and the marginal yields a very similar form

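$$\mathcal{N}\left(\begin{bmatrix}{\bm{x}}_{1}\\ \vdots\\ {\bm{x}}_{v}\end{bmatrix};\begin{bmatrix}\bm{\mu}_{1}\\ \vdots\\ \bm{\mu}_{v}\end{bmatrix},\begin{bmatrix}{\bm{W}}_{1}{\bm{W}}_{1}^{\top}+\bm{\Psi}_{1}&\dots&{\bm{W}}_{1}{\bm{W}}_{v}^{\top}\\ \vdots&\ddots&\vdots\\ {\bm{W}}_{v}{\bm{W}}_{1}^{\top}&\dots&{\bm{W}}_{v}{\bm{W}}_{v}^{\top}+\bm{\Psi}_{v}\end{bmatrix}\right).$$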

Unlike IBFA, a closed form solution for maximising the marginal likelihood of MBFA is unknown. Because of this, we have to resort to iterative approaches as in Browne (1980) such as the natural extension of the EM algorithm proposed by Bach and Jordan (2005). Defining

[TABLE]

the EM updates are given by

[TABLE]

where ${\bm{S}}$ is the sample covariance matrix of the concatenated views (derivation provided in Appendix E). Browne (1980) shows that, under suitable conditions, the MLE of the parameters of MBFA is uniquely identifiable (up to a rotation that does not affect the method’s performance). We observed this in an empirical study: the solutions we converge to are always a rotation away from each other, irrespective of the parameters’ initialisation. This strongly suggests that any optimum is a global optimum, so we report only the results obtained when fitting from a single initialisation. The chosen initialisation point is provided as Equation (3.25) of Browne (1980).
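The EM procedure can be sketched as below. This is a hedged illustration rather than the authors' code: it uses the standard covariance-form factor analysis updates of Ghahramani et al. (1996), with ${\bm{M}}=(\mathbb{I}+{\bm{W}}^{\top}\bm{\Psi}^{-1}{\bm{W}})^{-1}$ and ${\bm{B}}={\bm{M}}{\bm{W}}^{\top}\bm{\Psi}^{-1}$ assumed as the auxiliary quantities, and imposes the block-diagonal constraint by masking. The random initialisation and the synthetic data are also assumptions (the paper initialises from Browne (1980)).

```python
import numpy as np

def block_diag_mask(dims):
    """Boolean mask that keeps only the per-view diagonal blocks."""
    total = sum(dims)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for d in dims:
        mask[start:start + d, start:start + d] = True
        start += d
    return mask

def log_likelihood(S, W, Psi):
    """Average Gaussian log-likelihood of centred data with sample covariance S."""
    Sigma = W @ W.T + Psi
    _, logdet = np.linalg.slogdet(Sigma)
    d = S.shape[0]
    return -0.5 * (d * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(Sigma, S)))

def em_step(S, W, Psi, mask):
    """One EM update of (W, Psi) with a block-diagonal constraint on Psi."""
    k = W.shape[1]
    Psi_inv_W = np.linalg.solve(Psi, W)               # Psi^{-1} W
    M = np.linalg.inv(np.eye(k) + W.T @ Psi_inv_W)    # posterior covariance of z
    B = M @ Psi_inv_W.T                               # posterior mean map: E[z|m] = B (m - mu)
    W_new = S @ B.T @ np.linalg.inv(M + B @ S @ B.T)
    R = S - W_new @ B @ S
    Psi_new = 0.5 * (R + R.T) * mask                  # symmetrise, keep diagonal blocks only
    return W_new, Psi_new

# Three synthetic views of dimensions 4, 3 and 5 sharing a 2-dimensional latent.
rng = np.random.default_rng(0)
dims, k, n = (4, 3, 5), 2, 500
D = sum(dims)
data = rng.normal(size=(n, k)) @ rng.normal(size=(k, D)) + 0.5 * rng.normal(size=(n, D))
data -= data.mean(axis=0)
S = data.T @ data / n

mask = block_diag_mask(dims)
W, Psi = rng.normal(size=(D, k)), np.eye(D)
history = []
for _ in range(100):
    history.append(log_likelihood(S, W, Psi))
    W, Psi = em_step(S, W, Psi, mask)
print(history[0], history[-1])
```

Tracking the log-likelihood in `history` gives a simple convergence check, since exact EM updates never decrease it.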

3 Multilingual Factor Analysis

We coin the term Multilingual Factor Analysis for the application of methods based on IBFA and MBFA to model the generation of multilingual tuples from a shared latent space. We motivate our generative process with the compound model for language association presented by Weinreich (1953). In this model a lexical meaning entity (a concept) is responsible for associating the corresponding words in the two different languages.

We note that the structure in Figure 3 is very similar to our graphical model for IBFA specified in Figure 1. We can interpret our latent variable as the latent lexical concept responsible for associating (generating) the multilingual language pairs. Most theories that explain the interconnections between languages in the bilingual mind assume that “while phonological and morphosyntactic forms differ across languages, meanings and/or concepts are largely, if not completely, shared” (Pavlenko, 2009). This shows that our generative modelling is supported by established models of language interconnectedness in the bilingual mind.

Intuitively, our approach can be summarised as transforming monolingual representations by mapping them to a concept space in which lexical meaning across languages is aligned and then performing retrieval, translation and similarity-based tasks in that aligned concept space.

3.1 Comparison to Direct Methods

Methods that learn a direct linear transformation from ${\bm{x}}$ to ${\bm{y}}$ , such as Mikolov et al. (2013a); Artetxe et al. (2016); Smith et al. (2017); Lample et al. (2018) could also be interpreted as maximising the conditional likelihood

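$$\prod_{k}p({\bm{y}}^{(k)}|{\bm{x}}^{(k)})=\prod_{k}\mathcal{N}({\bm{y}}^{(k)};{\bm{W}}{\bm{x}}^{(k)}+\bm{\mu},\bm{\Psi}).$$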

As shown in Appendix F, the maximum likelihood estimate for ${\bm{W}}$ does not depend on the noise term $\bm{\Psi}$ . In addition, even if one were to fit $\bm{\Psi}$ , it is not clear how to utilise it to make predictions as the conditional expectation

$$\mathbb{E}_{p({\bm{y}}|{\bm{x}}^{(k)})}[{\bm{y}}]={\bm{W}}{\bm{x}}^{(k)}+\bm{\mu},$$

does not depend on the noise parameters. As this method is therefore not robust to noise, previous work has used extensive regularisation (e.g. by constraining $\bm{W}$ to be orthogonal) to avoid overfitting.
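The independence from the noise term can be checked numerically. The sketch below assumes the Gaussian conditional model $\mathcal{N}({\bm{y}};{\bm{W}}{\bm{x}}+\bm{\mu},\bm{\Psi})$ with a single full covariance $\bm{\Psi}$ shared by all pairs; it fits ${\bm{W}}$ by ordinary least squares and verifies that the gradient of the log-likelihood vanishes there for several arbitrary choices of $\bm{\Psi}$. The data and names (`W_hat`, `grad_W`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 200, 4, 3
X = rng.normal(size=(n, d_in))
Y = X @ rng.normal(size=(d_in, d_out)) + 0.3 * rng.normal(size=(n, d_out))

# Centring the data removes the offset mu from the problem.
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)

# Ordinary least-squares solution; Psi does not appear anywhere.
W_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc).T        # shape (d_out, d_in)

def grad_W(W, Psi):
    """Gradient of the conditional log-likelihood with respect to W."""
    residual = Yc - Xc @ W.T                            # rows are y_k - W x_k
    return np.linalg.solve(Psi, residual.T @ Xc)        # Psi^{-1} sum_k r_k x_k^T

# The gradient is zero at W_hat whatever positive-definite Psi we plug in.
for _ in range(3):
    A = rng.normal(size=(d_out, d_out))
    Psi = A @ A.T + np.eye(d_out)
    assert np.allclose(grad_W(W_hat, Psi), 0.0, atol=1e-8)
```

Because $\bm{\Psi}^{-1}$ multiplies the whole gradient from the left, it cannot move the stationary point, which is the argument made in Appendix F.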

3.2 Relation to CCA

CCA is a popular method used for multilingual alignment which is very closely related to IBFA, as detailed in Bach and Jordan (2005). Barber (2012) shows that CCA can be recovered as a limiting case of IBFA with constrained diagonal covariance $\bm{\Psi}_{x}=\sigma_{x}^{2}\mathbb{I},\>\bm{\Psi}_{y}=\sigma_{y}^{2}\mathbb{I}$ , as $\sigma_{x}^{2},\sigma_{y}^{2}\rightarrow 0$ . CCA therefore assumes that the emissions from the latent space to the observables are deterministic. This is a strong and unrealistic assumption given that word embeddings are learned from noisy corpora and stochastic learning algorithms.

4 Experiments

In this section, we empirically demonstrate the effectiveness of our generative approach on several benchmarks, and compare it with state-of-the-art methods. We first present cross-lingual (word-translation) evaluation tasks to evaluate the quality of our multi-lingual word embeddings. As a follow-up to the word retrieval task we also run experiments on cross-lingual sentence retrieval tasks. We further demonstrate the quality of our multi-lingual word embeddings on monolingual word- and sentence-level similarity tasks from Faruqui and Dyer (2014b), which we believe provides empirical evidence that the aligned embeddings preserve and even potentially enhance their monolingual quality.

4.1 Word Translation

This task is concerned with the problem of retrieving the translation of a given set of source words. We reproduce results in the same environment as Lample et al. (2018) for a fair comparison (code: github.com/Babylonpartners/MultilingualFactorAnalysis, based on github.com/facebookresearch/MUSE). We perform an ablation study to assess the effectiveness of our method in the Italian to English (it-en) setting of Smith et al. (2017) and Dinu et al. (2014). In these experiments we are interested in studying the effectiveness of our method compared to that of the Procrustes-based fitting used in Smith et al. (2017), without any post-processing steps to address the hubness problem (Dinu et al., 2014). In Table 1 we observe that our model is competitive with the results in Lample et al. (2018) and outperforms them in most cases. Given an expert dictionary, our method performs best out of all compared methods on all tasks except English to Russian (en-ru) translation, where it remains competitive. Surprisingly, in the semi-supervised setting IBFA narrows the gap to the method proposed in Lample et al. (2018) on languages where the dictionary of identical tokens across languages (i.e. the pseudo-dictionary from Smith et al. (2017)) is richer. However, even though it significantly outperforms SVD using the pseudo-dictionary, it cannot match the performance of the adversarial approach for more distant languages like English and Chinese (en-zh).

4.1.1 Detailed Comparison to Basic SVD

We present a more detailed comparison to the SVD method described in Smith et al. (2017). We focus on methods in their base form, that is, without post-processing techniques such as cross-domain similarity local scaling (CSLS) (Lample et al., 2018) or inverted softmax (ISF) (Smith et al., 2017). Note that Smith et al. (2017) used the scikit-learn implementation of CCA (scikit-learn is a commonly used Python library for scientific computing; see Pedregosa et al. (2011)), which uses an iterative estimation of partial least squares. This does not give the same results as the standard CCA procedure. In Table 2 we reproduce the results from Smith et al. (2017) using the dictionaries and embeddings provided by Dinu et al. (2014) (https://zenodo.org/record/2654864) and we compare our method (IBFA) using both the expert dictionaries from Dinu et al. (2014) and the pseudo-dictionaries as constructed in Smith et al. (2017). We significantly outperform both SVD and CCA, especially when using the pseudo-dictionaries.
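For reference, the two retrieval criteria compared in these experiments can be sketched as follows. Precision @1 with nearest-neighbour retrieval takes the highest cosine similarity, while CSLS, as defined in Lample et al. (2018), rescales it to $2\cos(x,y)-r_{T}(x)-r_{S}(y)$, where $r$ is the mean cosine similarity to the $k$ nearest neighbours in the other language. The sketch is a simplification: the neighbourhoods are computed only over the vectors passed in, the default `k=10` and all names are illustrative, and the data are synthetic.

```python
import numpy as np

def normalise(E):
    """Scale every row to unit length."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def cosine_scores(src, tgt):
    return normalise(src) @ normalise(tgt).T

def csls_scores(src, tgt, k=10):
    """CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y) for every source/target pair."""
    cos = cosine_scores(src, tgt)
    k_t, k_s = min(k, cos.shape[1]), min(k, cos.shape[0])
    r_src = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)   # each source vs its k nearest targets
    r_tgt = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)   # each target vs its k nearest sources
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

def precision_at_1(scores, gold):
    """Fraction of queries whose top-scoring target is the gold translation."""
    return float(np.mean(scores.argmax(axis=1) == gold))

# Toy aligned spaces: each source vector is a slightly perturbed target vector.
rng = np.random.default_rng(0)
tgt = rng.normal(size=(100, 50))
src = tgt + 0.01 * rng.normal(size=tgt.shape)
gold = np.arange(100)

p1_nn = precision_at_1(cosine_scores(src, tgt), gold)
p1_csls = precision_at_1(csls_scores(src, tgt), gold)
print(p1_nn, p1_csls)
```

Subtracting the neighbourhood terms penalises "hub" vectors that are close to many queries, which is the hubness problem the post-processing is meant to address.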

4.2 Word Similarity Tasks

This task assesses the monolingual quality of word embeddings. In this experiment, we fit both considered methods (CCA and IBFA) on the entire available dictionary of around 100k word pairs. We compare to CCA as used in Faruqui and Dyer (2014b) and standard monolingual word embeddings on the available tasks from Faruqui and Dyer (2014b). We evaluate our multilingual embeddings on the following tasks: WS353 (Finkelstein et al., 2002); WS-SIM and WS-REL (Agirre et al., 2009); RG65 (Rubenstein and Goodenough, 1965); MC-30 (Miller and Charles, 1991); MT-287 (Radinsky et al., 2011); MT-771 (Halawi et al., 2012); and MEN-TR (Bruni et al., 2012). These tasks consist of English word pairs that have been assigned ground truth similarity scores by humans. We use the test-suite provided by Faruqui and Dyer (2014a) (https://github.com/mfaruqui/eval-word-vectors) to evaluate our multilingual embeddings on these datasets. This test-suite calculates the similarity of words through cosine similarity in their representation spaces and then reports the Spearman correlation with the ground truth similarity scores provided by humans.
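The evaluation protocol described above (cosine similarity followed by Spearman correlation) can be sketched in a few lines. This is not the test-suite of Faruqui and Dyer (2014a); the function names, the rule of skipping out-of-vocabulary pairs and the toy data are assumptions made for illustration.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    ranks = np.empty(len(values))
    ranks[np.argsort(values)] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(embeddings, scored_pairs):
    """Correlate model similarities with human scores, skipping unknown words."""
    model, human = [], []
    for w1, w2, score in scored_pairs:
        if w1 in embeddings and w2 in embeddings:
            model.append(cosine(embeddings[w1], embeddings[w2]))
            human.append(score)
    return spearman(model, human)

# Toy embeddings and made-up human judgements.
embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
}
pairs = [("cat", "dog", 9.0), ("cat", "car", 1.0), ("dog", "car", 2.0), ("cat", "sofa", 3.0)]
rho = evaluate(embeddings, pairs)
print(rho)
```

Only the ordering of the similarities matters for this metric, so it is insensitive to any monotone rescaling of the cosine scores.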

As shown in Table 3, we observe a performance gain over CCA and monolingual word embeddings suggesting that we not only preserve the monolingual quality of the embeddings but also enhance it.

4.3 Monolingual Sentence Similarity Tasks

Semantic Textual Similarity (STS) is a standard benchmark used to assess sentence similarity metrics (Agirre et al., 2012, 2013, 2014, 2015, 2016). In this work, we use it to show that our alignment procedure does not degrade the quality of the embeddings at the sentence level. For both IBFA and CCA, we align English and one other language (from French, Spanish, German) using the entire dictionaries (of about 100k word pairs each) provided by Lample et al. (2018). We then use the procedure defined in Arora et al. (2016) to create sentence embeddings and use cosine similarity to output sentence similarity using those embeddings. The method’s performance on each set of embeddings is assessed using Spearman correlation to human-produced expert similarity scores. As evidenced by the results shown in Table 4, IBFA remains competitive using any of the three languages considered, while CCA shows a performance decrease.

4.4 Crosslingual Sentence Similarity Tasks

Europarl (Koehn, 2005) is a parallel corpus of sentences taken from the proceedings of the European parliament. In this set of experiments, we focus on its English-Italian (en-it) sub-corpus, in order to compare to previous methods. We report results under the framework of Lample et al. (2018). That is, we form sentence embeddings using the average of the tf-idf weighted word embeddings in the bag-of-words representation of the sentence. Performance is averaged over 2,000 randomly chosen source sentence queries and 200k target sentences for each language pair. Note that this is a different setup to the one presented in Smith et al. (2017), in which an unweighted average is used. The results are reported in Table 5. As we can see, IBFA outperforms all prior methods at precision @1, both using nearest neighbour retrieval, where it gains about 22 points absolute over SVD from English to Italian (62.7 versus 40.5), and using the CSLS retrieval metric.
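A tf-idf weighted average of word embeddings can be sketched as below. The exact weighting used in the framework of Lample et al. (2018) is not spelled out here, so this sketch assumes raw term counts for tf and $\log(N/\mathrm{df})$ for idf; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

import numpy as np

def idf_weights(corpus):
    """idf(t) = log(N / df(t)) over a corpus of tokenised sentences (assumed form)."""
    n_docs = len(corpus)
    df = Counter(tok for sentence in corpus for tok in set(sentence))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}

def sentence_embedding(tokens, embeddings, idf):
    """Weighted average of word vectors, with weight tf * idf per distinct word."""
    counts = Counter(t for t in tokens if t in embeddings and t in idf)
    if not counts:
        return None                                    # no known words in the sentence
    weights = np.array([c * idf[t] for t, c in counts.items()])
    vectors = np.stack([embeddings[t] for t in counts])
    if weights.sum() == 0:
        return vectors.mean(axis=0)                    # fall back to a plain average
    return weights @ vectors / weights.sum()

corpus = [["the", "cat"], ["the", "dog"]]
embeddings = {
    "the": np.array([1.0, 1.0]),
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
}
idf = idf_weights(corpus)
emb = sentence_embedding(["the", "cat"], embeddings, idf)
print(emb)
```

Words that occur in every sentence receive zero weight, so frequent function words contribute little to the sentence vector used for retrieval.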

4.5 Alignment of three languages

In an ideal scenario, when we have $v$ languages, we would not want to train a transformation between each pair, as that would involve storing $\mathcal{O}(v^{2})$ matrices. One way to overcome this problem is by aligning all embeddings to a common space. In this exploratory experiment, we constrain ourselves to aligning three languages at the same time, but the same methodology could be applied to an arbitrary number of languages. MBFA, the extension of IBFA described in Section 2.2.1, naturally lends itself to this task. What is needed for training this method is a dictionary of word triples across the three languages considered. We construct such a dictionary by taking the intersection of all 6 pairs of bilingual dictionaries for the three languages provided by Lample et al. (2018). We then train MBFA for 20,000 iterations of EM (a brief analysis of convergence is provided in Appendix G). Alternatively, with direct methods like Smith et al. (2017); Lample et al. (2018) one could align all languages to English and treat that as the common space.
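One way to read the dictionary construction is sketched below: a triple is kept only if every one of the six directed bilingual dictionaries contains the corresponding pair. This interpretation of "intersection", the data layout (a mapping from language pairs to sets of word pairs) and the function name `build_triples` are assumptions for illustration.

```python
from collections import defaultdict
from itertools import permutations

def build_triples(dicts, langs):
    """Return triples (wa, wb, wc) supported by all six directed dictionaries.

    dicts maps an ordered language pair, e.g. ("en", "fr"), to a set of
    (source_word, target_word) pairs.
    """
    a, b, c = langs
    a_to_c = defaultdict(set)
    for wa, wc in dicts[(a, c)]:
        a_to_c[wa].add(wc)

    triples = set()
    for wa, wb in dicts[(a, b)]:
        for wc in a_to_c.get(wa, ()):
            word = {a: wa, b: wb, c: wc}
            # Keep the triple only if every directed pair is attested.
            if all((word[s], word[t]) in dicts[(s, t)] for s, t in permutations(langs, 2)):
                triples.add((wa, wb, wc))
    return triples

# Toy dictionaries: "cat" is covered in all six directions, "dog" is not.
dicts = {
    ("en", "fr"): {("cat", "chat"), ("dog", "chien")},
    ("fr", "en"): {("chat", "cat"), ("chien", "dog")},
    ("en", "it"): {("cat", "gatto"), ("dog", "cane")},
    ("it", "en"): {("gatto", "cat"), ("cane", "dog")},
    ("fr", "it"): {("chat", "gatto")},
    ("it", "fr"): {("gatto", "chat"), ("cane", "chien")},
}
triples = build_triples(dicts, ("en", "fr", "it"))
print(triples)
```

A stricter or looser rule (for example, requiring only one direction per pair) would give a different number of triples, so the choice should be stated alongside any results.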

We compare both approaches and present their results in Table 6. As we can see, both methods experience a decrease in overall performance when compared to models fitted on just a pair of languages; however, MBFA performs better overall. That is, the direct approaches preserve their performance on translation to and from English, but translation from French to Italian decreases significantly. Meanwhile, MBFA suffers a decrease in each pair of languages, but it remains competitive with the direct methods on English translation. It is worth noting that as the number of aligned languages $v$ increases, there are $O(v)$ pairs of languages in which English participates and $O(v^{2})$ pairs in which it does not. This suggests that MBFA may generalise past three simultaneously aligned languages better than the direct methods.

4.6 Generating Random Word Pairs

We explore the generative process of IBFA by synthesising word pairs from noise, using a trained English-Spanish IBFA model. We follow the generative process specified in Equation 1 to generate 2,000 word vector pairs and then we find the nearest neighbour vector in each vocabulary and display the corresponding words. We then rank these 2,000 pairs according to their joint probability under the model and present the top 28 samples in Table 7. Note that whilst the sampled pairs are not exact translations, they have closely related meanings. The examples we found interesting are dreadful and despair; frightening and brutality; crazed and merry; unrealistic and questioning; misguided and conceal; reactionary and conservatism.
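The sampling procedure can be sketched as follows, assuming the IBFA generative process takes the form ${\bm{z}}\sim\mathcal{N}(\bm{0},\mathbb{I})$, ${\bm{x}}|{\bm{z}}\sim\mathcal{N}({\bm{W}}_{x}{\bm{z}}+\bm{\mu}_{x},\bm{\Psi}_{x})$ and likewise for ${\bm{y}}$. Two further assumptions are made: the joint probability is evaluated on the sampled vectors themselves, and nearest neighbours are found by cosine similarity. All parameters, vocabularies and function names below are toy placeholders, not a trained English-Spanish model.

```python
import numpy as np

def sample_pairs(W_x, W_y, mu_x, mu_y, Psi_x, Psi_y, n, rng):
    """Draw n pairs (x, y) that share a latent z ~ N(0, I)."""
    Z = rng.normal(size=(n, W_x.shape[1]))
    X = Z @ W_x.T + mu_x + rng.multivariate_normal(np.zeros(len(mu_x)), Psi_x, size=n)
    Y = Z @ W_y.T + mu_y + rng.multivariate_normal(np.zeros(len(mu_y)), Psi_y, size=n)
    return X, Y

def joint_log_prob(X, Y, W_x, W_y, mu_x, mu_y, Psi_x, Psi_y):
    """log N([x; y]; [mu_x; mu_y], W W^T + Psi) for every row pair."""
    W = np.vstack([W_x, W_y])
    d_x, D = len(mu_x), W.shape[0]
    Psi = np.zeros((D, D))
    Psi[:d_x, :d_x], Psi[d_x:, d_x:] = Psi_x, Psi_y
    Sigma = W @ W.T + Psi
    diff = np.hstack([X - mu_x, Y - mu_y])
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)

def nearest_words(V, vocab_vectors, vocab):
    """Closest vocabulary word (by cosine similarity) for every row of V."""
    unit = lambda E: E / np.linalg.norm(E, axis=1, keepdims=True)
    return [vocab[i] for i in (unit(V) @ unit(vocab_vectors).T).argmax(axis=1)]

rng = np.random.default_rng(0)
k, d_x, d_y, n = 3, 5, 4, 20
W_x, W_y = rng.normal(size=(d_x, k)), rng.normal(size=(d_y, k))
mu_x, mu_y = np.zeros(d_x), np.zeros(d_y)
Psi_x, Psi_y = 0.1 * np.eye(d_x), 0.1 * np.eye(d_y)
en_vocab, en_vecs = ["w%d" % i for i in range(30)], rng.normal(size=(30, d_x))
es_vocab, es_vecs = ["p%d" % i for i in range(30)], rng.normal(size=(30, d_y))

X, Y = sample_pairs(W_x, W_y, mu_x, mu_y, Psi_x, Psi_y, n, rng)
scores = joint_log_prob(X, Y, W_x, W_y, mu_x, mu_y, Psi_x, Psi_y)
order = np.argsort(-scores)                          # most probable pairs first
en_words = nearest_words(X, en_vecs, en_vocab)
es_words = nearest_words(Y, es_vecs, es_vocab)
top_pairs = [(en_words[i], es_words[i]) for i in order[:5]]
print(top_pairs)
```

With a trained model, `top_pairs` would play the role of the ranked list shown in Table 7.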

5 Conclusion

We have introduced a cross-lingual embedding alignment procedure based on a probabilistic latent variable model that improves performance across various tasks compared to previous methods, using both nearest neighbour retrieval and the CSLS criterion. We have shown that the resulting embeddings in this aligned space preserve their quality by presenting results on tasks that assess word and sentence-level monolingual similarity correlation with human scores. The resulting embeddings also significantly increase the precision of sentence retrieval in multilingual settings. Finally, the preliminary results we have shown on aligning more than two languages at the same time provide an exciting path for future research.

Appendix A Joint Distribution

We show the form of the joint distribution for 2 views. Concatenating our data and parameters as below, we can use Equation (3) of Ghahramani et al. (1996) to write

[TABLE]

It is clear that this generalises to any number of views of any dimension, as the concatenation operation does not make any assumptions.

Appendix B Projections to Latent Space $\mathbb{E}_{p({\bm{z}}|{\bm{x}})}[{\bm{z}}]$

We can query the joint Gaussian in Equation 18 using the rules in Sections 8.1.2 and 8.1.3 of Petersen et al. (2008), which gives

[TABLE]

Appendix C Derivation for the Marginal Likelihood

We want to compute $p(\bm{x},\bm{y}|\bm{\theta})$ so that we can learn the parameters $\bm{\theta}=\{\bm{\theta}_{x},\bm{\theta}_{y}\}$ , $\bm{\theta}_{i}=\{\bm{\mu}_{i},\bm{W}_{i},\bm{\Psi}_{i}\}$ by maximising the marginal likelihood, as is done in Factor Analysis.

From the joint $p(\bm{m},\bm{z}|\bm{\theta})$ , again using the rules from Petersen et al. (2008), Section 8.1.2, we get

[TABLE]

For the case of two views, the joint probability can be factored as

[TABLE]

where

[TABLE]

Appendix D Scaled Reconstruction Errors

[TABLE]

Setting $\bm{A}=\bm{\Psi}_{x}\bm{\Sigma}_{x}^{-1}\bm{\Psi}_{x}$ , we can re-parametrise as

[TABLE]

Appendix E Expectation Maximisation for MBFA

Define

[TABLE]

Hence

[TABLE]

This follows the same form as regular factor analysis, but with a block-diagonal constraint on $\bm{\Psi}$ . Thus by Equations (5) and (6) of Ghahramani et al. (1996), we apply EM as follows.

E-Step: Compute $\mathbb{E}[\bm{z}|\bm{x}]$ and $\mathbb{E}[\bm{zz^{\top}}|\bm{x}]$ given the parameters $\bm{\theta}_{t}=\{\bm{W}_{t},\bm{\Psi}_{t}\}$ .

[TABLE]

where

[TABLE]

Equation 26 is obtained by applying the Woodbury identity, and Equation 27 by applying the closely related push-through identity, as found in Section 3.2 of Petersen et al. (2008).
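The two identities can be checked numerically. The sketch below assumes the standard factor-analysis posterior, in which the covariance of $\bm{z}$ given $\bm{x}$ is $\bm{I}-\bm{W}^{\top}(\bm{W}\bm{W}^{\top}+\bm{\Psi})^{-1}\bm{W}$ and the posterior mean is $\bm{W}^{\top}(\bm{W}\bm{W}^{\top}+\bm{\Psi})^{-1}$ applied to the centred data. The exact arrangement of Equations 26 and 27 may differ, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2                                   # observed and latent dimensions (toy sizes)
W = rng.standard_normal((d, k))
Psi = np.diag(rng.uniform(0.5, 1.5, size=d))  # any positive-definite noise covariance works
Psi_inv = np.linalg.inv(Psi)
Sigma = W @ W.T + Psi                         # marginal covariance of x

# Posterior covariance of z: direct form, inverting a d x d matrix ...
cov_direct = np.eye(k) - W.T @ np.linalg.solve(Sigma, W)
# ... and the Woodbury form, inverting only a k x k matrix.
cov_woodbury = np.linalg.inv(np.eye(k) + W.T @ Psi_inv @ W)

# Matrix mapping (x - mu) to E[z | x]: direct form ...
gain_direct = W.T @ np.linalg.inv(Sigma)
# ... and the push-through form, which reuses the k x k inverse.
gain_push = cov_woodbury @ W.T @ Psi_inv

assert np.allclose(cov_direct, cov_woodbury)
assert np.allclose(gain_direct, gain_push)
```

The rewritten forms matter in practice because the latent dimension is typically no larger than the concatenated observed dimension, so the $k\times k$ inverse is the cheaper one.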

M-Step: Update parameters $\bm{\theta}_{t\!+\!1}\!=\!\{\bm{W}_{t\!+\!1},\bm{\Psi}_{t\!+\!1}\}$ .

Define

[TABLE]

By first observing

[TABLE]

update the parameters as follows.

[TABLE]

Imposing the block diagonal constraint,

[TABLE]

where $(\tilde{\bm{\Psi}})_{ii}=\bm{\Psi}_{i}$ .
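One EM iteration with the block-diagonal constraint can be sketched as follows. This is a minimal illustration based on the standard factor-analysis updates of Ghahramani et al. (1996), not the authors' implementation. It assumes the data have already been centred (so $\bm{\mu}$ is dropped) and that the views are concatenated column-wise; the function name `em_step` and the argument `block_sizes` are hypothetical.

```python
import numpy as np

def em_step(X, W, Psi, block_sizes):
    """One EM iteration for factor analysis with block-diagonal Psi.

    X: (n, d) centred data, the views concatenated along columns.
    W: (d, k) loadings. Psi: (d, d) noise covariance.
    block_sizes: dimensions of the views, summing to d.
    """
    n, _ = X.shape
    k = W.shape[1]

    # E-step: posterior moments of z for every data point.
    Psi_inv = np.linalg.inv(Psi)
    G = np.linalg.inv(np.eye(k) + W.T @ Psi_inv @ W)  # Cov[z | x], shared by all points
    Ez = X @ Psi_inv @ W @ G                          # row i holds E[z | x_i]
    Ezz = n * G + Ez.T @ Ez                           # sum_i E[z z^T | x_i]

    # M-step: loadings first, then the residual covariance.
    W_new = (X.T @ Ez) @ np.linalg.inv(Ezz)
    S = (X.T @ X - W_new @ Ez.T @ X) / n

    # Block-diagonal constraint: keep only the within-view blocks of S.
    Psi_new = np.zeros_like(S)
    start = 0
    for size in block_sizes:
        sl = slice(start, start + size)
        Psi_new[sl, sl] = S[sl, sl]
        start += size
    return W_new, Psi_new
```

Setting every entry of `block_sizes` to 1 recovers the diagonal $\bm{\Psi}$ of ordinary factor analysis, while one block per language gives the multi-view constraint.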

Appendix F Independence to Noise in Direct Methods

We are maximising the following quantity with respect to $\bm{\theta}=\{\bm{W},\bm{\mu},\bm{\Psi}\}$

[TABLE]

Then the partial derivative $\mathcal{Q}=\frac{\partial\log p(\bm{Y}|\bm{X},\bm{\theta})}{\partial\bm{W}}$ is proportional to

[TABLE]

The maximum likelihood is achieved when

[TABLE]

and since $\bm{\Psi}^{-1}$ has an inverse (namely $\bm{\Psi}$ ), this means that

[TABLE]

It is clear from here that the MLE of $\bm{W}$ does not depend on $\bm{\Psi}$ , thus we can conclude that adding a noise parameter to this directed linear model has no effect on its predictions.
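This independence can be illustrated numerically. The sketch below assumes the directed model $\bm{y}=\bm{W}\bm{x}+\bm{\mu}+\bm{\epsilon}$ with $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{\Psi})$, handles $\bm{\mu}$ by centring, and checks that the least-squares solution zeroes the gradient of the log-likelihood for two very different choices of $\bm{\Psi}$. The data are synthetic and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 200, 4, 3
X = rng.standard_normal((n, d_in))
Y = X @ rng.standard_normal((d_in, d_out)) + 0.1 * rng.standard_normal((n, d_out))

# Centring both sides is equivalent to setting mu to its maximum likelihood value.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

# Ordinary least squares, computed without any reference to Psi.
W_hat = np.linalg.lstsq(Xc, Yc, rcond=None)[0].T  # shape (d_out, d_in)

def grad_W(W, Psi):
    """Gradient of the Gaussian log-likelihood with respect to W (up to a constant)."""
    resid = Yc - Xc @ W.T
    return np.linalg.inv(Psi) @ resid.T @ Xc

# Two unrelated noise covariances: isotropic and a random full positive-definite matrix.
A = rng.standard_normal((d_out, d_out))
for Psi in (np.eye(d_out), A @ A.T + np.eye(d_out)):
    assert np.allclose(grad_W(W_hat, Psi), 0.0, atol=1e-8)
```

Because the residuals at `W_hat` are orthogonal to the inputs, pre-multiplying by any invertible $\bm{\Psi}^{-1}$ still yields zero, so the stationary point does not move.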

Appendix G Learning curve of EM

Figure 4 shows the negative log-likelihood of the three-language model over the first 5,000 iterations. As Table 8 shows, the precision of the learned model at iteration 1,000 is very close to its precision at iteration 20,000. This suggests that the model need not be trained to full convergence to work well.

Bibliography

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. Association for Computational Linguistics.
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263.
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. Semeval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91.
Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511.
Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, volume 1, pages 32–43.
Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. Semeval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393. Association for Computational Linguistics.
Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations, 2017.
