A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

Yoshinari Fujinuma; Jordan Boyd-Graber; Michael J. Paul

arXiv:1906.01926·cs.CL·March 24, 2022


TL;DR

This paper introduces a resource-free, graph modularity-based metric to evaluate cross-lingual word embeddings, correlating well with downstream task performance and aiding unsupervised embedding improvement, especially for distant languages.

Contribution

It proposes a novel intrinsic evaluation metric based solely on embedding structure, eliminating the need for external resources.

Findings

01

Modularity correlates with downstream task performance.

02

The metric improves unsupervised embeddings for distant language pairs.

03

It provides a resource-free way to evaluate cross-lingual embeddings.

Abstract

Cross-lingual word embeddings encode the meaning of words from different languages into a shared low-dimensional space. An important requirement for many downstream tasks is that word similarity should be independent of language - i.e., word vectors within one language should not be more similar to each other than to words in another language. We measure this characteristic using modularity, a network measurement that measures the strength of clusters in a graph. Modularity has a moderate to strong correlation with three downstream tasks, even though modularity is based only on the structure of embeddings and does not require any external resources. We show through experiments that modularity can serve as an intrinsic validation metric to improve unsupervised cross-lingual word embeddings, particularly on distant language pairs in low-resource settings.

Tables

Table 1: Dataset statistics (source and number of tokens) for each language including both Indo-European and non-Indo-European languages.

Language	Corpus	Tokens
English (en)	News	23M
Spanish (es)	News	25M
Italian (it)	News	23M
Danish (da)	News	20M
Japanese (ja)	News	28M
Hungarian (hu)	News	20M
Amharic (am)	lorelei	28M

Table 2: Average classification accuracy on (en $\rightarrow$ da, es, it, ja) along with the average modularity of five cross-lingual word embeddings. muse has the best accuracy, captured by its low modularity.

Type	Method	Acc.	Modularity
Supervised	mse	0.399	0.529
Supervised	cca	0.502	0.513
Supervised	mse+Orth	0.628	0.452
Unsupervised	muse	0.711	0.431
Unsupervised	vecmap	0.643	0.432

Table 3: Nearest neighbors in an en-ja embedding. Unlike the ja word “market”, the ja word “closing price” has no en vector nearby.

市場 “market”	終値 “closing price”
新興 “new coming”	上げ幅 “gains”
market	株価 “stock price”
markets	年初来 “yearly”
軟調 “bearish”	続落 “continued fall”
マーケット “market”	月限 “contract month”
活況 “activity”	安値 “low price”
相場 “market price”	続伸 “continuous rise”
底入 “bottoming”	前日 “previous day”
為替 “exchange”	先物 “futures”
ctoc	小幅 “narrow range”

Table 4: Average precision@1 on (en $\rightarrow$ da, es, it, ja) along with the average modularity of the cross-lingual word embeddings trained with different methods. vecmap scores the best P@1, which is captured by its low modularity.

Type	Method	P@1	Modularity
Supervised	mse	7.30	0.529
Supervised	cca	3.06	0.513
Supervised	mse+Orth	10.57	0.452
Unsupervised	muse	11.83	0.431
Unsupervised	vecmap	12.92	0.432

Table 5: Correlation between modularity and AUC on document retrieval. Modularity shows a moderate correlation with performance on this task.

Lang.	Method	AUC	Mod.
am	mse	0.578	0.628
am	cca	0.345	0.501
am	mse+Orth	0.606	0.480
am	muse	0.555	0.475
am	vecmap	0.592	0.506
hu	mse	0.561	0.598
hu	cca	0.675	0.506
hu	mse+Orth	0.612	0.447
hu	muse	0.664	0.445
hu	vecmap	0.612	0.432
Spearman correlation $\rho$		$-0.378$

Table 6: bli results (P@1 $\times 100\%$) from en to each target language with different validation metrics for muse: default (csls-10K) and modularity (Mod-10K). We report the average (Avg.) and the best (Best) from ten runs with ten random seeds for each validation metric. Bold values are mappings that are not shared between the two validation metrics. Mod-10K improves the robustness of muse on distant language pairs.

Family	Lang.	csls-10K Avg.	csls-10K Best	Mod-10K Avg.	Mod-10K Best
Germanic	da	52.62	60.27	52.18	60.13
Germanic	de	75.27	75.60	75.16	75.53
Romance	es	74.35	83.00	74.32	83.00
Romance	it	78.41	78.80	78.43	78.80
Indo-Iranian	fa	27.79	33.40	27.77	33.40
Indo-Iranian	hi	25.71	33.73	26.39	34.20
Indo-Iranian	bn	0.00	0.00	0.09	0.87
Others	fi	4.71	47.07	4.71	47.07
Others	hu	52.55	54.27	52.35	54.73
Others	ja	18.13	49.69	36.13	49.69
Others	zh	5.01	37.20	10.75	37.20
Others	ko	16.98	20.68	17.34	22.53
Others	ar	15.43	33.33	15.71	33.67
Others	id	67.69	68.40	67.82	68.40
Others	vi	0.01	0.07	0.01	0.07

Equations

$$a_{l} = \frac{1}{2m} \sum_{i} d_{i} \mathds{1}\left[g_{i} = l\right]$$

$$e_{ll} = \frac{1}{2m} \sum_{ij} A_{ij} \mathds{1}\left[g_{i} = l\right] \mathds{1}\left[g_{j} = l\right]$$

$$Q = \sum_{l=1}^{L} \left(e_{ll} - a_{l}^{2}\right)$$

$$Q_{norm} = \frac{Q}{Q_{max}}, \quad \text{where } Q_{max} = 1 - \sum_{l=1}^{L} a_{l}^{2}$$

$$\text{csls}(Ws, t) = 2\cos(Ws, t) - r(Ws) - r(t)$$


Code & Models

Repositories

akkikiki/modularity_metric
Official


Full text

Yoshinari Fujinuma (Computer Science, University of Colorado)

Jordan Boyd-Graber (cs, iSchool, umiacs, lsc, University of Maryland)

Michael J. Paul (Information Science, University of Colorado)

Our code is at https://github.com/akkikiki/modularity_metric

1 Introduction

The success of monolingual word embeddings in natural language processing (Mikolov et al., 2013b) has motivated extensions to cross-lingual settings. Cross-lingual word embeddings, where multiple languages share a single distributed representation, work well for classification (Klementiev et al., 2012; Ammar et al., 2016) and machine translation (Lample et al., 2018; Artetxe et al., 2018b), even with few bilingual pairs (Artetxe et al., 2017) or no supervision at all (Zhang et al., 2017; Conneau et al., 2018; Artetxe et al., 2018a).

Typically the quality of cross-lingual word embeddings is measured with respect to how well they improve a downstream task. However, sometimes it is not possible to evaluate embeddings for a specific downstream task, for example a future task that does not yet have data or on a rare language that does not have resources to support traditional evaluation. In such settings, it is useful to have an intrinsic evaluation metric: a metric that looks at the embedding space itself to know whether the embedding is good without resorting to an extrinsic task. While extrinsic tasks are the ultimate arbiter of whether cross-lingual word embeddings work, intrinsic metrics are useful for low-resource languages where one often lacks the annotated data that would make an extrinsic evaluation possible.

However, few intrinsic measures exist for cross-lingual word embeddings, and those that do exist require external linguistic resources (e.g., sense-aligned corpora in Ammar et al. (2016)). The requirement of language resources makes this approach limited or impossible for low-resource languages, which are the languages where intrinsic evaluations are most needed. Moreover, requiring language resources can bias the evaluation toward words in the resources rather than evaluating the embedding space as a whole.

Our solution involves a graph-based metric that considers the characteristics of the embedding space without using linguistic resources. To sketch the idea, imagine a cross-lingual word embedding space where it is possible to draw a hyperplane that separates all word vectors in one language from all vectors in another. Without knowing anything about the languages, it is easy to see that this is a problematic embedding: the representations of the two languages are in distinct parts of the space rather than using a shared space. While this example is exaggerated, vectors clustered by language often appear within smaller neighborhoods of the embedding space, and we want to discover these clusters.

To measure how well word embeddings are mixed across languages, we draw on concepts from network science. Specifically, some cross-lingual word embeddings are modular by language: vectors in one language are consistently closer to each other than to vectors in another language (Figure 1). When embeddings are modular, they often fail on downstream tasks (Section 2).

Modularity is a concept from network theory (Section 3); because network theory is applied to graphs, we turn our word embeddings into a graph by connecting nearest-neighbors—based on vector similarity—to each other. Our hypothesis is that modularity will predict how useful the embedding is in downstream tasks; low-modularity embeddings should work better.

We explore the relationship between modularity and three downstream tasks (Section 4) that use cross-lingual word embeddings differently: (i) cross-lingual document classification; (ii) bilingual lexical induction in Italian, Japanese, Spanish, and Danish; and (iii) low-resource document retrieval in Hungarian and Amharic, finding moderate to strong negative correlations between modularity and performance. Furthermore, using modularity as a validation metric (Section 5) makes muse (Conneau et al., 2018), an unsupervised model, more robust on distant language pairs. Compared to other existing intrinsic evaluation metrics, modularity captures complementary properties and is more predictive of downstream performance despite needing no external resources (Section 6).

2 Background: Cross-Lingual Word Embeddings and their Evaluation

There are many approaches to training cross-lingual word embeddings. This section reviews the embeddings we consider in this paper, along with existing work on evaluating those embeddings.

2.1 Cross-Lingual Word Embeddings

We focus on methods that learn a cross-lingual vector space through a post-hoc mapping between independently constructed monolingual embeddings (Mikolov et al., 2013a; Vulić and Korhonen, 2016). Given two separate monolingual embeddings and a bilingual seed lexicon, a projection matrix can map translation pairs in a given bilingual lexicon to be near each other in a shared embedding space. A key assumption is that cross-lingually coherent words have “similar geometric arrangements” (Mikolov et al., 2013a) in the embedding space, enabling “knowledge transfer between languages” (Ruder et al., 2017).

We focus on mapping-based approaches for two reasons. First, these approaches are applicable to low-resource languages because they do not require large bilingual dictionaries or parallel corpora (Artetxe et al., 2017; Conneau et al., 2018); Ruder et al. (2017) offer a detailed discussion of alternative approaches. Second, this focus separates the word embedding task from the cross-lingual mapping, which allows us to focus on evaluating the specific multilingual component in Section 4.

2.2 Evaluating Cross-Lingual Embeddings

Most work on evaluating cross-lingual embeddings focuses on extrinsic evaluation of downstream tasks (Upadhyay et al., 2016; Glavas et al., 2019). However, intrinsic evaluations are crucial since many low-resource languages lack annotations needed for downstream tasks. Thus, our goal is to develop an intrinsic measure that correlates with downstream tasks without using any external resources. This section summarizes existing work on intrinsic methods of evaluation for cross-lingual embeddings.

One widely used intrinsic measure for evaluating the coherence of monolingual embeddings is qvec (Tsvetkov et al., 2015). Ammar et al. (2016) extend qvec by using canonical correlation analysis (qvec-cca) to make the scores comparable across embeddings with different dimensions. However, while both qvec and qvec-cca can be extended to cross-lingual word embeddings, they are limited: they require external annotated corpora. This is problematic in cross-lingual settings since this requires annotation to be consistent across languages (Ammar et al., 2016).

Other internal metrics do not require external resources, but those consider only part of the embeddings. Conneau et al. (2018) and Artetxe et al. (2018a) use a validation metric that calculates similarities of cross-lingual neighbors to conduct model selection. Our approach differs in that we consider whether cross-lingual nearest neighbors are relatively closer than intra-lingual nearest neighbors.

Søgaard et al. (2018) use the similarities of intra-lingual neighbors and compute graph similarity between two monolingual lexical subgraphs built by subsampled words in a bilingual lexicon. They further show that the resulting graph similarity has a high correlation with bilingual lexical induction on muse (Conneau et al., 2018). However, their graph similarity uses only intra-lingual similarities, not cross-lingual ones.

These existing metrics are limited by either requiring external resources or considering only part of the embedding structure (e.g., intra-lingual but not cross-lingual neighbors). In contrast, our work develops an intrinsic metric which is highly correlated with multiple downstream tasks but does not require external resources, and considers both intra- and cross-lingual neighbors.

Related Work

A related line of work is the intrinsic evaluation measures of probabilistic topic models, which are another low-dimensional representation of words similar to word embeddings. Metrics based on word co-occurrences have been developed for measuring the monolingual coherence of topics (Newman et al., 2010; Mimno et al., 2011; Lau et al., 2014). Less work has studied evaluation of cross-lingual topics (Mimno et al., 2009). Some researchers have measured the overlap of direct translations across topics (Boyd-Graber and Blei, 2009), while Hao et al. (2018) propose a metric based on co-occurrences across languages that is more general than direct translations.

3 Approach: Graph-Based Diagnostics for Detecting Clustering by Language

This section describes our graph-based approach to measure the intrinsic quality of a cross-lingual embedding space.

3.1 Embeddings as Lexical Graphs

We posit that we can understand the quality of cross-lingual embeddings by analyzing characteristics of a lexical graph (Pelevina et al., 2016; Hamilton et al., 2016). The lexical graph has words as nodes and edges weighted by their similarity in the embedding space. Given a pair of words $(i,j)$ and associated word vectors $(v_{i},v_{j})$ , we compute the similarity between two words by their vector similarity. We encode this similarity in a weighted adjacency matrix $A$ : $A_{ij}\equiv\max(0,\text{cos\_sim}(v_{i},v_{j}))$ . However, nodes are only connected to their $k$ -nearest neighbors (Section 6.2 examines the sensitivity to $k$ ); all other edges become zero. Finally, each node $i$ has a label $g_{i}$ indicating the word’s language.
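The graph construction described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it uses exact brute-force search rather than approximate nearest neighbors, and it assumes the graph is made undirected by adding each $k$-nearest-neighbor edge in both directions. The names `cosine` and `build_lexical_graph` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_lexical_graph(vectors, k=3):
    """Return a weighted adjacency dict {i: {j: A_ij}}.

    vectors: word vectors from all languages pooled into one list.
    Each node is linked to its k most similar other nodes, with weight
    A_ij = max(0, cos_sim(v_i, v_j)); all other entries stay absent (zero).
    """
    n = len(vectors)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        sims = [(cosine(vectors[i], vectors[j]), j) for j in range(n) if j != i]
        sims.sort(reverse=True)
        for sim, j in sims[:k]:
            weight = max(0.0, sim)
            if weight > 0.0:
                adj[i][j] = weight
                adj[j][i] = weight  # assumption: symmetrize the graph
    return adj
```

The language label $g_{i}$ of each node is kept separately (for example in a list aligned with `vectors`), since the graph itself only stores similarities.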

3.2 Clustering by Language

We focus on a phenomenon that we call “clustering by language”, when word vectors in the embedding space tend to be more similar to words in the same language than to words in the other. For example in Figure 2, the intra-lingual nearest neighbors of “slow” have higher similarity in the embedding space than semantically related cross-lingual words. This indicates that words are represented differently across the two languages. Our hypothesis is therefore that clustering by language degrades the quality of cross-lingual embeddings when used in downstream tasks.

3.3 Modularity of Lexical Graphs

With a labeled graph, we can now ask whether the graph is modular (Newman, 2010). In a cross-lingual lexical graph, modularity is the degree to which words are more similar to words in the same language than to words in a different language. High modularity is undesirable, because the representation of words is not transferred across languages. If the nearest neighbors of the words are instead within the same language, then the languages are not mapped into the cross-lingual space consistently. In our setting, the language $l$ of each word defines its group, and high modularity indicates embeddings are more similar within languages than across languages (Newman, 2003; Newman and Girvan, 2004). In other words, good embeddings should have low modularity.

Conceptually, the modularity of a lexical graph is the difference between the proportion of edges in the graph that connect two nodes from the same language and the expected proportion of such edges in a randomly connected lexical graph. If edges were random, the number of edges starting from node $i$ within the same language would be the degree of node $i$ , $d_{i}=\sum_{j}A_{ij}$ for a weighted graph, following Newman (2004), times the proportion of words in that language. Summing over all nodes gives the expected number of edges within a language,

$$a_{l}=\frac{1}{2m}\sum_{i}d_{i}\mathds{1}\left[g_{i}=l\right],$$

where $m$ is the number of edges, $g_{i}$ is the label of node $i$, and $\mathds{1}\left[\cdot\right]$ is an indicator function that evaluates to $1$ if the argument is true and $0$ otherwise.

Next, we count the fraction of edges $e_{ll}$ that connect words of the same language:

$$e_{ll}=\frac{1}{2m}\sum_{ij}A_{ij}\mathds{1}\left[g_{i}=l\right]\mathds{1}\left[g_{j}=l\right].$$

Given $L$ different languages, we calculate overall modularity $Q$ by taking the difference between $e_{ll}$ and $a_{l}^{2}$ for all languages:

$$Q=\sum_{l=1}^{L}\left(e_{ll}-a_{l}^{2}\right).$$

Since $Q$ does not necessarily have a maximum value of $1$ , we normalize modularity:

$$Q_{norm}=\frac{Q}{Q_{max}},\quad\text{where } Q_{max}=1-\sum_{l=1}^{L}a_{l}^{2}.$$
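As a worked illustration of why the normalization matters (a toy case that is not taken from the paper), consider $L=2$ languages whose nodes carry equal total degree and a graph with no cross-lingual edges at all. Then $a_{1}=a_{2}=\tfrac{1}{2}$ and $e_{11}=e_{22}=\tfrac{1}{2}$, so

$$Q=\left(\tfrac{1}{2}-\tfrac{1}{4}\right)+\left(\tfrac{1}{2}-\tfrac{1}{4}\right)=\tfrac{1}{2},\qquad Q_{max}=1-\left(\tfrac{1}{4}+\tfrac{1}{4}\right)=\tfrac{1}{2},\qquad Q_{norm}=\frac{Q}{Q_{max}}=1.$$

Unnormalized modularity tops out at $\tfrac{1}{2}$ for this perfectly separated graph, while $Q_{norm}$ reaches $1$. In the opposite toy case, where every edge is cross-lingual, $e_{11}=e_{22}=0$ and therefore $Q=-\tfrac{1}{2}$ and $Q_{norm}=-1$.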

The higher the modularity, the more words from the same language appear as nearest neighbors. Figure 1 shows an example of a lexical subgraph with low modularity (left, $Q_{norm}=0.143$) and high modularity (right, $Q_{norm}=0.672$). In Figure 1(b), the lexical graph is modular since “firefox” does not encode the same sense in both languages.
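To make the definitions of $a_{l}$, $e_{ll}$, $Q$, and $Q_{norm}$ concrete, the following sketch computes normalized modularity for a small labeled graph. It is an illustration under stated assumptions, not the released code: the graph is a symmetric weighted adjacency dict, and $2m$ is taken to be the sum of all entries $A_{ij}$, in line with the weighted degree $d_{i}=\sum_{j}A_{ij}$. The function name `normalized_modularity` is hypothetical.

```python
def normalized_modularity(adj, labels):
    """Normalized modularity Q_norm of a labeled, weighted, undirected graph.

    adj: symmetric adjacency dict {i: {j: A_ij}}.
    labels: labels[i] is the language label g_i of node i.
    """
    # Assumption: 2m is the total weight summed over all (i, j) entries.
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    if two_m == 0:
        return 0.0
    a = {}  # a_l: share of total degree held by nodes of language l
    e = {}  # e_ll: share of edge weight joining two nodes of language l
    for i, nbrs in adj.items():
        lang = labels[i]
        degree = sum(nbrs.values())
        same = sum(w for j, w in nbrs.items() if labels[j] == lang)
        a[lang] = a.get(lang, 0.0) + degree / two_m
        e[lang] = e.get(lang, 0.0) + same / two_m
    q = sum(e[lang] - a[lang] ** 2 for lang in a)
    q_max = 1.0 - sum(v ** 2 for v in a.values())
    return q / q_max if q_max > 0 else 0.0


labels = ["en", "en", "ja", "ja"]
separated = {0: {1: 1.0}, 1: {0: 1.0}, 2: {3: 1.0}, 3: {2: 1.0}}
mixed = {0: {2: 1.0}, 2: {0: 1.0}, 1: {3: 1.0}, 3: {1: 1.0}}
print(normalized_modularity(separated, labels))  # only intra-lingual edges
print(normalized_modularity(mixed, labels))      # only cross-lingual edges
```

In this toy input, the graph with only intra-lingual edges scores $1$ and the graph with only cross-lingual edges scores $-1$, matching the intuition that well-mixed embeddings should have low modularity.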

Our hypothesis is that cross-lingual word embeddings with lower modularity will be more successful in downstream tasks. If this hypothesis holds, then modularity could be a useful metric for cross-lingual evaluation.

4 Experiments: Correlation of Modularity with Downstream Success

We now investigate whether modularity can predict the effectiveness of cross-lingual word embeddings on three downstream tasks: (i) cross-lingual document classification, (ii) bilingual lexical induction, and (iii) document retrieval in low-resource languages. If modularity correlates with task performance, it can characterize embedding quality.

4.1 Data

To investigate the relationship between embedding effectiveness and modularity, we explore five different cross-lingual word embeddings on six language pairs (Table 1).

Monolingual Word Embeddings

All monolingual embeddings are trained using a skip-gram model with negative sampling (Mikolov et al., 2013b). The dimension size is $100$ or $200$. All other hyperparameters are default in Gensim (Řehůřek and Sojka, 2010). News articles except for Amharic are from Leipzig Corpora (Goldhahn et al., 2012). For Amharic, we use documents from lorelei (Strassel and Tracey, 2016). MeCab (Kudo et al., 2004) tokenizes Japanese sentences.

Bilingual Seed Lexicon

For supervised methods, bilingual lexicons from Rolston and Kirchhoff (2016) induce all cross-lingual embeddings except for Danish, which uses Wiktionary (https://en.wiktionary.org/).

4.2 Cross-Lingual Mapping Algorithms

We use three supervised (mse, mse+Orth, cca) and two unsupervised (muse, vecmap) cross-lingual mappings (we use the implementations from the original authors with default parameters unless otherwise noted):

Mean-squared error (mse)

Mikolov et al. (2013a) minimize the mean-squared error of bilingual entries in a seed lexicon to learn a projection between two embeddings. We use the implementation by Artetxe et al. (2016).

mse with orthogonal constraints (mse+Orth)

Xing et al. (2015) add length normalization and orthogonal constraints to preserve the cosine similarities in the original monolingual embeddings. Artetxe et al. (2016) further preprocess monolingual embeddings by mean centering, i.e., one round of iterative normalization (Zhang et al., 2019).

Canonical Correlation Analysis (cca)

Faruqui and Dyer (2014) map two monolingual embeddings into a shared space by maximizing the correlation between translation pairs in a seed lexicon.

Conneau et al. (2018, muse)

use language-adversarial learning (Ganin et al., 2016) to induce the initial bilingual seed lexicon, followed by a refinement step, which iteratively solves the orthogonal Procrustes problem (Schönemann, 1966; Artetxe et al., 2017), aligning embeddings without an external bilingual lexicon. Like mse+Orth, vectors are unit length and mean centered. Since muse is unstable (Artetxe et al., 2018a; Søgaard et al., 2018), we report the best of five runs.

Artetxe et al. (2018a, vecmap)

induce an initial bilingual seed lexicon by aligning intra-lingual similarity matrices computed from each monolingual embedding. We report the best of five runs to address uncertainty from the initial dictionary.

4.3 Modularity Implementation

We implement modularity using random projection trees (Dasgupta and Freund, 2008) to speed up the extraction of $k$-nearest neighbors (https://github.com/spotify/annoy), tuning $k=3$ on the German rcv2 dataset (Section 6.2).

4.4 Task 1: Document Classification

We now explore the correlation of modularity and accuracy on cross-lingual document classification. We classify documents from the Reuters rcv1 and rcv2 corpora (Lewis et al., 2004). Documents have one of four labels (Corporate/Industrial, Economics, Government/Social, Markets). We follow Klementiev et al. (2012), except we use all en training documents and documents in each target language (da, es, it, and ja) as tuning and test data. After removing out-of-vocabulary words, we split documents in target languages into $10\%$ tuning data and $90\%$ test data. Test data are 10,067 documents for da, 25,566 for it, 58,950 for ja, and 16,790 for es. We exclude languages Reuters lacks: hu and am. We use deep averaging networks (Iyyer et al., 2015, dan) with three layers, 100 hidden states, and 15 epochs as our classifier. The dan had better accuracy than averaged perceptron (Collins, 2002) in Klementiev et al. (2012).

Results

We report the correlation value computed from the data points in Figure 3. Spearman’s correlation between modularity and classification accuracy on all languages is $\rho=-0.665$ . Within each language pair, modularity has a strong correlation within en-es embeddings ( $\rho=-0.806$ ), en-ja ( $\rho=-0.794$ ), en-it ( $\rho=-0.784$ ), and a moderate correlation within en-da embeddings ( $\rho=-0.515$ ). muse has the best classification accuracy (Table 2), reflected by its low modularity.
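For readers unfamiliar with the statistic, the sketch below computes Spearman's rank correlation as the Pearson correlation of average ranks. It is run on the five averaged rows of Table 2 only, which is an assumption made for illustration: the reported $\rho=-0.665$ is computed from the per-language data points in Figure 3, so the value here is not expected to match it. The helper names `ranks` and `spearman` are hypothetical.

```python
def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = average
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Averaged values from Table 2, in the order mse, cca, mse+Orth, muse, vecmap.
modularity = [0.529, 0.513, 0.452, 0.431, 0.432]
accuracy = [0.399, 0.502, 0.628, 0.711, 0.643]
rho = spearman(modularity, accuracy)
print(rho)
```

On these five averaged points the two rankings are exactly reversed, so `rho` is $-1$; the weaker per-language correlations reported above reflect the spread of the individual data points.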

Error Analysis

A common error in en $\rightarrow$ ja classification is predicting Corporate/Industrial for documents labeled Markets. One cause is documents with 終値 “closing price”; this has few market-based English neighbors (Table 3). As a result, the model fails to transfer across languages.

4.5 Task 2: Bilingual Lexical Induction (bli)

Our second downstream task explores the correlation between modularity and bilingual lexical induction (bli). We evaluate on the test set from Conneau et al. (2018), but we remove pairs in the seed lexicon from Rolston and Kirchhoff (2016). The result is 2,099 translation pairs for es, 1,358 for it, 450 for da, and 973 for ja. We report precision@1 (P@1) for retrieving cross-lingual nearest neighbors by cross-domain similarity local scaling (Conneau et al., 2018, csls).

Results

Although this task ignores intra-lingual nearest neighbors when retrieving translations, modularity still has a high correlation ( $\rho=-0.785$ ) with P@1 (Figure 4). muse and vecmap, which have the lowest modularity, beat the three supervised methods (Table 4). P@1 is low compared to other work on the muse test set (e.g., Conneau et al. (2018)) because we filter out translation pairs which appeared in the large training lexicon compiled by Rolston and Kirchhoff (2016), and the raw corpora used to train monolingual embeddings (Table 1) are relatively small compared to Wikipedia.

4.6 Task 3: Document Retrieval in Low-Resource Languages

As a third downstream task, we turn to an important task for low-resource languages: lexicon expansion (Gupta and Manning, 2015; Hamilton et al., 2016) for document retrieval. Specifically, we start with a set of en seed words relevant to a particular concept, then find related words in a target language for which a comprehensive bilingual lexicon does not exist. We focus on the disaster domain, where events may require immediate nlp analysis (e.g., sorting sms messages to first responders).

We induce keywords in a target language by taking the $n$ nearest neighbors of the English seed words in a cross-lingual word embedding. We manually select sixteen disaster-related English seed words from Wikipedia articles, “Natural hazard” and “Anthropogenic hazard”. Examples of seed terms include “earthquake” and “flood”. Using the extracted terms, we retrieve disaster-related documents by keyword matching and assess the coverage and relevance of terms by area under the precision-recall curve (auc) with varying $n$ .
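The expansion and retrieval steps can be sketched as follows. This is a simplified illustration with toy placeholder data, not the authors' pipeline: it computes precision and recall for a single $n$, whereas the evaluation above sweeps $n$ and reports the area under the resulting precision-recall curve. All function names and the toy tokens are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seeds(seed_vectors, target_vectors, n):
    """Target-language words among the n nearest neighbors of any seed vector."""
    keywords = set()
    for seed in seed_vectors:
        ranked = sorted(target_vectors,
                        key=lambda w: cosine(seed, target_vectors[w]),
                        reverse=True)
        keywords.update(ranked[:n])
    return keywords


def retrieve(docs, keywords):
    """Indices of tokenized documents containing at least one keyword."""
    return [i for i, tokens in enumerate(docs) if keywords & set(tokens)]


def precision_recall(retrieved, relevant):
    """Precision and recall of the retrieved document indices."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy placeholder data: one English seed vector and three target-language words.
seeds = [[1.0, 0.0]]
targets = {"t_flood": [0.9, 0.1], "t_rain": [0.7, 0.3], "t_bread": [0.0, 1.0]}
docs = [["t_flood", "t_rain"], ["t_bread"], ["t_rain"]]
relevant = [0, 2]
for n in (1, 2):
    found = retrieve(docs, expand_seeds(seeds, targets, n))
    print(n, precision_recall(found, relevant))
```

Increasing $n$ adds more keywords, which tends to raise recall at the cost of precision; that trade-off is what the area under the curve summarizes.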

Test Corpora

As positively labeled documents, we use documents from the lorelei project (Strassel and Tracey, 2016) containing any disaster-related annotation. There are $64$ disaster-related documents in Amharic, and $117$ in Hungarian. We construct a set of negatively labeled documents from the Bible; because the lorelei corpus does not include negative documents and the Bible is available in all our languages (Christodouloupoulos and Steedman, 2015), we take the chapters of the gospels ( $89$ documents), which do not discuss disasters, and treat these as non-disaster-related documents.

Results

Modularity has a moderate correlation with auc ( $\rho=-0.378$ , Table 5). While modularity focuses on the entire vocabulary of cross-lingual word embeddings, this task focuses on a small, specific subset—disaster-relevant words—which may explain the low correlation compared to bli or document classification.

5 Use Case: Model Selection for muse

A common use case of intrinsic measures is model selection. We focus on muse (Conneau et al., 2018) since it is unstable, especially on distant language pairs (Artetxe et al., 2018a; Søgaard et al., 2018; Hoshen and Wolf, 2018), and therefore requires an effective metric for model selection. muse uses a validation metric in its two steps: (1) the language-adversarial step, and (2) the refinement step. First, the algorithm uses a validation metric to select an optimal mapping $W$ from those obtained by language-adversarial learning (Ganin et al., 2016). Then the selected mapping $W$ from the language-adversarial step is passed on to the refinement step (Artetxe et al., 2017) to re-select the optimal mapping $W$ using the same validation metric after each epoch of solving the orthogonal Procrustes problem (Schönemann, 1966).

Normally, muse uses an intrinsic metric, csls of the top 10K frequent words (Conneau et al., 2018, csls-10K). Given word vectors $s,t\in\mathbb{R}^{n}$ from a source and a target embedding, csls is a cross-lingual similarity metric,

$$\text{csls}(Ws,t)=2\cos(Ws,t)-r(Ws)-r(t),$$

where $W$ is the trained mapping after each epoch, and $r(x)$ is the average cosine similarity of the top $10$ cross-lingual nearest neighbors of a word $x$ .
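A brute-force sketch of this score is given below. It assumes the source vectors have already been multiplied by the mapping $W$, and it uses exhaustive search with plain Python lists rather than the batched implementation in the muse codebase. The function names `mean_top_k` and `csls` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_top_k(x, others, k):
    """r(x): mean cosine similarity of x to its k nearest vectors in `others`."""
    sims = sorted((cosine(x, o) for o in others), reverse=True)[:k]
    return sum(sims) / len(sims)


def csls(ws, t, mapped_sources, targets, k=10):
    """csls(Ws, t) = 2 cos(Ws, t) - r(Ws) - r(t).

    ws: a mapped source vector W s; t: a target vector.
    r(Ws) is measured against target vectors and r(t) against mapped
    source vectors, so both neighborhoods are cross-lingual.
    """
    return (2 * cosine(ws, t)
            - mean_top_k(ws, targets, k)
            - mean_top_k(t, mapped_sources, k))


mapped_sources = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
print(csls(mapped_sources[0], targets[0], mapped_sources, targets, k=1))
print(csls(mapped_sources[0], targets[1], mapped_sources, targets, k=1))
```

Subtracting the two $r(\cdot)$ terms penalizes vectors that sit in dense regions and are close to many words, which is how csls reduces the influence of hub words during retrieval.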

What if we use modularity instead? To test modularity as a validation metric for muse, we compute modularity on the lexical graph of 10K most frequent words (Mod-10K; we use 10K for consistency with csls on the same words) after each epoch of the adversarial step and the refinement step and select the best mapping.
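The selection step itself is simple, and a minimal sketch makes the difference between the two metrics explicit. It assumes that the "best" mapping under Mod-10K is the one with the lowest modularity, whereas csls-10K selects the highest score. The function `select_mapping` and the toy scores are hypothetical and are not part of muse.

```python
def select_mapping(candidates, validation_metric, lower_is_better=True):
    """Pick the best candidate mapping according to a validation metric.

    candidates: iterable of (epoch, mapping) pairs, one per training epoch.
    validation_metric: callable that scores a mapping.
    Returns the winning (epoch, mapping, score) triple.
    """
    best = None
    for epoch, mapping in candidates:
        score = validation_metric(mapping)
        if best is None:
            best = (epoch, mapping, score)
        elif lower_is_better and score < best[2]:
            best = (epoch, mapping, score)
        elif not lower_is_better and score > best[2]:
            best = (epoch, mapping, score)
    return best


# Toy stand-ins: mapping names with made-up modularity scores per epoch.
toy_scores = {"W_epoch1": 0.52, "W_epoch2": 0.43, "W_epoch3": 0.47}
candidates = [(1, "W_epoch1"), (2, "W_epoch2"), (3, "W_epoch3")]
print(select_mapping(candidates, toy_scores.get))  # lowest modularity wins
```

In practice `validation_metric` would build the lexical graph over the 10K most frequent words under the candidate mapping and return its normalized modularity.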

The important difference between these two metrics is that Mod-10K considers the relative similarities between intra- and cross-lingual neighbors, while csls-10K only considers the similarities of cross-lingual nearest neighbors. (Another difference is that csls-10K uses $k=10$ nearest neighbors, whereas Mod-10K uses $k=3$. However, using $k=3$ for csls-10K leads to worse results; we therefore report only the default setting, $k=10$.)

Experiment Setup

We use the pre-trained fastText vectors (Bojanowski et al., 2017) to be comparable with prior work. Following Artetxe et al. (2018a), all vectors are unit length normalized, mean centered, and then unit length normalized again. We use the test lexicon by Conneau et al. (2018). We run muse ten times for each validation metric, keeping the random seeds and hyperparameters the same across the two metrics. Since muse is unstable on distant language pairs (Artetxe et al., 2018a; Søgaard et al., 2018; Hoshen and Wolf, 2018), we test it on English to languages from diverse language families: Indo-European languages such as Danish (da), German (de), Spanish (es), Farsi (fa), Italian (it), Hindi (hi), Bengali (bn), and non-Indo-European languages such as Finnish (fi), Hungarian (hu), Japanese (ja), Chinese (zh), Korean (ko), Arabic (ar), Indonesian (id), and Vietnamese (vi).

Results

Table 6 shows P@1 on bli for each target language using English as the source language. Mod-10K improves P@1 over the default validation metric in diverse languages, especially on the average P@1 for non-Germanic languages such as ja ( $+18.00\%$ ) and zh ( $+5.74\%$ ), and the best P@1 for ko ( $+1.85\%$ ). These include pairs (en-ja and en-hi) that are difficult for muse (Hoshen and Wolf, 2018). Improvements in ja come from selecting a better mapping during the refinement step, which the default validation misses. For zh, hi, and ko, the improvement comes from selecting better mappings during the adversarial step. However, modularity does not help on every language that Hoshen and Wolf (2018) report as failing (e.g., vi).

6 Analysis: Understanding Modularity as an Evaluation Metric

The experiments so far show that modularity captures whether an embedding is useful, which suggests that modularity could be used as an intrinsic evaluation or validation metric. Here, we investigate whether modularity can capture distinct information compared to existing evaluation measures: qvec-cca (Ammar et al., 2016), csls (Conneau et al., 2018), and cosine similarity between translation pairs (Section 6.1). We also analyze the effect of the number of nearest neighbors $k$ (Section 6.2).

6.1 Ablation Study Using Linear Regression

We fit a linear regression model to predict the classification accuracy given four intrinsic measures: qvec-cca, csls, average cosine similarity of translations, and modularity. We standardize the values of the four features and then ablate each measure in turn, refitting the regression for two target languages (it and da) on the task of cross-lingual document classification (Figure 3). We limit the study to it and da because supersense annotations aligned to the en ones (Miller et al., 1993), which qvec-cca requires, are only available in those languages (Montemagni et al., 2003; Martínez Alonso et al., 2015; Martínez Alonso et al., 2016; Ammar et al., 2016).

Omitting modularity hurts accuracy prediction on cross-lingual document classification substantially, while omitting the other three measures has smaller effects (Figure 5). Thus, modularity complements the other measures and is more predictive of classification accuracy.
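The ablation procedure in this section can be sketched as below. This is an illustrative reconstruction, not the original analysis code: the function names are hypothetical, and using $R^2$ on the fitted data as the measure of how well accuracy is predicted is an assumption.

```python
import numpy as np


def standardize(features):
    """Scale each feature column to zero mean and unit variance."""
    x = np.asarray(features, dtype=np.float64)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero after centering
    return (x - x.mean(axis=0)) / std


def r_squared(features, target):
    """Fit ordinary least squares with an intercept; return R^2 on the fit data."""
    x = np.asarray(features, dtype=np.float64)
    y = np.asarray(target, dtype=np.float64)
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    total = y - y.mean()
    return 1.0 - (residual @ residual) / (total @ total)


def ablation(features, target, names):
    """R^2 with all features, and with each named feature left out in turn."""
    x = standardize(features)
    scores = {"all": r_squared(x, target)}
    for i, name in enumerate(names):
        scores["omit " + name] = r_squared(np.delete(x, i, axis=1), target)
    return scores
```

Here `features` would hold one row per embedding with the four intrinsic measures as columns, and `target` the corresponding classification accuracies. A large drop from `scores["all"]` to `scores["omit modularity"]` corresponds to the pattern reported above.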

6.2 Hyperparameter Sensitivity

While modularity itself does not have any adjustable hyperparameters, our approach to constructing the lexical graph has two hyperparameters: the number of nearest neighbors ( $k$ ) and the number of trees ( $t$ ) for approximating the $k$ -nearest neighbors using random projection trees. We conduct a grid search for $k\in\{1,3,5,10,50,100,150,200\}$ and $t\in\{50,100,150,200,250,300,350,400,450,500\}$ using the German rcv2 corpus as the held-out language to tune hyperparameters.

The number of nearest neighbors $k$ has a much larger effect on modularity than $t$, so we focus on analyzing the effect of $k$, using the optimal $t=450$. Our earlier experiments all use $k=3$ since it gives the highest Pearson’s and Spearman’s correlation on the tuning dataset (Figure 6). The absolute correlation with the downstream task decreases when setting $k>3$, indicating that nearest neighbors beyond $k=3$ only contribute noise.
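The selection of $k$ by correlation can be sketched as follows, assuming modularity has already been computed for every candidate $k$ on the same set of embeddings. The function names are hypothetical, and ranking candidates by the sum of absolute Pearson and Spearman correlations is an assumption; the text only states that $k=3$ is highest on both.

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    v = np.asarray(values, dtype=np.float64)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v), dtype=np.float64)
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def pearson(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.corrcoef(a, b)[0, 1])


def spearman(a, b):
    # Spearman's correlation is Pearson's correlation computed on ranks.
    return pearson(average_ranks(a), average_ranks(b))


def correlation_by_k(modularity_by_k, accuracy):
    """Map each k to (Pearson, Spearman) between modularity and accuracy."""
    return {
        k: (pearson(scores, accuracy), spearman(scores, accuracy))
        for k, scores in modularity_by_k.items()
    }


def best_k(modularity_by_k, accuracy):
    """Pick the k with the largest summed absolute correlation (assumed rule)."""
    table = correlation_by_k(modularity_by_k, accuracy)
    return max(table, key=lambda k: abs(table[k][0]) + abs(table[k][1]))
```

`modularity_by_k` maps each candidate $k$ to a list of modularity scores, one per embedding, aligned with the `accuracy` list of downstream results on the tuning data.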

7 Discussion: What Modularity Can and Cannot Do

This work focuses on modularity as a diagnostic tool: it is cheap and effective at discovering which embeddings are likely to falter on downstream tasks. Thus, practitioners should consider including it as a metric for evaluating the quality of their embeddings. Additionally, we believe that modularity could serve as a useful prior for the algorithms that learn cross-lingual word embeddings: during learning, prefer updates that avoid increasing modularity if all else is equal.

Nevertheless, we recognize limitations of modularity. Consider the following cross-lingual word embedding “algorithm”: for each word, select a random point on the unit hypersphere. This is a horrible distributed representation: the position of a word’s embedding has no relationship to its underlying meaning. Nevertheless, this representation will have very low modularity. Thus, while modularity can identify bad embeddings, once vectors are well mixed, this metric (unlike qvec or qvec-cca) cannot identify whether the meanings make sense. Future work should investigate how to combine techniques that use both word meaning and nearest neighbors for a more robust, semi-supervised cross-lingual evaluation.
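The random-hypersphere thought experiment can be checked with a small simulation. The sketch below makes several assumptions that are not taken from the text: it uses exact cosine nearest neighbors instead of random projection trees, an unweighted undirected graph, and the standard modularity with languages as communities, $Q=\sum_c (e_{cc}-a_c^2)$, divided by its maximum $1-\sum_c a_c^2$. All function names are hypothetical.

```python
import numpy as np


def knn_edges(vectors, k):
    """Undirected, unweighted k-nearest-neighbor graph by cosine similarity."""
    x = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # a word is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros(sim.shape, dtype=bool)
    rows = np.repeat(np.arange(len(x)), k)
    adj[rows, neighbors.ravel()] = True
    return adj | adj.T  # keep an edge if either word selects the other


def normalized_modularity(adj, labels):
    """Modularity with languages as communities, divided by its maximum."""
    a = adj.astype(np.float64)
    two_m = a.sum()  # every undirected edge is counted twice
    labels = np.asarray(labels)
    q, q_max = 0.0, 1.0
    for lang in np.unique(labels):
        members = labels == lang
        e_cc = a[np.ix_(members, members)].sum() / two_m  # within-language share
        a_c = a[members].sum() / two_m  # share of edge ends in this language
        q += e_cc - a_c ** 2
        q_max -= a_c ** 2
    return q / q_max


def random_hypersphere_modularity(n_per_language=200, dim=50, k=3, seed=0):
    """Two 'languages' whose words are random points on the unit hypersphere."""
    rng = np.random.default_rng(seed)
    # Gaussian samples give uniformly distributed directions once normalized.
    points = rng.normal(size=(2 * n_per_language, dim))
    labels = np.array([0] * n_per_language + [1] * n_per_language)
    return normalized_modularity(knn_edges(points, k), labels)
```

Under these assumptions, `random_hypersphere_modularity()` should return a value near zero, since about half of each word's neighbors fall in the other language by chance, even though the vectors carry no meaning at all.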

Acknowledgments

This work was supported by NSF grant IIS-1564275 and by DARPA award HR0011-15-C-0113 under subcontract to Raytheon BBN Technologies. The authors would like to thank Sebastian Ruder, Akiko Aizawa, the members of the CLIP lab at the University of Maryland, the members of the CLEAR lab at the University of Colorado, and the anonymous reviewers for their feedback. The authors would also like to thank Mozhi Zhang for providing the deep averaging network code. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

Bibliography

1. Ammar et al. (2016) Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A Smith. 2016. Massively multilingual word embeddings. Computing Research Repository, arXiv:1602.01925. Version 2.
2. Artetxe et al. (2016) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of Empirical Methods in Natural Language Processing.
3. Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the Association for Computational Linguistics.
4. Artetxe et al. (2018a) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the Association for Computational Linguistics.
5. Artetxe et al. (2018b) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised neural machine translation. In Proceedings of the International Conference on Learning Representations.
6. Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics.
7. Boyd-Graber and Blei (2009) Jordan Boyd-Graber and David M. Blei. 2009. Multilingual topic models for unaligned text. In Proceedings of Uncertainty in Artificial Intelligence.
8. Christodouloupoulos and Steedman (2015) Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Proceedings of the Language Resources and Evaluation Conference.