A Challenge Set Approach to Evaluating Machine Translation

Pierre Isabelle, Colin Cherry, and George Foster

arXiv:1704.07431·cs.CL·August 30, 2017

TL;DR

This paper introduces a challenge set method for evaluating machine translation systems by analyzing their ability to handle specific linguistic divergences, providing detailed insights into their strengths and remaining weaknesses.

Contribution

It presents a novel challenge set approach for detailed error analysis in machine translation, focusing on structural divergences between languages.

Findings

01

Neural machine translation outperforms phrase-based systems on many linguistic phenomena.

02

Certain complex linguistic divergences remain challenging for neural systems.

03

The approach offers a more nuanced evaluation of translation quality.

Abstract

Neural machine translation represents an exciting leap forward in translation quality. But what longstanding weaknesses does it resolve, and which remain? We address these questions with a challenge set approach to translation evaluation and error analysis. A challenge set consists of a small set of sentences, each hand-designed to probe a system's capacity to bridge a particular structural divergence between languages. To exemplify this approach, we present an English-French challenge set, and use it to analyze phrase-based and neural systems. The resulting analysis provides not only a more fine-grained picture of the strengths of neural systems, but also insight into which linguistic phenomena remain out of reach.

Tables

Table 1: Corpus statistics. The WMT12/13 eval sets are used for dev, and the WMT14 eval set is used for test.

corpus	lines	en words	fr words
train	12.1M	304M	348M
mono	15.9M	—	406M
dev	6003	138k	155k
test	3003	71k	81k

Table 2: Summary performance statistics for each system under study, including challenge set success rate grouped by linguistic category (aggregating all positive judgments and dividing by total judgments), as well as BLEU scores on the WMT 2014 test set. The final column gives the proportion of system outputs on which all three annotators agreed.

Divergence type	PBMT-1	PBMT-2	NMT	Google NMT	Agreement
Morpho-syntactic	16%	16%	72%	65%	94%
Lexico-syntactic	42%	46%	52%	62%	94%
Syntactic	33%	33%	40%	75%	81%
Overall	31%	32%	53%	68%	89%
WMT BLEU	34.2	36.5	36.9	—	—
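
The two statistics described in the Table 2 caption can be illustrated with a minimal sketch. The function names and the toy judgments below are hypothetical and are not taken from the paper; the sketch only assumes that each system output receives one boolean judgment per annotator.

```python
from typing import List

def success_rate(judgments: List[List[bool]]) -> float:
    """Pool every annotator judgment and return the share that is positive."""
    pooled = [j for item in judgments for j in item]
    return sum(pooled) / len(pooled)

def full_agreement(judgments: List[List[bool]]) -> float:
    """Share of system outputs on which all annotators gave the same judgment."""
    unanimous = [len(set(item)) == 1 for item in judgments]
    return sum(unanimous) / len(unanimous)

# Hypothetical data: 4 system outputs, 3 annotators each (not from the paper).
toy = [
    [True, True, True],
    [True, True, False],
    [False, False, False],
    [True, False, False],
]

print(success_rate(toy))    # 6 positive judgments out of 12
print(full_agreement(toy))  # 2 unanimous outputs out of 4
```

Note that the success rate pools individual judgments, so an output with a split vote still contributes partially to the score.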

Table 3: Summary of scores by fine-grained categories. “#” reports number of questions in each category, while the reported score is the percentage of questions for which the divergence was correctly bridged. For each question, the three human judgments were transformed into a single judgment by taking system outputs with two positive judgments as positive, and all others as negative.

Category	Subcategory	#	PBMT-1	NMT	Google NMT
Morpho-syntactic	Agreement across distractors	3	0%	100%	100%
	through control verbs	4	25%	25%	25%
	with coordinated target	3	0%	100%	100%
	with coordinated source	12	17%	92%	75%
	of past participles	4	25%	75%	75%
	Subjunctive mood	3	33%	33%	67%
Lexico-syntactic	Argument switch	3	0%	0%	0%
	Double-object verbs	3	33%	67%	100%
	Fail-to	3	67%	100%	67%
	Manner-of-movement verbs	4	0%	0%	0%
	Overlapping subcat frames	5	60%	100%	100%
	NP-to-VP	3	33%	67%	67%
	Factitives	3	0%	33%	67%
	Noun compounds	9	67%	67%	78%
	Common idioms	6	50%	0%	33%
	Syntactically flexible idioms	2	0%	0%	0%
Syntactic	Yes-no question syntax	3	33%	100%	100%
	Tag questions	3	0%	0%	100%
	Stranded preps	6	0%	0%	100%
	Adv-triggered inversion	3	0%	0%	33%
	Middle voice	3	0%	0%	0%
	Fronted should	3	67%	33%	33%
	Clitic pronouns	5	40%	80%	60%
	Ordinal placement	3	100%	100%	100%
	Inalienable possession	6	50%	17%	83%
	Zero REL PRO	3	0%	33%	100%
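
The per-question collapsing step described in the Table 3 caption can be sketched as follows. This is an illustration under an assumption: the caption's "two positive judgments" is read here as "at least two of three", and the names `collapse` and `category_score` are hypothetical.

```python
from typing import List

def collapse(judgments: List[bool], threshold: int = 2) -> bool:
    """Turn three annotator judgments into one (assumption: at least `threshold` positives)."""
    return sum(judgments) >= threshold

def category_score(questions: List[List[bool]]) -> int:
    """Percentage of questions in a category whose collapsed judgment is positive."""
    positives = sum(collapse(q) for q in questions)
    return round(100 * positives / len(questions))

# Hypothetical category with 3 questions (not data from the paper).
toy_category = [
    [True, True, False],   # collapses to positive
    [True, False, False],  # collapses to negative
    [True, True, True],    # collapses to positive
]

print(category_score(toy_category))  # 2 of 3 questions positive -> 67
```

Unlike the pooled success rate of Table 2, this score gives no partial credit for a split vote.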

Morpho-Syntactic
S-V agreement, across distractors
Is subject-verb agreement correct? (Possible interference from distractors between the subject’s head and the verb).
S1a	Source	The repeated calls from his mother should have alerted us.
	Ref	Les appels répétés de sa mère auraient dû nous alerter.
	PBMT-1	Les appels répétés de sa mère aurait dû nous a alertés. ✗
	NMT	Les appels répétés de sa mère devraient nous avoir alertés. ✓
	Google	Les appels répétés de sa mère auraient dû nous alerter. ✓
S1b	Source	The sudden noise in the upper rooms should have alerted us.
	Ref	Le bruit soudain dans les chambres supérieures aurait dû nous alerter.
	PBMT-1	Le bruit soudain dans les chambres supérieures auraient dû nous a alertés. ✗
	NMT	Le bruit soudain dans les chambres supérieures devrait nous avoir alerté. ✓
	Google	Le bruit soudain dans les chambres supérieures devrait nous avoir alerté. ✓
S1c	Source	Their repeated failures to report the problem should have alerted us.
	Ref	Leurs échecs répétés à signaler le problème auraient dû nous alerter.
	PBMT-1	Leurs échecs répétés de signaler le problème aurait dû nous a alertés. ✗
	NMT	Leurs échecs répétés pour signaler le problème devraient nous avoir alertés. ✓
	Google	Leur échec répété à signaler le problème aurait dû nous alerter. ✓
S-V agreement, through control verbs
Does the flagged adjective agree correctly with its subject? (Subject-control versus object-control verbs).
S2a	Source	She asked her brother not to be arrogant.
	Ref	Elle a demandé à son frère de ne pas se montrer arrogant.
	PBMT-1	Elle a demandé à son frère de ne pas être arrogant. ✓
	NMT	Elle a demandé à son frère de ne pas être arrogant. ✓
	Google	Elle a demandé à son frère de ne pas être arrogant. ✓
S2b	Source	She promised her brother not to be arrogant.
	Ref	Elle a promis à son frère de ne pas être arrogante.
	PBMT-1	Elle a promis son frère à ne pas être arrogant. ✗
	NMT	Elle a promis à son frère de ne pas être arrogant. ✗
	Google	Elle a promis à son frère de ne pas être arrogant. ✗
S2c	Source	She promised her doctor to remain active after retiring.
	Ref	Elle a promis à son médecin de demeurer active après s’être retirée.
	PBMT-1	Elle a promis son médecin pour demeurer actif après sa retraite. ✗
	NMT	Elle a promis à son médecin de rester actif après sa retraite. ✗
	Google	Elle a promis à son médecin de rester actif après sa retraite. ✗
S2d	Source	My mother promised my father to be more prudent on the road.
	Ref	Ma mère a promis à mon père d’être plus prudente sur la route.
	PBMT-1	Ma mère, mon père a promis d’être plus prudent sur la route. ✗
	NMT	Ma mère a promis à mon père d’être plus prudent sur la route. ✗
	Google	Ma mère a promis à mon père d’être plus prudent sur la route. ✗

S-V agreement, coordinated targets
Do the marked verbs/adjectives agree correctly with their subject? (Agreement distribution over coordinated predicates)
S3a	Source	The woman was very tall and extremely strong.
	Ref	La femme était très grande et extrêmement forte.
	PBMT-1	La femme était très gentil et extrêmement forte. ✗
	NMT	La femme était très haute et extrêmement forte. ✓
	Google	La femme était très grande et extrêmement forte. ✓
S3b	Source	Their politicians were more ignorant than stupid.
	Ref	Leurs politiciens étaient plus ignorants que stupides.
	PBMT-1	Les politiciens étaient plus ignorants que stupide. ✗
	NMT	Leurs politiciens étaient plus ignorants que stupides. ✓
	Google	Leurs politiciens étaient plus ignorants que stupides. ✓
S3c	Source	We shouted an insult and left abruptly.
	Ref	Nous avons lancé une insulte et nous sommes partis brusquement.
	PBMT-1	Nous avons crié une insulte et a quitté abruptement. ✗
	NMT	Nous avons crié une insulte et nous avons laissé brusquement. ✓
	Google	Nous avons crié une insulte et nous sommes partis brusquement. ✓
S-V agreement, feature calculus on coordinated source
Do the marked verbs/adjectives agree correctly with their subject? (Masculine singular ET masculine singular yields masculine plural).
S4a1	Source	The cat and the dog should be watched.
	Ref	Le chat et le chien devraient être surveillés.
	PBMT-1	Le chat et le chien doit être regardée. ✗
	NMT	Le chat et le chien doivent être regardés. ✓
	Google	Le chat et le chien doivent être surveillés. ✓
S4a2	Source	My father and my brother will be happy tomorrow.
	Ref	Mon père et mon frère seront heureux demain.
	PBMT-1	Mon père et mon frère sera heureux de demain. ✗
	NMT	Mon père et mon frère seront heureux demain. ✓
	Google	Mon père et mon frère seront heureux demain. ✓
S4a3	Source	My book and my pencil could be stolen.
	Ref	Mon livre et mon crayon pourraient être volés.
	PBMT-1	Mon livre et mon crayon pourrait être volé. ✗
	NMT	Mon livre et mon crayon pourraient être volés. ✓
	Google	Mon livre et mon crayon pourraient être volés. ✓
Do the marked verbs/adjectives agree correctly with their subject? (Feminine singular ET feminine singular yields feminine plural).
S4b1	Source	The cow and the hen must be fed.
	Ref	La vache et la poule doivent être nourries.
	PBMT-1	La vache et de la poule doivent être nourris. ✗
	NMT	La vache et la poule doivent être alimentées. ✓
	Google	La vache et la poule doivent être nourries. ✓

S4b2	Source	My mother and my sister will be happy tomorrow.
	Ref	Ma mère et ma sœur seront heureuses demain.
	PBMT-1	Ma mère et ma sœur sera heureux de demain. ✗
	NMT	Ma mère et ma sœur seront heureuses demain. ✓
	Google	Ma mère et ma sœur seront heureuses demain. ✓
S4b3	Source	My shoes and my socks will be found.
	Ref	Mes chaussures et mes chaussettes seront retrouvées.
	PBMT-1	Mes chaussures et mes chaussettes sera trouvé. ✗
	NMT	Mes chaussures et mes chaussettes seront trouvées. ✓
	Google	Mes chaussures et mes chaussettes seront trouvées. ✓
Do the marked verbs/adjectives agree correctly with their subject? (Masculine singular ET feminine singular yields masculine plural.)
S4c1	Source	The dog and the cow are nervous.
	Ref	Le chien et la vache sont nerveux.
	PBMT-1	Le chien et la vache sont nerveux. ✓
	NMT	Le chien et la vache sont nerveux. ✓
	Google	Le chien et la vache sont nerveux. ✓
S4c2	Source	My father and my mother will be happy tomorrow.
	Ref	Mon père et ma mère seront heureux demain.
	PBMT-1	Mon père et ma mère se fera un plaisir de demain. ✗
	NMT	Mon père et ma mère seront heureux demain. ✓
	Google	Mon père et ma mère seront heureux demain. ✓
S4c3	Source	My refrigerator and my kitchen table were stolen.
	Ref	Mon réfrigérateur et ma table de cuisine ont été volés.
	PBMT-1	Mon réfrigérateur et ma table de cuisine ont été volés. ✓
	NMT	Mon réfrigérateur et ma table de cuisine ont été volés. ✓
	Google	Mon réfrigérateur et ma table de cuisine ont été volés. ✓
Do the marked verbs/adjectives agree correctly with their subject? (Smallest coordinated grammatical person wins.)
S4d1	Source	Paul and I could easily be convinced to join you.
	Ref	Paul et moi pourrions facilement être convaincus de se joindre à vous.
	PBMT-1	Paul et je pourrais facilement être persuadée de se joindre à vous. ✗
	NMT	Paul et moi avons facilement pu être convaincus de vous rejoindre. ✓
	Google	Paul et moi pourrait facilement être convaincu de vous rejoindre. ✗
S4d2	Source	You and he could be surprised by her findings.
	Ref	Vous et lui pourriez être surpris par ses découvertes.
	PBMT-1	Vous et qu’il pouvait être surpris par ses conclusions. ✗
	NMT	Vous et lui pourriez être surpris par ses conclusions. ✓
	Google	Vous et lui pourrait être surpris par ses découvertes. ✗

S4d3	Source	We and they are on different courses.
	Ref	Nous et eux sommes sur des trajectoires différentes.
	PBMT-1	Nous et ils sont en cours de différents. ✗
	NMT	Nous et nous sommes sur des parcours différents. ✗
	Google	Nous et ils sont sur des parcours différents. ✗
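
The feature calculus probed by the S4 examples (gender resolution, plural number, and "smallest person wins") can be written down as a small sketch. The `Conjunct` class and `resolve_coordination` function are hypothetical names introduced for illustration, and the gender assigned to pronouns such as "you" is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Union

@dataclass(frozen=True)
class Conjunct:
    gender: str  # "m" or "f"
    person: int  # 1, 2 or 3

def resolve_coordination(conjuncts: Iterable[Conjunct]) -> Dict[str, Union[str, int]]:
    """Agreement features of a coordinated subject, following the S4 rules."""
    items = list(conjuncts)
    # Feminine only if every conjunct is feminine; otherwise masculine.
    gender = "f" if all(c.gender == "f" for c in items) else "m"
    # The smallest grammatical person among the conjuncts wins.
    person = min(c.person for c in items)
    # A coordination of singulars always yields a plural.
    return {"gender": gender, "number": "pl", "person": person}

cat_and_dog = resolve_coordination([Conjunct("m", 3), Conjunct("m", 3)])
cow_and_hen = resolve_coordination([Conjunct("f", 3), Conjunct("f", 3)])
dog_and_cow = resolve_coordination([Conjunct("m", 3), Conjunct("f", 3)])
you_and_he = resolve_coordination([Conjunct("m", 2), Conjunct("m", 3)])  # gender of "you" assumed

print(cat_and_dog, cow_and_hen, dog_and_cow, you_and_he)
```

For instance, `you_and_he` resolves to second person plural, which matches the reference form "pourriez" in S4d2.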
S-V agreement, past participles
Are the agreement marks of the flagged participles the correct ones? (Past participle placed after auxiliary AVOIR agrees with verb object iff object precedes auxiliary. Otherwise participle is in masculine singular form).
S5a	Source	The woman who saw a mouse in the corridor is charming.
	Ref	La femme qui a vu une souris dans le couloir est charmante.
	PBMT-1	La femme qui a vu une souris dans le couloir est charmante. ✓
	NMT	La femme qui a vu une souris dans le couloir est charmante. ✓
	Google	La femme qui a vu une souris dans le couloir est charmante. ✓
S5b	Source	The woman that your brother saw in the corridor is charming.
	Ref	La femme que votre frère a vue dans le couloir est charmante.
	PBMT-1	La femme que ton frère a vu dans le couloir est charmante. ✗
	NMT	La femme que votre frère a vu dans le corridor est charmante. ✗
	Google	La femme que votre frère a vue dans le couloir est charmante. ✓
S5c	Source	The house that John has visited is crumbling.
	Ref	La maison que John a visitée tombe en ruines.
	PBMT-1	La maison que John a visité est en train de s’écrouler. ✗
	NMT	La maison que John a visitée est en train de s’effondrer. ✓
	Google	La maison que John a visité est en ruine. ✗
S5d	Source	John sold the car that he had won in a lottery.
	Ref	John a vendu la voiture qu’il avait gagnée dans une loterie.
	PBMT-1	John a vendu la voiture qu’il avait gagné à la loterie. ✗
	NMT	John a vendu la voiture qu’il avait gagnée dans une loterie. ✓
	Google	John a vendu la voiture qu’il avait gagnée dans une loterie. ✓
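
The agreement rule tested in S5 (a participle after AVOIR agrees with the object only if the object precedes the auxiliary) can be sketched as below. This is a simplified illustration: the function name is hypothetical, and it only handles regular participles whose feminine and plural forms are built by adding "e" and "s".

```python
def participle_form(base: str, obj_gender: str, obj_number: str,
                    object_precedes_aux: bool) -> str:
    """Surface form of a past participle used with the auxiliary AVOIR.

    base: masculine singular participle, e.g. "vu" (regular forms only).
    obj_gender: "m" or "f"; obj_number: "sg" or "pl".
    """
    if not object_precedes_aux:
        # Object follows the auxiliary (or is absent): default masculine singular.
        return base
    suffix = ("e" if obj_gender == "f" else "") + ("s" if obj_number == "pl" else "")
    return base + suffix

# S5a: "La femme qui a vu une souris" (object follows the auxiliary)
print(participle_form("vu", "f", "sg", object_precedes_aux=False))
# S5b: "La femme que votre frère a vue" (object "que" precedes the auxiliary)
print(participle_form("vu", "f", "sg", object_precedes_aux=True))
```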
Subjunctive mood
Is the flagged verb in the correct mood? (Certain triggering verbs, adjectives or subordinate conjunctions, induce the subjunctive mood in the subordinate clause that they govern).
S6a	Source	He will come provided that you come too.
	Ref	Il viendra à condition que vous veniez aussi.
	PBMT-1	Il viendra à condition que vous venez aussi. ✗
	NMT	Il viendra lui aussi que vous le faites. ✗
	Google	Il viendra à condition que vous venez aussi. ✗
S6b	Source	It is unfortunate that he is not coming either.
	Ref	Il est malheureux qu’il ne vienne pas non plus.
	PBMT-1	Il est regrettable qu’il n’est pas non plus à venir. ✗
	NMT	Il est regrettable qu’il ne soit pas non plus. ✗
	Google	Il est malheureux qu’il ne vienne pas non plus. ✓

S6c	Source	I requested that families not be separated.
	Ref	J’ai demandé que les familles ne soient pas séparées.
	PBMT-1	J’ai demandé que les familles ne soient pas séparées. ✓
	NMT	J’ai demandé que les familles ne soient pas séparées. ✓
	Google	J’ai demandé que les familles ne soient pas séparées. ✓
Lexico-Syntactic
Argument switch
Are the experiencer and the object of the “missing” situation correctly preserved in the French translation? (Argument switch).
S7a	Source	Mary sorely misses Jim.
	Ref	Jim manque cruellement à Mary.
	PBMT-1	Marie manque cruellement de Jim. ✗
	NMT	Mary a lamentablement manqué de Jim. ✗
	Google	Mary manque cruellement à Jim. ✗
S7b	Source	My sister is really missing New York.
	Ref	New York manque beaucoup à ma sœur.
	PBMT-1	Ma sœur est vraiment absent de New York. ✗
	NMT	Ma sœur est vraiment manquante à New York. ✗
	Google	Ma sœur manque vraiment New York. ✗
S7c	Source	What he misses most is his dog.
	Ref	Ce qui lui manque le plus, c’est son chien.
	PBMT-1	Ce qu’il manque le plus, c’est son chien. ✗
	NMT	Ce qu’il manque le plus, c’est son chien. ✗
	Google	Ce qu’il manque le plus, c’est son chien. ✗
Double-object verbs
Are “gift” and “recipient” arguments correctly rendered in French? (English double-object constructions)
S8a	Source	John gave his wonderful wife a nice present.
	Ref	John a donné un beau présent à sa merveilleuse épouse.
	PBMT-1	John a donné sa merveilleuse femme un beau cadeau. ✗
	NMT	John a donné à sa merveilleuse femme un beau cadeau. ✓
	Google	John a donné à son épouse merveilleuse un présent gentil. ✓
S8b	Source	John told the kids a nice story.
	Ref	John a raconté une belle histoire aux enfants.
	PBMT-1	John a dit aux enfants une belle histoire. ✓
	NMT	John a dit aux enfants une belle histoire. ✓
	Google	John a raconté aux enfants une belle histoire. ✓
S8c	Source	John sent his mother a nice postcard.
	Ref	John a envoyé une belle carte postale à sa mère.
	PBMT-1	John a envoyé sa mère une carte postale de nice. ✗
	NMT	John a envoyé sa mère une carte postale de nice. ✗
	Google	John envoya à sa mère une belle carte postale. ✓

Fail to
Is the meaning of “fail to” correctly rendered in the French translation?
S9a	Source	John failed to see the relevance of this point.
	Ref	John n’a pas vu la pertinence de ce point.
	PBMT-1	John a omis de voir la pertinence de ce point. ✗
	NMT	John n’a pas vu la pertinence de ce point. ✓
	Google	John a omis de voir la pertinence de ce point. ✗
S9b	Source	He failed to respond.
	Ref	Il n’a pas répondu.
	PBMT-1	Il n’a pas réussi à répondre. ✓
	NMT	Il n’a pas répondu. ✓
	Google	Il n’a pas répondu. ✓
S9c	Source	Those who fail to comply with this requirement will be penalized.
	Ref	Ceux qui ne se conforment pas à cette exigence seront pénalisés.
	PBMT-1	Ceux qui ne se conforment pas à cette obligation seront pénalisés. ✓
	NMT	Ceux qui ne se conforment pas à cette obligation seront pénalisés. ✓
	Google	Ceux qui ne respectent pas cette exigence seront pénalisés. ✓
Manner-of-movement verbs
Is the movement action expressed in the English source correctly rendered in French? (Manner-of-movement verbs with path argument may need to be rephrased in French).
S10a	Source	John would like to swim across the river.
	Ref	John aimerait traverser la rivière à la nage.
	PBMT-1	John aimerait nager dans la rivière. ✗
	NMT	John aimerait nager à travers la rivière. ✗
	Google	John aimerait nager à travers la rivière. ✗
S10b	Source	They ran into the room.
	Ref	Ils sont entrés dans la chambre à la course.
	PBMT-1	Ils ont couru dans la chambre. ✗
	NMT	Ils ont couru dans la pièce. ✗
	Google	Ils coururent dans la pièce. ✗
S10c	Source	The man ran out of the park.
	Ref	L’homme est sorti du parc en courant.
	PBMT-1	L’homme a manqué du parc. ✗
	NMT	L’homme s’enfuit du parc. ✗
	Google	L’homme sortit du parc. ✗
Hard example featuring spontaneous noun-to-verb derivation (“nonce verb”).
S10d	Source	John guitared his way to San Francisco.
	Ref	John s’est rendu jusqu’à San Francisco en jouant de la guitare.
	PBMT-1	John guitared son chemin à San Francisco. ✗
	NMT	John guitared sa route à San Francisco. ✗
	Google	John a guité son chemin à San Francisco. ✗

Overlapping subcat frames
Is the French verb for “know” correctly chosen? (Choice between “savoir”/“connaître” depends on syntactic nature of its object)
S11a	Source	Paul knows that this is a fact.
	Ref	Paul sait que c’est un fait.
	PBMT-1	Paul sait que c’est un fait. ✓
	NMT	Paul sait que c’est un fait. ✓
	Google	Paul sait que c’est un fait. ✓
S11b	Source	Paul knows this story.
	Ref	Paul connaît cette histoire.
	PBMT-1	Paul connaît cette histoire. ✓
	NMT	Paul connaît cette histoire. ✓
	Google	Paul connaît cette histoire. ✓
S11c	Source	Paul knows this story is hard to believe.
	Ref	Paul sait que cette histoire est difficile à croire.
	PBMT-1	Paul connaît cette histoire est difficile à croire. ✗
	NMT	Paul sait que cette histoire est difficile à croire. ✓
	Google	Paul sait que cette histoire est difficile à croire. ✓
S11d	Source	He knows my sister will not take it.
	Ref	Il sait que ma soeur ne le prendra pas.
	PBMT-1	Il sait que ma soeur ne prendra pas. ✓
	NMT	Il sait que ma soeur ne le prendra pas. ✓
	Google	Il sait que ma soeur ne le prendra pas. ✓
S11e	Source	My sister knows your son is reliable.
	Ref	Ma sœur sait que votre fils est fiable.
	PBMT-1	Ma soeur connaît votre fils est fiable. ✗
	NMT	Ma sœur sait que votre fils est fiable. ✓
	Google	Ma sœur sait que votre fils est fiable. ✓
NP to VP
Is the English “NP to VP” complement correctly rendered in the French translation? (Sometimes one needs to translate this structure as a finite clause).
S12a	Source	John believes Bill to be dishonest.
	Ref	John croit que Bill est malhonnête.
	PBMT-1	John estime que le projet de loi soit malhonnête. ✓
	NMT	John croit que le projet de loi est malhonnête. ✓
	Google	John croit que Bill est malhonnête. ✓
S12b	Source	He liked his father to tell him stories.
	Ref	Il aimait que son père lui raconte des histoires.
	PBMT-1	Il aimait son père pour lui raconter des histoires. ✗
	NMT	Il aimait son père pour lui raconter des histoires. ✗
	Google	Il aimait son père à lui raconter des histoires. ✗

S12c	Source	She wanted her mother to let her go.
	Ref	Elle voulait que sa mère la laisse partir.
	PBMT-1	Elle voulait que sa mère de lui laisser aller. ✗
	NMT	Elle voulait que sa mère la laisse faire. ✓
	Google	Elle voulait que sa mère la laisse partir. ✓
Factitives
Is the English verb correctly rendered in the French translation? (Agentive use of some French verbs requires embedding under “faire”).
S13a	Source	John cooked a big chicken.
	Ref	John a fait cuire un gros poulet.
	PBMT-1	John cuit un gros poulet. ✗
	NMT	John cuit un gros poulet. ✗
	Google	John a fait cuire un gros poulet. ✓
S13b	Source	John melted a lot of ice.
	Ref	John a fait fondre beaucoup de glace.
	PBMT-1	John fondu a lot of ice. ✗
	NMT	John a fondu beaucoup de glace. ✗
	Google	John a fondu beaucoup de glace. ✗
S13c	Source	She likes to grow flowers.
	Ref	Elle aime faire pousser des fleurs.
	PBMT-1	Elle aime à se développer des fleurs. ✗
	NMT	Elle aime à cultiver des fleurs. ✓
	Google	Elle aime faire pousser des fleurs. ✓
Noun Compounds
Is the English nominal compound rendered with the right preposition in the French translation?
S14a	Source	Use the meat knife.
	Ref	Utilisez le couteau à viande.
	PBMT-1	Utilisez le couteau de viande. ✗
	NMT	Utilisez le couteau à viande. ✓
	Google	Utilisez le couteau à viande. ✓
S14b	Source	Use the butter knife.
	Ref	Utilisez le couteau à beurre.
	PBMT-1	Utilisez le couteau à beurre. ✓
	NMT	Utilisez le couteau au beurre. ✗
	Google	Utilisez le couteau à beurre. ✓
S14c	Source	Use the steak knife.
	Ref	Utilisez le couteau à steak.
	PBMT-1	Utilisez le steak couteau. ✗
	NMT	Utilisez le couteau à steak. ✓
	Google	Utilisez le couteau de steak. ✗

S14d	Source	Clean the water filter.
	Ref	Nettoyez le filtre à eau.
	PBMT-1	Nettoyez le filtre à eau. ✓
	NMT	Nettoyez le filtre à eau. ✓
	Google	Nettoyez le filtre à eau. ✓
S14e	Source	Clean the juice filter.
	Ref	Nettoyez le filtre à jus.
	PBMT-1	Nettoyez le filtre de jus. ✗
	NMT	Nettoyez le filtre de jus. ✗
	Google	Nettoyez le filtre à jus. ✓
S14f	Source	Clean the tea filter.
	Ref	Nettoyez le filtre à thé.
	PBMT-1	Nettoyez le filtre à thé. ✓
	NMT	Nettoyez le filtre de thé. ✗
	Google	Nettoyez le filtre à thé. ✓
S14g	Source	Clean the cloth filter.
	Ref	Nettoyez le filtre en tissu.
	PBMT-1	Nettoyez le filtre en tissu. ✓
	NMT	Nettoyez le filtre en tissu. ✓
	Google	Nettoyez le filtre en tissu. ✓
S14h	Source	Clean the metal filter.
	Ref	Nettoyez le filtre en métal.
	PBMT-1	Nettoyez le filtre en métal. ✓
	NMT	Nettoyez le filtre en métal. ✓
	Google	Nettoyez le filtre métallique. ✓
S14i	Source	Clean the paper filter.
	Ref	Nettoyez le filtre en papier.
	PBMT-1	Nettoyez le filtre en papier. ✓
	NMT	Nettoyez le filtre en papier. ✓
	Google	Nettoyez le filtre à papier. ✗
Common idioms
Is the English idiomatic expression correctly rendered with a suitable French idiomatic expression?
S15a	Source	Stop beating around the bush.
	Ref	Cessez de tourner autour du pot.
	PBMT-1	Cesser de battre la campagne. ✗
	NMT	Arrêtez de battre autour de la brousse. ✗
	Google	Arrêter de tourner autour du pot. ✓

S15b	Source	You are putting the cart before the horse.
	Ref	Vous mettez la charrue devant les bœufs.
	PBMT-1	Vous pouvez mettre la charrue avant les bœufs. ✓
	NMT	Vous mettez la charrue avant le cheval. ✗
	Google	Vous mettez le chariot devant le cheval. ✗
S15c	Source	His comment proved to be the straw that broke the camel’s back.
	Ref	Son commentaire s’est avéré être la goutte d’eau qui a fait déborder le vase.
	PBMT-1	Son commentaire s’est révélé être la goutte d’eau qui fait déborder le vase. ✓
	NMT	Son commentaire s’est avéré être la paille qui a brisé le dos du chameau. ✗
	Google	Son commentaire s’est avéré être la paille qui a cassé le dos du chameau. ✗
S15d	Source	His argument really hit the nail on the head.
	Ref	Son argument a vraiment fait mouche.
	PBMT-1	Son argument a vraiment mis le doigt dessus. ✓
	NMT	Son argument a vraiment frappé le clou sur la tête. ✗
	Google	Son argument a vraiment frappé le clou sur la tête. ✗
S15e	Source	It’s no use crying over spilt milk.
	Ref	Ce qui est fait est fait.
	PBMT-1	Ce n’est pas de pleurer sur le lait répandu. ✗
	NMT	Il ne sert à rien de pleurer sur le lait haché. ✗
	Google	Ce qui est fait est fait. ✓
S15f	Source	It is no use crying over spilt milk.
	Ref	Ce qui est fait est fait.
	PBMT-1	Il ne suffit pas de pleurer sur le lait répandu. ✗
	NMT	Il ne sert à rien de pleurer sur le lait écrémé. ✗
	Google	Il est inutile de pleurer sur le lait répandu. ✗
Syntactically flexible idioms
Is the English idiomatic expression correctly rendered with a suitable French idiomatic expression?
S16a	Source	The cart has been put before the horse.
	Ref	La charrue a été mise devant les bœufs.
	PBMT-1	On met la charrue devant le cheval. ✗
	NMT	Le chariot a été mis avant le cheval. ✗
	Google	Le chariot a été mis devant le cheval. ✗
S16b	Source	With this argument, the nail has been hit on the head.
	Ref	Avec cet argument, la cause est entendue.
	PBMT-1	Avec cette argument, l’ongle a été frappée à la tête. ✗
	NMT	Avec cet argument, l’ongle a été touché à la tête. ✗
	Google	Avec cet argument, le clou a été frappé sur la tête. ✗

Syntactic
Yes-no question syntax
Is the English question correctly rendered as a French question?
S17a	Source	Have the kids ever watched that movie?
	Ref	Les enfants ont-ils déjà vu ce film?
	PBMT-1	Les enfants jamais regardé ce film? ✗
	NMT	Les enfants ont-ils déjà regardé ce film? ✓
	Google	Les enfants ont-ils déjà regardé ce film? ✓
S17b	Source	Hasn’t your boss denied you a promotion?
	Ref	Votre patron ne vous a-t-il pas refusé une promotion?
	PBMT-1	N’a pas nié votre patron vous un promotion? ✗
	NMT	Est-ce que votre patron vous a refusé une promotion? ✓
	Google	Votre patron ne vous a-t-il pas refusé une promotion? ✓
S17c	Source	Shouldn’t I attend this meeting?
	Ref	Ne devrais-je pas assister à cette réunion?
	PBMT-1	Ne devrais-je pas assister à cette réunion? ✓
	NMT	Est-ce que je ne devrais pas assister à cette réunion? ✓
	Google	Ne devrais-je pas assister à cette réunion? ✓
Tag questions
Is the English “tag question” element correctly rendered in the translation?
S18a	Source	Mary looked really happy tonight, didn’t she?
	Ref	Mary avait l’air vraiment heureuse ce soir, n’est-ce pas?
	PBMT-1	Marie a regardé vraiment heureux de ce soir, n’est-ce pas elle? ✗
	NMT	Mary s’est montrée vraiment heureuse ce soir, ne l’a pas fait? ✗
	Google	Mary avait l’air vraiment heureuse ce soir, n’est-ce pas? ✓
S18b	Source	We should not do that again, should we?
	Ref	Nous ne devrions pas refaire cela, n’est-ce pas?
	PBMT-1	Nous ne devrions pas faire qu’une fois encore, faut-il? ✗
	NMT	Nous ne devrions pas le faire encore, si nous? ✗
	Google	Nous ne devrions pas recommencer, n’est-ce pas? ✓
S18c	Source	She was perfect tonight, was she not?
	Ref	Elle était parfaite ce soir, n’est-ce pas?
	PBMT-1	Elle était parfait ce soir, elle n’était pas? ✗
	NMT	Elle était parfaite ce soir, n’était-elle pas? ✗
	Google	Elle était parfaite ce soir, n’est-ce pas? ✓
WH-MVT and stranded preps
Is the dangling preposition of the English sentence correctly placed in the French translation?
S19a	Source	The guy that she is going out with is handsome.
	Ref	Le type avec qui elle sort est beau.
	PBMT-1	Le mec qu’elle va sortir avec est beau. ✗
	NMT	Le mec qu’elle sort avec est beau. ✗
	Google	Le mec avec qui elle sort est beau. ✓

S19b	Source	Whom is she going out with these days?
	Ref	Avec qui sort-elle ces jours-ci?
	PBMT-1	Qu’est-ce qu’elle allait sortir avec ces jours? ✗
	NMT	À qui s’adresse ces jours-ci? ✗
	Google	Avec qui sort-elle de nos jours? ✓
S19c	Source	The girl that he has been talking about is smart.
	Ref	La fille dont il a parlé est brillante.
	PBMT-1	La jeune fille qu’il a parlé est intelligent. ✗
	NMT	La fille qu’il a parlé est intelligente. ✗
	Google	La fille dont il a parlé est intelligente. ✓
S19d	Source	Who was he talking to when you left?
	Ref	À qui parlait-il au moment où tu es parti?
	PBMT-1	Qui est lui parler quand vous avez quitté? ✗
	NMT	Qui a-t-il parlé à quand vous avez quitté? ✗
	Google	Avec qui il parlait quand vous êtes parti? ✓
S19e	Source	The city that he is arriving from is dangerous.
	Ref	La ville d’où il arrive est dangereuse.
	PBMT-1	La ville qu’il est arrivé de est dangereuse. ✗
	NMT	La ville qu’il est en train d’arriver est dangereuse. ✗
	Google	La ville d’où il vient est dangereuse. ✓
S19f	Source	Where is he arriving from?
	Ref	D’où arrive-t-il?
	PBMT-1	Où est-il arrivé? ✗
	NMT	De quoi s’agit-il? ✗
	Google	D’où vient-il? ✓
Adverb-triggered inversion
Is the adverb-triggered subject-verb inversion in the English sentence correctly rendered in the French translation?
S20a	Source	Rarely did the dog run.
	Ref	Rarement le chien courait-il.
	PBMT-1	Rarement le chien courir. ✗
	NMT	Il est rare que le chien marche. ✗
	Google	Rarement le chien courir. ✗
S20b	Source	Never before had she been so unhappy.
	Ref	Jamais encore n’avait-elle été aussi malheureuse.
	PBMT-1	Jamais auparavant, si elle avait été si malheureux. ✗
	NMT	Jamais auparavant n’avait été si malheureuse. ✗
	Google	Jamais elle n’avait été aussi malheureuse. ✓

S20c	Source	Nowhere were the birds so colorful.
	Ref	Nulle part les oiseaux n’étaient si colorés.
	PBMT-1	Nulle part les oiseaux de façon colorée. ✗
	NMT	Les oiseaux ne sont pas si colorés. ✗
	Google	Nulle part les oiseaux étaient si colorés. ✗
Middle voice
Is the generic statement made in the English sentence correctly and naturally rendered in the French translation?
S21a	Source	Soup is eaten with a large spoon.
	Ref	La soupe se mange avec une grande cuillère
	PBMT-1	La soupe est mangé avec une grande cuillère. ✗
	NMT	La soupe est consommée avec une grosse cuillère. ✗
	Google	La soupe est consommée avec une grande cuillère. ✗
S21b	Source	Masonry is cut using a diamond blade.
	Ref	La maçonnerie se coupe avec une lame à diamant.
	PBMT-1	La maçonnerie est coupé à l’aide d’une lame de diamant. ✗
	NMT	La maçonnerie est coupée à l’aide d’une lame de diamant. ✗
	Google	La maçonnerie est coupée à l’aide d’une lame de diamant. ✗
S21c	Source	Champagne is drunk in a glass called a flute.
	Ref	Le champagne se boit dans un verre appelé flûte.
	PBMT-1	Le champagne est ivre dans un verre appelé une flûte. ✗
	NMT	Le champagne est ivre dans un verre appelé flûte. ✗
	Google	Le Champagne est bu dans un verre appelé flûte. ✗
Fronted “should”
Fronted “should” is interpreted as a conditional subordinator. It is normally translated as “si” with imperfect tense.
S22a	Source	Should Paul leave, I would be sad.
	Ref	Si Paul devait s’en aller, je serais triste.
	PBMT-1	Si le congé de Paul, je serais triste. ✗
	NMT	Si Paul quitte, je serais triste. ✗
	Google	Si Paul s’en allait, je serais triste. ✓
S22b	Source	Should he become president, she would be promoted immediately.
	Ref	S’il devait devenir président, elle recevrait immédiatement une promotion.
	PBMT-1	S’il devait devenir président, elle serait encouragée immédiatement. ✓
	NMT	S’il devait devenir président, elle serait immédiatement promue. ✓
	Google	Devrait-il devenir président, elle serait immédiatement promue. ✗
S22c	Source	Should he fall, he would get up again immediately.
	Ref	S’il venait à tomber, il se relèverait immédiatement.
	PBMT-1	S’il devait tomber, il allait se lever immédiatement de nouveau. ✓
	NMT	S’il tombe, il serait de nouveau immédiatement. ✗
	Google	S’il tombe, il se lèvera immédiatement. ✗

Clitic pronouns
Are the English pronouns correctly rendered in the French translations?
S23a	Source	She had a lot of money but he did not have any.
	Ref	Elle avait beaucoup d’argent mais il n’en avait pas.
	PBMT-1	Elle avait beaucoup d’argent mais il n’en avait pas. ✓
	NMT	Elle avait beaucoup d’argent, mais il n’a pas eu d’argent. ✓
	Google	Elle avait beaucoup d’argent mais il n’en avait pas. ✓
S23b	Source	He did not talk to them very often.
	Ref	Il ne leur parlait pas très souvent.
	PBMT-1	Il n’a pas leur parler très souvent. ✗
	NMT	Il ne leur a pas parlé très souvent. ✓
	Google	Il ne leur parlait pas très souvent. ✓
S23c	Source	The men are watching each other.
	Ref	Les hommes se surveillent l’un l’autre
	PBMT-1	Les hommes se regardent les uns les autres. ✓
	NMT	Les hommes se regardent les uns les autres. ✓
	Google	Les hommes se regardent. ✗
S23d	Source	He gave it to the man.
	Ref	Il le donna à l’homme.
	PBMT-1	Il a donné à l’homme. ✗
	NMT	Il l’a donné à l’homme. ✓
	Google	Il le donna à l’homme. ✓
S23e	Source	He did not give it to her.
	Ref	Il ne le lui a pas donné.
	PBMT-1	Il ne lui donner. ✗
	NMT	Il ne l’a pas donné à elle. ✗
	Google	Il ne lui a pas donné. ✗
Ordinal placement
Is the relative order of the ordinals and numerals correct in the French translation?
S24a	Source	The first four men were exhausted.
	Ref	Les quatre premiers hommes étaient tous épuisés.
	PBMT-1	Les quatre premiers hommes étaient épuisés. ✓
	NMT	Les quatre premiers hommes ont été épuisés. ✓
	Google	Les quatre premiers hommes étaient épuisés. ✓
S24b	Source	The last three candidates were eliminated.
	Ref	Les trois derniers candidats ont été éliminés.
	PBMT-1	Les trois derniers candidats ont été éliminés. ✓
	NMT	Les trois derniers candidats ont été éliminés. ✓
	Google	Les trois derniers candidats ont été éliminés. ✓

S24c	Source	The other two guys left without paying.
	Ref	Les deux autres types sont partis sans payer.
	PBMT-1	Les deux autres mecs ont laissé sans payer. ✓
	NMT	Les deux autres gars à gauche sans payer. ✓
	Google	Les deux autres gars sont partis sans payer. ✓
Inalienable possession
Is the French translation correct and natural both in: a) its use of a particular determiner on the body part noun; and b) the presence or absence of a reflexive pronoun before the verb?
S25a	Source	He washed his hands.
	Ref	Il s’est lavé les mains.
	PBMT-1	Il se lavait les mains. ✓
	NMT	Il a lavé ses mains. ✗
	Google	Il se lava les mains. ✓
S25b	Source	I brushed my teeth.
	Ref	Je me suis brossé les dents.
	PBMT-1	J’ai brossé mes dents. ✗
	NMT	J’ai brossé mes dents. ✗
	Google	Je me suis brossé les dents. ✓
S25c	Source	You brushed your teeth.
	Ref	Tu t’es brossé les dents
	PBMT-1	Vous avez brossé vos dents. ✗
	NMT	vous avez brossé vos dents. ✗
	Google	Tu as brossé les dents. ✗
S25d	Source	I raised my hand.
	Ref	J’ai levé la main.
	PBMT-1	J’ai levé la main. ✓
	NMT	J’ai soulevé ma main. ✗
	Google	Je levai la main. ✓
S25e	Source	He turned his head.
	Ref	Il a tourné la tête.
	PBMT-1	Il a transformé sa tête. ✗
	NMT	Il a tourné sa tête. ✗
	Google	Il tourna la tête. ✓
S25f	Source	He raised his eyes to heaven.
	Ref	Il leva les yeux au ciel.
	PBMT-1	Il a évoqué les yeux au ciel. ✓
	NMT	Il a levé les yeux sur le ciel. ✓
	Google	Il leva les yeux au ciel. ✓

Zero REL PRO
Is the English zero relative pronoun correctly translated as a non-zero one in the French translation?
S26a	Source	The strangers the woman saw were working.
	Ref	Les inconnus que la femme vit travaillaient.
	PBMT-1	Les étrangers la femme vit travaillaient. ✗
	NMT	Les inconnus de la femme ont travaillé. ✗
	Google	Les étrangers que la femme vit travaillaient. ✓
S26b	Source	The man your sister hates is evil.
	Ref	L’homme que votre sœur déteste est méchant.
	PBMT-1	L’homme ta soeur hait est le mal. ✗
	NMT	L’homme que ta soeur est le mal est le mal. ✓
	Google	L’homme que votre sœur hait est méchant. ✓
S26c	Source	The girl my friend was talking about is gone.
	Ref	La fille dont mon ami parlait est partie.
	PBMT-1	La jeune fille mon ami a parlé a disparu. ✗
	NMT	La petite fille de mon ami était révolue. ✗
	Google	La fille dont mon ami parlait est partie. ✓

Full text

A Challenge Set Approach to Evaluating Machine Translation

Pierre Isabelle and Colin Cherry

National Research Council Canada

George Foster

Google (work performed while at NRC)

Abstract

Neural machine translation represents an exciting leap forward in translation quality. But what longstanding weaknesses does it resolve, and which remain? We address these questions with a challenge set approach to translation evaluation and error analysis. A challenge set consists of a small set of sentences, each hand-designed to probe a system’s capacity to bridge a particular structural divergence between languages. To exemplify this approach, we present an English–French challenge set, and use it to analyze phrase-based and neural systems. The resulting analysis provides not only a more fine-grained picture of the strengths of neural systems, but also insight into which linguistic phenomena remain out of reach.

1 Introduction

The advent of neural techniques in machine translation (MT) Kalchbrenner and Blunsom (2013); Cho et al. (2014); Sutskever et al. (2014) has led to profound improvements in MT quality. For “easy” language pairs such as English/French or English/Spanish in particular, neural (NMT) systems are much closer to human performance than previous statistical techniques Wu et al. (2016). This puts pressure on automatic evaluation metrics such as BLEU Papineni et al. (2002), which exploit surface-matching heuristics that are relatively insensitive to subtle differences. As NMT continues to improve, these metrics will inevitably lose their effectiveness. Another challenge posed by NMT systems is their opacity: while it was usually clear which phenomena were ill-handled by previous statistical systems—and why—these questions are more difficult to answer for NMT.

We propose a new evaluation methodology centered around a challenge set of difficult examples that are designed using expert linguistic knowledge to probe an MT system’s capabilities. This methodology is complementary to the standard practice of randomly selecting a test set from “real text,” which remains necessary in order to predict performance on new text. By concentrating on difficult examples, a challenge set is intended to provide a stronger signal to developers. Although we believe that the general approach is compatible with automatic metrics, we used manual evaluation for the work presented here. Our challenge set consists of short sentences that each focus on one particular phenomenon, which makes it easy to collect reliable manual assessments of MT output by asking direct yes-no questions. An example is shown in Figure 1.

We generated a challenge set for English to French translation by canvassing areas of linguistic divergence between the two languages, especially those where errors would be made visible by French morphology. Example choice was also partly motivated by extensive knowledge of the weaknesses of phrase-based MT (PBMT). Neither of these characteristics is essential to our method, however, which we envisage evolving as NMT progresses. We used our challenge set to evaluate in-house PBMT and NMT systems as well as Google’s GNMT system.

In addition to proposing the novel idea of a challenge set evaluation, our contribution includes our annotated English–French challenge set, which we provide in both formatted text and machine-readable formats (see supplemental materials). We also supply further evidence that NMT is systematically better than PBMT, even when BLEU score differences are small. Finally, we give an analysis of the challenges that remain to be solved in NMT, an area that has received little attention thus far.

2 Related Work

A number of recent papers have evaluated NMT using broad performance metrics. The WMT 2016 News Translation Task Bojar et al. (2016) evaluated submitted systems according to both BLEU and human judgments. NMT systems were submitted to 9 of the 12 translation directions, winning 4 of these and tying for first or second in the other 5, according to the official human ranking. Since then, controlled comparisons have used BLEU to show that NMT outperforms strong PBMT systems on 30 translation directions from the United Nations Parallel Corpus Junczys-Dowmunt et al. (2016a), and on the IWSLT English-Arabic tasks Durrani et al. (2016). These evaluations indicate that NMT performs better on average than previous technologies, but they do not help us understand what aspects of the translation have improved.

Some groups have conducted more detailed error analyses. Bentivogli et al. (2016) carried out a number of experiments on IWSLT 2015 English-German evaluation data, where they compare machine outputs to professional post-edits in order to automatically detect a number of error categories. Compared to PBMT, NMT required less post-editing effort overall, with substantial improvements in lexical, morphological and word order errors. NMT consistently out-performed PBMT, but its performance degraded faster as sentence length increased. Later, Toral and Sánchez-Cartagena (2017) conducted a similar study, examining the outputs of competition-grade systems for the 9 WMT 2016 directions that included NMT competitors. They reached similar conclusions regarding morphological inflection and word order, but found an even greater degradation in NMT performance as sentence length increased, perhaps due to these systems’ use of subword units.

Most recently, Sennrich (2016) proposed an approach to perform targeted evaluations of NMT through the use of contrastive translation pairs. This method introduces a particular type of error automatically in reference sentences, and then checks whether the NMT system’s conditional probability model prefers the original reference or the corrupted version. Using this technique, they are able to determine that a recently-proposed character-based model improves generalization on unseen words, but at the cost of introducing new grammatical errors.

Our approach differs from these studies in a number of ways. First, whereas others have analyzed sentences drawn from an existing bitext, we conduct our study on sentences that are manually constructed to exhibit canonical examples of specific linguistic phenomena. We focus on phenomena that we expect to be more difficult than average, resulting in a particularly challenging MT test suite King and Falkedal (1990). These sentences are designed to dive deep into linguistic phenomena of interest, and to provide a much finer-grained analysis of the strengths and weaknesses of existing technologies, including NMT systems.

However, this strategy also necessitates that we work on fewer sentences. We leverage the small size of our challenge set to manually evaluate whether the system’s actual output correctly handles our phenomena of interest. Manual evaluation side-steps some of the pitfalls that can come with Sennrich (2016)’s contrastive pairs, as a ranking of two contrastive sentences may not necessarily reflect whether the error in question will occur in the system’s actual output.

3 Challenge Set Evaluation

Our challenge set is meant to measure the ability of MT systems to deal with some of the more difficult problems that arise in translating English into French. This particular language pair happened to be most convenient for us, but similar sets can be built for any language pair.

One aspect of MT performance excluded from our evaluation is robustness to sparse data. To control for this, when crafting source and reference sentences, we chose words that occurred at least 100 times in our training corpus (section 4.1). [Footnote 1: With two exceptions: spilt (58 occurrences), which is part of an idiomatic phrase, and guitared (0 occurrences), which is meant to test the ability to deal with “nonce words” as discussed in section 5.]
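The frequency control described above can be sketched as follows. This is a minimal illustration rather than the authors' tooling: the function names, the whitespace tokenization, and the toy corpus are assumptions, and only the threshold of 100 occurrences comes from the text.

```python
from collections import Counter

MIN_COUNT = 100  # threshold stated in the text


def count_words(lines):
    """Count lowercased, whitespace-separated tokens (tokenization is an assumption)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def rare_words(sentence, counts, min_count=MIN_COUNT):
    """Return the words of `sentence` seen fewer than `min_count` times in training."""
    return [w for w in sentence.lower().split() if counts[w] < min_count]


# Toy stand-in for the training corpus (hypothetical data).
corpus = ["the river is wide"] * 120 + ["terry swam across the river"] * 5
counts = count_words(corpus)
flagged = rare_words("terry swam across the river", counts)
```

A candidate challenge sentence would be accepted only when `rare_words` returns an empty list, apart from deliberate exceptions such as the two noted in the footnote.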

The challenging aspect of the test set we are presenting stems from the fact that the source English sentences have been chosen so that their closest French equivalent will be structurally divergent from the source in some crucial way. Translational divergences have been extensively studied in the past—see for example Vinay and Darbelnet (1958); Dorr (1994). We expect the level of difficulty of an MT test set to correlate well with its density in divergence phenomena, which we classify into three main types: morpho-syntactic, lexico-syntactic and purely syntactic divergences.

3.1 Morpho-syntactic divergences

In some languages, word morphology (e.g. inflections) carries more grammatical information than in others. When translating a word towards the richer language, there is a need to recover additional grammatically-relevant information from the context of the target language word. Note that we only include in our set cases where the relevant information is available in the linguistic context. [Footnote 2: The so-called Winograd Schema Challenges (en.wikipedia.org/wiki/Winograd_Schema_Challenge) often involve cases where common-sense reasoning is required to correctly choose between two potential antecedent phrases for a pronoun. Such cases become En $\rightarrow$ Fr translation challenges if the relevant English pronoun is they and its alternative antecedents happen to have different grammatical genders in French: they $\rightarrow$ ils/elles.]

One particularly important case of morpho-syntactic divergence is that of subject–verb agreement. French verbs typically have more than 30 different inflected forms, while English verbs typically have 4 or 5. As a result, English verb forms strongly underspecify their French counterparts. Much of the missing information must be filled in through forced agreement in person, number and gender with the grammatical subject of the verb. But extracting these parameters can prove difficult. For example, the agreement features of a coordinated noun phrase are a complex function of the coordinated elements: a) the gender is feminine if all conjuncts are feminine, otherwise masculine wins; b) the conjunct with the smallest person (p1 $<$ p2 $<$ p3) wins; and c) the number is always plural when the coordination is “et” (“and”) but the case is more complex with “ou” (“or”).
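Rules (a) to (c) for coordinated subjects can be written down as a small function. The sketch below is only an illustration of those rules for coordination with “et”; the `Features` class and the function name are hypothetical, and the more complex “ou” case is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    gender: str  # "m" or "f"
    person: int  # 1, 2 or 3
    number: str  # "sg" or "pl"


def coordinate_et(conjuncts):
    """Agreement features of noun phrases joined by "et", following rules a-c."""
    if len(conjuncts) < 2:
        raise ValueError("a coordination needs at least two conjuncts")
    # (a) feminine only if every conjunct is feminine, otherwise masculine wins
    gender = "f" if all(c.gender == "f" for c in conjuncts) else "m"
    # (b) the smallest person wins (p1 < p2 < p3)
    person = min(c.person for c in conjuncts)
    # (c) coordination with "et" is always plural
    return Features(gender=gender, person=person, number="pl")


# "Marie et moi": third person feminine plus first person masculine
# gives first person plural masculine ("nous").
marie = Features(gender="f", person=3, number="sg")
moi = Features(gender="m", person=1, number="sg")
result = coordinate_et([marie, moi])
```

The verb and any agreeing participle or adjective would then be inflected with `result` rather than with the features of either conjunct alone.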

A second example of morpho-syntactic divergence between English and French is the more explicit marking of the subjunctive mood in French subordinate clauses. In the following example, the verb “partiez”, unlike its English counterpart, is marked as subjunctive:

He demanded that you leave immediately. $\rightarrow$ Il a exigé que vous partiez immédiatement.

When translating an English verb within a subordinate clause, the context must be examined for possible subjunctive triggers. Typically these are specific lexical items found in a governing position with respect to the subordinate clause: verbs such as “exiger que”, adjectives such as “regrettable que” or subordinate conjunctions such as “à condition que”.

3.2 Lexico-syntactic divergences

Syntactically governing words such as verbs tend to impose specific requirements on their complements: they subcategorize for complements of a certain syntactic type. But a source language governor and its target language counterpart can diverge on their respective requirements. The translation of such words must then trigger adjustments in the target language complement pattern. We can only examine here a few of the types instantiated in our challenge set.

A good example is argument switching. This refers to the situation where the translation of a source verb Vs as Vt is correct but only provided the arguments (usually the subject and the object) are flipped around. The translation of “to miss” as “manquer à” is such a case:

John misses Mary $\rightarrow$ Mary manque à John.

Failing to perform the switch results in a severe case of mistranslation.

A second example of lexico-syntactic divergence is that of “crossing movement” verbs. Consider the following example:

Terry swam across the river $\rightarrow$ Terry a traversé la rivière à la nage.

The French translation could be glossed as, “Terry crossed the river by swimming.” A literal translation such as “Terry a nagé à travers la rivière,” is ruled out.

3.3 Syntactic divergences

Some syntactic divergences are not relative to the presence of a particular lexical item but rather stem from differences in the set of available syntactic patterns. Source-language instances of structures missing from the target language must be mapped onto equivalent structures. Here are some of the types appearing in our challenge set.

The position of French pronouns is a major case of divergence from English. French is basically an SVO language like English but it departs from that canonical order when post-verbal complements are pronominalized: the pronouns must then be rendered as proclitics, that is phonetically attached to the verb on its left side.

He gave Mary a book. $\rightarrow$ Il a donné un livre à Marie.

He gave$_i$ it$_j$ to her$_k$. $\rightarrow$ Il le$_j$ lui$_k$ a donné$_i$.

Another example of syntactic divergence between English and French is that of stranded prepositions. In both languages, an operation known as “WH-movement” will move a relativized or questioned element to the front of the clause containing it. When this element happens to be a prepositional phrase, English offers the option to leave the preposition in its normal place, fronting only its pronominalized object. In French, the preposition is always fronted alongside its object:

The girl whom$_i$ he was dancing with$_j$ is rich. $\rightarrow$ La fille avec$_j$ qui$_i$ il dansait est riche.

A final example of syntactic divergence is the use of the so-called middle voice. While English uses the passive voice in agentless generic statements, French tends to prefer the use of a special pronominal construction where the pronoun “se” has no real referent:

Caviar is eaten with bread. $\rightarrow$ Le caviar se mange avec du pain.

This completes our exemplification of morpho-syntactic, lexico-syntactic and purely syntactic divergences. Our actual test set includes several more subcategories of each type. The ability of MT systems to deal with each such subcategory is then tested using at least three different test sentences. We use short test sentences so as to keep the targeted divergence in focus. The 108 sentences that constitute our current challenge set can be found in Appendix B.

3.4 Evaluation Methodology

Given the very small size of our challenge set, it is easy to perform a human evaluation of the respective outputs of a handful of different systems. The obvious advantage is that the assessment is then absolute instead of relative to one or a few reference translations.

The intent of each challenge sentence is to test one and only one system capability, namely that of coping correctly with the particular associated divergence subtype. As illustrated in Figure 1, we provide annotators with a question that specifies the divergence phenomenon currently being tested, along with a reference translation with the areas of divergence highlighted. As a result, judgments become straightforward: was the targeted divergence correctly bridged, yes or no? [Footnote 3: Sometimes the system produces a translation that circumvents the divergence issue. For example, it may dodge a divergence involving adverbs by reformulating the translation to use an adjective instead. In these rare cases, we instruct our annotators to abstain from making a judgment, regardless of whether the translation is correct or not.] There is no need to mentally average over a number of different aspects of the test sentence as one does when rating the global translation quality of a sentence, e.g. on a 5-point scale. However, we acknowledge that measuring translation performance on complex sentences exhibiting many different phenomena remains crucial. We see our approach as being complementary to evaluations of overall translation quality.

One consequence of our divergence-focused approach is that faulty translations will be judged as successes when the faults lie outside of the targeted divergence zone. However, this problem is mitigated by our use of short test sentences.

4 Machine Translation Systems

We trained state-of-the-art neural and phrase-based systems for English-French translation on data from the WMT 2014 evaluation.

4.1 Data

We used the LIUM shared-task subset of the WMT 2014 corpora [Footnote 4: http://www.statmt.org/wmt14/translation-task.html and http://www-lium.univ-lemans.fr/~schwenk/nnmt-shared-task], retaining the provided tokenization and corpus organization, but mapping characters to lowercase. Table 1 gives corpus statistics.

4.2 Phrase-based systems

To ensure a competitive PBMT baseline, we performed phrase extraction using both IBM4 and HMM alignments with a phrase-length limit of 7; after frequency pruning, the resulting phrase table contained 516M entries. For each extracted phrase pair, we collected statistics for the hierarchical reordering model of Galley and Manning (2008).

We trained an NNJM model Devlin et al. (2014) on the HMM-aligned training corpus, with input and output vocabulary sizes of 64k and 32k. Words not in the vocabulary were mapped to one of 100 mkcls classes. We trained for 60 epochs of 20k $\times$ 128 minibatches, yielding a final dev-set perplexity of 6.88.

Our set of log-linear features consisted of forward and backward Kneser-Ney smoothed phrase probabilities and HMM lexical probabilities (4 features); hierarchical reordering probabilities (6); the NNJM probability (1); a set of sparse features as described by Cherry (2013) (10,386); word-count and distortion penalties (2); and 5-gram language models trained on the French half of the training corpus and the French monolingual corpus (2). Tuning was carried out using batch lattice MIRA Cherry and Foster (2012). Decoding used the cube-pruning algorithm of Huang and Chiang (2007), with a distortion limit of 7.

We include two phrase-based systems in our comparison: PBMT-1 has data conditions that exactly match those of the NMT system, in that it does not use the language model trained on the French monolingual corpus, while PBMT-2 uses both language models.

4.3 Neural systems

To build our NMT system, we used the Nematus toolkit [Footnote 5: https://github.com/rsennrich/nematus], which implements a single-layer neural sequence-to-sequence architecture with attention Bahdanau et al. (2015) and gated recurrent units Cho et al. (2014). We used 512-dimensional word embeddings with source and target vocabulary sizes of 90k, and 1024-dimensional state vectors. The model contains 172M parameters.

We preprocessed the data using a BPE model learned from source and target corpora Sennrich et al. (2016). Sentences longer than 50 words were discarded. Training used the Adadelta algorithm Zeiler (2012), with a minibatch size of 100 and gradients clipped to 1.0. It ran for 5 epochs, writing a checkpoint model every 30k minibatches. Following Junczys-Dowmunt et al. (2016b), we averaged the parameters from the last 8 checkpoints. To decode, we used the AmuNMT decoder Junczys-Dowmunt et al. (2016a) with a beam size of 4.
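The checkpoint-averaging step can be illustrated with a small sketch. It assumes each checkpoint is a plain dictionary mapping parameter names to flat lists of floats; this representation and the function name are hypothetical and do not reflect the Nematus file format. Only the choice of the last 8 checkpoints comes from the text.

```python
def average_checkpoints(checkpoints, last_n=8):
    """Element-wise mean of the parameters in the last `last_n` checkpoints.

    Each checkpoint is assumed to be a dict: parameter name -> list of floats,
    with identical names and shapes across checkpoints.
    """
    selected = checkpoints[-last_n:]
    if not selected:
        raise ValueError("no checkpoints to average")
    averaged = {}
    for name in selected[0]:
        # Group the i-th value of this parameter across all selected checkpoints.
        columns = zip(*(ckpt[name] for ckpt in selected))
        averaged[name] = [sum(col) / len(selected) for col in columns]
    return averaged


# Ten toy checkpoints; only the last eight (i = 2..9) contribute to the mean.
toy_checkpoints = [{"w": [float(i), 2.0 * i]} for i in range(10)]
averaged = average_checkpoints(toy_checkpoints)
```

The averaged parameters are then used as a single model at decoding time.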

While our primary results will focus on the above PBMT and NMT systems, where we can describe replicable configurations, we have also evaluated Google’s production system [Footnote 6: https://translate.google.com], which has recently moved to NMT Wu et al. (2016). Notably, the “GNMT” system uses (at least) 8 encoder and 8 decoder layers, compared to our 1 layer for each, and it is trained on corpora that are “two to three decimal orders of magnitudes bigger than the WMT.” The evaluated outputs were downloaded in December 2016.

5 Experiments

The 108-sentence English–French challenge set presented in Appendix B was submitted to the four MT systems described in section 4: PBMT-1, PBMT-2, NMT, and GNMT. Three bilingual native speakers of French rated each translated sentence as either a success or a failure according to the protocol described in section 3.4. For example, the 26 sentences of the subcategories S1–S5 of Appendix B are all about different cases of subject-verb agreement. The corresponding translations were judged successful if and only if the translated verb correctly agrees with the translated subject.

The different system outputs for each source sentence were grouped together to reduce the burden on the annotators. That is, in Figure 1, annotators were asked to answer the question for each of four outputs, rather than just one as shown. The outputs were listed in random order, without identification. Questions were also presented in random order to each annotator. Appendix A in the supplemental materials contains the instructions shown to the annotators.
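The blind, randomized presentation described above could be prepared along the following lines. This is a sketch under stated assumptions: the item layout, the function name, and the separate answer key are hypothetical, and the example lists only the three systems whose outputs appear in Appendix B.

```python
import random


def build_annotation_sheet(items, seed=None):
    """Shuffle questions and, within each question, the unlabeled system outputs.

    Returns the sheet shown to one annotator and a key that maps each item id
    to the system order used, so judgments can be attributed afterwards.
    """
    rng = random.Random(seed)
    sheet, key = [], {}
    for item in rng.sample(items, k=len(items)):  # questions in random order
        systems = list(item["outputs"])
        rng.shuffle(systems)                      # outputs in random order
        key[item["id"]] = systems                 # kept away from the annotator
        sheet.append({
            "id": item["id"],
            "question": item["question"],
            "source": item["source"],
            "outputs": [item["outputs"][s] for s in systems],  # no system names
        })
    return sheet, key


items = [
    {"id": "S23d",
     "question": "Are the English pronouns correctly rendered in the French translations?",
     "source": "He gave it to the man.",
     "outputs": {"PBMT-1": "Il a donné à l’homme.",
                 "NMT": "Il l’a donné à l’homme.",
                 "Google": "Il le donna à l’homme."}},
    {"id": "S25a",
     "question": "Is the French translation correct and natural?",
     "source": "He washed his hands.",
     "outputs": {"PBMT-1": "Il se lavait les mains.",
                 "NMT": "Il a lavé ses mains.",
                 "Google": "Il se lava les mains."}},
]
sheet, key = build_annotation_sheet(items, seed=0)
```

Calling the function with a different seed per annotator gives each annotator an independent question order.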

5.1 Quantitative comparison

Table 2 summarizes our results in terms of percentage of successful translations, globally and over each main type of divergence. For comparison with traditional metrics, we also include BLEU scores measured on the WMT 2014 test set.

As we can see, the two PBMT systems fare very poorly on our challenge set, especially in the morpho-syntactic and purely syntactic types. Their somewhat better handling of lexico-syntactic issues probably reflects the fact that PBMT systems are naturally more attuned to lexical cues than to morphology or syntax. The two NMT systems are clear winners in all three categories. The GNMT system is best overall with a success rate of 68%, likely due to the data and architectural factors mentioned in section 4.3. [Footnote 7: We cannot offer a full comparison with the pre-NMT Google system. However, in October 2016 we ran a smaller 35-sentence version of our challenge set on both the Google system and our PBMT-1 system. The Google system only got 4 of those examples right (11.4%) while our PBMT-1 got 6 right (17.1%).]

WMT BLEU scores correlate poorly with challenge-set performance. The large gap of 2.3 BLEU points between PBMT-1 and PBMT-2 corresponds to only a 1% gain on the challenge set, while the small gap of 0.4 BLEU between PBMT-2 and NMT corresponds to a 21% gain.

Inter-annotator agreement (final column in Table 2) is excellent overall, with all three annotators agreeing on almost 90% of system outputs. Syntactic divergences appear to be somewhat harder to judge than other categories.
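The two summary statistics reported in Table 2 (success rate as positive judgments over total judgments, and the proportion of outputs with unanimous annotators) can be computed as in the sketch below. The data layout and function names are hypothetical. How abstentions enter the totals is not specified in the text; here they are assumed to be excluded from the success rate and to count as a distinct label for agreement.

```python
def success_rate(judgments):
    """Positive judgments divided by total judgments.

    `judgments` holds one tuple per system output with one entry per annotator:
    True (success), False (failure) or None (abstention, assumed excluded).
    """
    votes = [v for output in judgments for v in output if v is not None]
    return sum(votes) / len(votes) if votes else float("nan")


def unanimous_rate(judgments):
    """Proportion of outputs on which all annotators gave the same label."""
    return sum(len(set(output)) == 1 for output in judgments) / len(judgments)


# Toy judgments from three annotators on four outputs (hypothetical data).
toy_judgments = [
    (True, True, True),
    (False, False, False),
    (True, False, True),
    (True, None, True),
]
rate = success_rate(toy_judgments)        # 7 positive votes out of 11 cast
agreement = unanimous_rate(toy_judgments)  # 2 unanimous outputs out of 4
```

Grouping the judgment tuples by divergence category before calling these functions would give per-category figures of the kind shown in the table.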

5.2 Qualitative assessment of NMT

We now turn to an analysis of the strengths and weaknesses of neural MT through the microscope of our divergence categorization system, hoping that this may help focus future research on key issues. In this discussion we ignore the results obtained by PBMT-2 and compare: a) the results obtained by PBMT-1 to those of NMT, both systems having been trained on the same dataset; and b) the results of these two systems with those of Google NMT which was trained on a much larger dataset.

In the remainder of the present section we will refer to the sentences of our challenge set using the subcategory-based numbering scheme S1-S26 as assigned in Appendix B. A summary of the category-wise performance of PBMT-1, NMT and Google NMT is provided in Table 3.

Strengths of neural MT

Overall, both neural MT systems do much better than PBMT-1 at bridging divergences. In the case of morpho-syntactic divergences, we observe a jump from 16% to 72% in the case of our two local systems. This is mostly due to the NMT system’s ability to deal with many of the more complex cases of subject-verb agreement:

• Distractors. The subject’s head noun agreement features get correctly passed to the verb phrase across intervening noun phrase complements (sentences S1a–c).

• Coordinated verb phrases. Subject agreement marks are correctly distributed across the elements of such verb phrases (S3a–c).

• Coordinated subjects. Much of the logic that is at stake in determining the agreement features of coordinated noun phrases (cf. our relevant description in section 3.1) appears to be correctly captured in the NMT translations of S4.

• Past participles. Even though the rules governing French past participle agreement are notoriously difficult (especially after the “avoir” auxiliary), they are fairly well captured in the NMT translations of (S5b–e).

The NMT systems are also better at handling lexico-syntactic divergences. For example:

• Double-object verbs. There are no such verbs in French and the NMT systems perform the required adjustments flawlessly (sentences S8a–S8c).

• Overlapping subcat frames. NMT systems manage to discriminate between an NP complement and a sentential complement starting with an NP: cf. to know NP versus to know NP is VP (S11b–e).

• NP-to-VP complements. These English infinitival complements often need to be rendered as finite clauses in French and the NMT systems are better at this task (S12a–c).

Finally, NMT systems also turn out to better handle purely syntactic divergences. For example:

• Yes-no question syntax. The differences between English and French yes-no question syntax are correctly bridged by the two NMT systems (S17a–c).

• French proclitics. NMT systems are significantly better at transforming English pronouns into French proclitics, i.e. moving them before the main verb and case-inflecting them correctly (S23a–e).

• Finally, we note that the Google system manages to overcome several additional challenges. It correctly translates tag questions (S18a–c), constructions with stranded prepositions (S19a–f), most cases of the inalienable possession construction (S25a–e) as well as zero relative pronouns (S26a–c).

The large gap observed between the results of the in-house and Google NMT systems indicates that current neural MT systems are extremely data hungry. But given enough data, they can successfully tackle some challenges that are often thought of as extremely difficult. A case in point here is that of stranded prepositions (see discussion in section 3.3), in which we see the NMT model capture some instances of WH-movement, the textbook example of long-distance dependencies.

Weaknesses of neural MT

In spite of its clear edge over PBMT, NMT is not without some serious shortcomings. We already mentioned the degradation issue with long sentences, which, by design, could not be observed with our challenge set. But an analysis of our results will reveal many other problems. Globally, we note that even using a staggering quantity of data and a highly sophisticated NMT model, the Google system fails to reach the 70% mark on our challenge set. The fine-grained error categorization associated with the challenge set will help us single out precise areas where more research is needed. Here are some relevant observations.

Incomplete generalizations. In several cases where partial results might suggest that NMT has correctly captured some basic generalization about linguistic data, further instances reveal that this is not fully the case.

• Agreement logic. The logic governing the agreement features of coordinated noun phrases (see section 3.1) has been mostly captured by the NMT systems (cf. the 12 sentences of S4), but there are some gaps. For example, the Google system runs into trouble with mixed-person subjects (sentences S4d1–3).

• Subjunctive mood triggers. While some subjunctive mood triggers are correctly registered (e.g. “demander que” and “malheureux que”), the case of such a highly frequent subordinate conjunction as provided that $\rightarrow$ à condition que is somehow being missed (sentences S6a–c).

• Noun compounds. The French translation of an English compound N1 N2 is usually of the form N2 Prep N1. For any given head noun N2 the correct preposition Prep depends on the semantic class of N1. For example steel/ceramic/plastic knife $\rightarrow$ couteau en acier/céramique/plastique but butter/meat/steak knife $\rightarrow$ couteau à beurre/viande/steak. Given that neural models are known to perform some semantic generalizations, we find their performance disappointing on our compound noun examples (S14a–i).

• The so-called French “inalienable possession” construction arises when an agent performs an action on one of her body parts, e.g. I brushed my teeth. The French translation will normally replace the possessive article with a definite one and introduce a reflexive pronoun, e.g. Je me suis brossé les dents (‘I brushed myself the teeth’). In our dataset, the Google system gets this right for examples in the first and third persons (sentences S25a,b) but fails to do the same with the example in the second person (sentence S25c).
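The noun-compound pattern in the list above (N1 N2 rendered as N2 Prep N1, with Prep chosen by the semantic class of N1) can be made concrete with a toy lookup. The class labels “material” and “purpose”, the table names, and the function are illustrative assumptions; the word pairs themselves are the knife examples from the text.

```python
# Preposition selected by the semantic class of the modifier N1 (labels assumed).
PREP_BY_CLASS = {"material": "en", "purpose": "à"}

SEMANTIC_CLASS = {
    "steel": "material", "ceramic": "material", "plastic": "material",
    "butter": "purpose", "meat": "purpose", "steak": "purpose",
}

FRENCH = {
    "knife": "couteau",
    "steel": "acier", "ceramic": "céramique", "plastic": "plastique",
    "butter": "beurre", "meat": "viande", "steak": "steak",
}


def translate_compound(n1, n2):
    """Render the English compound "n1 n2" as French "N2 Prep N1"."""
    prep = PREP_BY_CLASS[SEMANTIC_CLASS[n1]]
    return f"{FRENCH[n2]} {prep} {FRENCH[n1]}"


material_example = translate_compound("steel", "knife")
purpose_example = translate_compound("butter", "knife")
```

The difficulty for an MT system is that the `SEMANTIC_CLASS` table is never given explicitly: the class of N1 has to be generalized from training data.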

Then there are also phenomena that current NMT systems, even with massive amounts of data, appear to be completely missing:

•

Common and syntactically flexible idioms. While PBMT-1 produces an acceptable translation for half of the idiomatic expressions of S15 and S16, the local NMT system misses them all and the Google system does barely better. NMT systems appear to be short on raw memorization capabilities.

•

Control verbs. Two different classes of verbs can govern a subject NP, an object NP plus an infinitival complement. With verbs of the “object-control” class (e.g. “persuade”), the object of the verb is understood as the semantic subject of the infinitive. But with those of the “subject-control” class (e.g. “promise”), it is rather the subject of the verb which plays that semantic role. None of the systems tested here appear to get a grip on subject control cases, as evidenced by the lack of correct feminine agreement on the French adjectives in sentences S2b–d.

•

Argument switching verbs. All systems tested here mistranslate sentences S7a–c by failing to perform the required argument switch: NP1 misses NP2 $\rightarrow$ NP2 manque à NP1.

•

Crossing movement verbs. None of the systems managed to correctly restructure the regular manner-of-movement verbs, e.g. swim across X $\rightarrow$ traverser X à la nage, in sentences S10a–c. Unsurprisingly, all systems also fail on the even harder example S10d, in which the “nonce verb” guitared is a spontaneous derivation from the noun guitar being cast as an ad hoc manner-of-movement verb. (Footnote 8: On the concept of nonce word, see https://en.wikipedia.org/wiki/Nonce_word.)

•

Middle voice. None of the systems tested here were able to recast the English “generic passive” of S21a–c into the expected French “middle voice” pronominal construction.

6 Conclusions

We have presented a radically different kind of evaluation for MT systems: the use of challenge sets designed to stress-test MT systems on “hard” linguistic material, while providing a fine-grained linguistic classification of their successes and failures. This approach is not meant to replace our community’s traditional evaluation tools but to supplement them.

Our proposed error categorization scheme makes it possible to bring to light different strengths and weaknesses of PBMT and neural MT. With the exception of idiom processing, in all cases where a clear difference was observed it turned out to be in favor of neural MT. A key factor in NMT’s superiority appears to be its ability to overcome many limitations of $n$-gram language modeling. This is clearly at play in dealing with subject-verb agreement, double-object verbs, overlapping subcategorization frames and, last but not least, the pinnacle of Chomskyan linguistics, WH-movement (in this case, stranded prepositions).

But our challenge set also brings to light some important shortcomings of current neural MT, regardless of the massive amounts of training data it may have been fed. As may have been already known or suspected, NMT systems struggle with the translation of idiomatic phrases. Perhaps more interestingly, we notice that neural MT’s impressive generalizations still seem somewhat brittle. For example, the NMT system can appear to have mastered the rules governing subject-verb agreement or inalienable possession in French, only to trip over a rather obvious instantiation of those rules. Probing where these boundaries are, and how they relate to the neural system’s training data and architecture, is an obvious next step.

7 Future Work

It is our hope that the insights derived from our challenge set evaluation will help inspire future MT research, and call attention to the fact that even “easy” language pairs like English–French still have many linguistic issues left to be resolved. But there are also several ways to improve and expand upon our challenge set approach itself.

First, though our human judgments of output sentences allowed us to precisely assess the phenomena of interest, this approach is not scalable to large sets, and requires access to native speakers in order to replicate the evaluation. It would be interesting to see whether similar scores could be achieved through automatic means. The existence of human judgments for this set provides a gold standard by which proposed automatic judgments may be meta-evaluated.
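One way such a meta-evaluation could be set up is sketched below: the per-sentence human labels serve as the gold standard, and a candidate automatic judge is scored by raw agreement and by Cohen's kappa, which corrects for chance agreement. This is only an illustrative sketch; the paper does not prescribe a particular agreement statistic, and the function names and toy labels are assumptions.

```python
from typing import Sequence


def accuracy(gold: Sequence[bool], auto: Sequence[bool]) -> float:
    """Fraction of items on which the automatic judgment matches the gold label."""
    return sum(g == a for g, a in zip(gold, auto)) / len(gold)


def cohen_kappa(gold: Sequence[bool], auto: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary labelings."""
    n = len(gold)
    observed = accuracy(gold, auto)
    p_gold_yes = sum(gold) / n
    p_auto_yes = sum(auto) / n
    # Agreement expected if both labelings were independent.
    expected = p_gold_yes * p_auto_yes + (1 - p_gold_yes) * (1 - p_auto_yes)
    if expected == 1.0:
        # Degenerate case: both labelings are constant and identical.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels: True means "divergence correctly bridged".
    human = [True, True, False, False]
    automatic = [True, False, False, False]
    print("accuracy:", accuracy(human, automatic))
    print("kappa:", cohen_kappa(human, automatic))
```

Reporting kappa alongside accuracy matters when the gold labels are skewed, since a judge that always answers "no" can reach high raw agreement on a system that fails most challenge sentences.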

Second, the construction of such a challenge set requires in-depth knowledge of the structural divergences between the two languages of interest. A method to automatically create such a challenge set for a new language pair would be extremely useful. One could imagine approaches that search for divergences, indicated by atypical output configurations, or perhaps by a system’s inability to reproduce a reference from its own training data. Localizing a divergence within a difficult sentence pair would be another useful subtask.

Finally, we would like to explore how to train an MT system to improve its performance on these divergence phenomena. This could take the form of designing a curriculum to demonstrate a particular divergence to the machine, or altering the network structure to capture such generalizations.

Acknowledgments

We would like to thank Cyril Goutte, Eric Joanis and Michel Simard, who graciously spent the time required to rate the output of four different MT systems on our challenge sentences. We also thank Roland Kuhn for valuable discussions, and comments on an earlier version of the paper.

Appendix A Instructions to Annotators

The following instructions were provided to annotators:

You will be presented with 108 short English sentences and the French translations produced for them by each of four different machine translation systems. You will not be asked to provide an overall rating for the machine-translated sentences. Rather, you will be asked to determine whether or not a highly specific aspect of the English sentence is correctly rendered in each of the different translations. Each English sentence will be accompanied with a yes-no question which precisely specifies the targeted element for the associated translations. For example, you may be asked to determine whether or not the main verb phrase of the translation is in correct grammatical agreement with its subject.

In order to facilitate this process, each English sentence will also be provided with a French reference (human) translation in which the particular elements that support a yes answer (in our example, the correctly agreeing verb phrase) will be highlighted. Your answer should be “yes” if the question can be answered positively and “no” otherwise. Note that this means that any translation error which is unrelated to the question at hand should be disregarded. Using the same example: as long as the verb phrase agrees correctly with its subject, it does not matter whether or not the verb is correctly chosen, is in the right tense, etc. And of course, it does not matter if unrelated parts of the translation are wrong.

In most cases you should be able to quickly determine a positive or negative answer. However, there may be cases in which the system has come up with a translation that just does not contain the phenomenon targeted by the associated question. In such cases, and only in such cases, you should choose “not applicable” regardless of whether or not the translation is correct.

Appendix B Challenge Set

We include a rendering of our challenge set in the pages that follow, along with system output for the PBMT-1, NMT and Google systems. (Footnote 9: A machine-readable version is provided in the file Challenge_set-v2hA.json in the supplemental materials.) Sentences are grouped by linguistic category and subcategory. For convenience, we also include a reference translation, which is a manually-crafted translation that is designed to be the most straightforward solution to the divergence problem at hand. Needless to say, this reference translation is seldom the only acceptable solution to the targeted divergence problem. Our judges were provided these references, but were instructed to use their knowledge of French to judge whether the divergence was correctly bridged, regardless of the translation’s similarity to the reference.

In all translations, the locus of the targeted divergence is highlighted in boldface and it is specifically on that portion that our annotators were asked to provide a judgment. For each system output, we provide a summary of our annotators’ judgments on its handling of the phenomenon of interest. We label the translation with a ✓ if two or more annotators judged the divergence to be correctly bridged, and with an ✗ otherwise.
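The labeling rule above, together with the summary statistics described in the caption of Table 2 (positive judgments divided by total judgments, and the proportion of outputs on which all three annotators agreed), can be sketched as follows. The data layout, the label strings and the function names are assumptions made for illustration; in particular, counting a “not applicable” answer in the denominator of the success rate is an assumption of this sketch, not something the paper states.

```python
from typing import List, Sequence


def summary_label(judgments: Sequence[str]) -> str:
    """Return '✓' if two or more annotators answered 'yes', else '✗'."""
    yes_votes = sum(1 for j in judgments if j == "yes")
    return "✓" if yes_votes >= 2 else "✗"


def success_rate(all_judgments: List[Sequence[str]]) -> float:
    """Pool all judgments and divide the positive ones by the total."""
    pooled = [j for judgments in all_judgments for j in judgments]
    return sum(j == "yes" for j in pooled) / len(pooled)


def full_agreement_rate(all_judgments: List[Sequence[str]]) -> float:
    """Proportion of outputs on which every annotator gave the same answer."""
    unanimous = sum(len(set(judgments)) == 1 for judgments in all_judgments)
    return unanimous / len(all_judgments)


if __name__ == "__main__":
    # Hypothetical judgments from three annotators for four system outputs.
    toy = [
        ["yes", "yes", "no"],
        ["no", "no", "no"],
        ["yes", "yes", "yes"],
        ["yes", "no", "na"],
    ]
    print([summary_label(j) for j in toy])
    print("success rate:", success_rate(toy))
    print("full agreement:", full_agreement_rate(toy))
```

Note that the per-sentence ✓/✗ label is a majority vote, while the success rate pools individual judgments, so the two can diverge on items where the annotators disagree.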

We also release a machine-readable version of this same data, including all of the individual judgments, in the hope that others will find interesting new uses for it.

Bibliography


1. Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the Third International Conference on Learning Representations (ICLR). San Diego, USA. http://arxiv.org/abs/1409.0473
2. Bentivogli et al. (2016) Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 257–267. https://aclweb.org/anthology/D16-1025
3. Bojar et al. (2016) Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation. Assoc
4. Cherry (2013) Colin Cherry. 2013. Improved reordering for phrase-based translation using sparse features. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia, pages 22–31. http://www.aclweb.org/anthology/N13-1003
5. Cherry and Foster (2012) Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Montréal, Canada, pages 427–436. http://www.aclweb.org/anthology/N12-1047
6. Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1724–1734. http://www.aclweb.org/anthology/D14-1179
7. Devlin et al. (2014) Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 1370–1380. http://www.aclweb.org/anthology/P14-1129
8. Dorr (1994) Bonnie J. Dorr. 1994. Machine translation divergences: a formal description and proposed solution. Computational Linguistics 20:4. http://aclweb.org/anthology/J/J94/J94-4004.pdf