Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Aditya Mogadala; Marimuthu Kalimuthu; Dietrich Klakow

arXiv:1907.09358 · cs.CV · January 4, 2022

TL;DR

This survey reviews ten key vision-and-language tasks, analyzing their problem formulations, datasets, methods, and results, aiming to guide future research and innovation in the integration of these AI sub-fields.

Contribution

It provides a comprehensive comparison of tasks, datasets, and methods in vision-language integration, extending beyond previous surveys by covering multiple content types and offering future directions.

Findings

1. Comparison of state-of-the-art methods across tasks
2. Analysis of datasets and evaluation measures
3. Identification of challenges and future research directions

Abstract

Interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing. Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks. This has created significant interest in the integration of vision and language. In this survey, we focus on ten prominent tasks that integrate language and vision, discussing their problem formulations, methods, existing datasets, and evaluation measures, and comparing the results obtained with the corresponding state-of-the-art methods. Our efforts go beyond earlier surveys, which are either task-specific or concentrate on only one type of visual content, i.e., image or video.…

Tables (147)

Table 1: Summary of methods for generating a global description of an image. Approaches are grouped by their use of attention and reinforcement learning (RL).

Approach	Attention	RL
MLBL (?)	✗	✗
m-RNN (?)	✗	✗
Minds Eye (?)	✗	✗
BRNN (?)	✗	✗
NIC (?)	✗	✗
LRCN (?)	✗	✗
Guided LSTM (?)	✗	✗
Deep Bidirectional LSTM (?)	✗	✗
Regional Visual Attributes (?)	✗	✗
Language CNN (?)	✗	✗
ConceptNet-NIC (?)	✗	✗
Visual Attention (?)	✓	✗
Region-based Attention (?)	✓	✗
Attribute Attention (?)	✓	✗
Review Attention (?)	✓	✗
Adaptive Attention (?)	✓	✗
Areas of Attention (?)	✓	✗
Contrastive Adaptive Attention (?)	✓	✗
Neural Baby Talk w/ Attention (?)	✓	✗
Convolutional Attention (?)	✓	✗
Reflective Decoding Network (?)	✓	✗
Self-Critical Attention (?)	✓	✓
Policy Gradient (?)	✓	✓
Up-Down (?)	✓	✓
Multi-task Captioning (?)	✓	✓
Stack Captioning (?)	✓	✓
Attention on Attention (?)	✓	✓
Meshed-Memory Transformer (?)	✓	✓

Table 2: Basic statistics of the SBU1M image description dataset.

Total Images	Captions per Image	Total Captions	Object Categories
1,000,000	1	1,000,000	89

Table 3: Splits of the Flickr8k image description dataset.

Split	Images	Captions per Image	Total Captions
Training	6,000	5	30,000
Validation	1,000	5	5,000
Test	1,000	5	5,000
Total	8,000	5	40,000
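
The figures in Table 3 can be cross-checked with simple arithmetic: each row should satisfy images × captions per image = total captions, and the split rows should add up to the Total row. The following is a minimal sketch of such a check. The names `FLICKR8K_SPLITS`, `FLICKR8K_TOTAL`, and `check_splits` are hypothetical helpers introduced here for illustration and do not come from the survey.

```python
# Flickr8k figures copied from Table 3: (images, captions per image, total captions).
FLICKR8K_SPLITS = {
    "Training": (6_000, 5, 30_000),
    "Validation": (1_000, 5, 5_000),
    "Test": (1_000, 5, 5_000),
}
FLICKR8K_TOTAL = (8_000, 5, 40_000)


def check_splits(splits, total):
    """Return a list of inconsistencies found in a split table (empty if none)."""
    problems = []
    for name, (images, per_image, captions) in splits.items():
        # Each row should satisfy images x captions-per-image = total captions.
        if images * per_image != captions:
            problems.append(f"{name}: {images} x {per_image} != {captions}")
    # The per-split counts should add up to the reported 'Total' row.
    image_sum = sum(row[0] for row in splits.values())
    caption_sum = sum(row[2] for row in splits.values())
    if image_sum != total[0]:
        problems.append(f"images sum to {image_sum}, table reports {total[0]}")
    if caption_sum != total[2]:
        problems.append(f"captions sum to {caption_sum}, table reports {total[2]}")
    return problems


if __name__ == "__main__":
    print(check_splits(FLICKR8K_SPLITS, FLICKR8K_TOTAL))
```

The same check could be applied to any of the split tables that report a fixed number of captions per image.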

Table 4: Splits of the Flickr30k image description dataset.

Split	Images	Captions per Image	Total Captions
Training	29,000	5	145,000
Validation	1,014	5	5,070
Test	1,000	5	5,000
Total	31,014	5	155,070

Table 5: Splits and statistics of the Flickr30k-Entities image description dataset.

Split	Num. of Images	Object Categories	Objects per Category	Objects per Image	Captions per Image	Total Captions
Training	29,783	-	-	-	5	148,915
Validation	1,000	-	-	-	5	5,000
Test	1,000	-	-	-	5	5,000
Total	31,783	44,518	6.2	8.7	5	158,915

Table 6: Splits of the MSCOCO image description dataset.

Split	Images	Captions per Image	Total Captions	Object Categories
Training	113,287	5	566,435	-
Validation	5,000	5	25,000	-
Test	5,000	5	25,000	-
Total	123,287	5	616,435	80

Table 7: Splits and statistics of the MSCOCO-Entities image description dataset.

Split	Images	Total Captions	Noun chunks	Noun chunks per caption	Unique Classes
Training	113,287	545,202	1,518,667	2.79	1,330
Validation	5,000	7,818	20,787	2.66	725
Test	5,000	7,797	20,596	2.64	730

Table 8: Statistics of the STAIR Captions image description dataset (Japanese). Details on the public part of the dataset are indicated in brackets.

Total Num. of Images	Captions per Image	Total Num. of Captions	Vocabulary Size	Avg. Number of Chars
164,062 (123,287)	5	820,310 (616,435)	35,642 (31,938)	23.79 (23.80)

Table 9: Splits and statistics of the Multi30k-CLID (2016) dataset.

		Language of the Captions
Split	Images	English	German
Training	29,000	145,000	145,000
Validation	1,014	5,070	5,070
Testing	1,000	5,000	5,000

Table 10: Splits and statistics of the Multi30k-CLID (2017) dataset.

		Language of the Captions
Split	Images	English	French	German
Training	29,000	145,000	145,000	145,000
Validation	1,014	5,070	5,070	5,070
Testing	1,000	5,000	5,000	5,000

Table 11: Splits and statistics of the Multi30k-CLID (2018) dataset.

		Language of the Captions
Split	Images	Czech	English	French	German
Training	29,000	145,000	145,000	145,000	145,000
Validation	1,014	5,070	5,070	5,070	5,070
Testing	1,071	5,355	5,355	5,355	5,355

Table 12: Splits of the Conceptual Captions dataset.

Split	Images	Captions
Training	3,318,333	3,318,333
Validation	15,840	15,840
Test	22,530	22,530

Table 13: Splits and statistics of the Personality Captions dataset.

Split	Num. of Images	Captions per Image	Num. of Captions	Personality Types	Vocabulary Size	Avg. Tokens per Caption
Training	186,858	1	186,858	215	33,641	11.2
Validation	5,000	1	5,000	215	5,460	10.9
Test	10,000	5	50,000	215	16,655	11.1

Table 14: Statistics of the MSVD dataset.

Total Videos	Total Classes	Total Length	Avg. Length	Total Clips	Total Sentences	Total Words	Vocabulary Size
1,970	218	5.3 h	10 s	1,970	70,028	607,339	13,010

Table 15: Splits of the MSVD dataset.

Split	Frames	Videos
Training	33,682	1,200
Validation	3,275	100
Test	20,528	670
Total	57,485	1,970

Table 16: Statistics of the MPII Cooking Activities dataset.

Num. of Subjects	Total Clips	Total Videos	Total Frames	Video Length	Total Length	Num. of Activities	Total Dishes	Activity Annotations
12	5,609	44	881,755	3 to 41 m	8.0 h	65	14	5,609

Table 17: Splits of the MPII Cooking dataset.

Split	Frames	Subjects
Training	1,071	10
Validation	-	-
Test	1,277	7

Table 18: Statistics of the YouCook dataset.

Cooking Styles	Object Classes	Total Videos	Total Length	Num. of Sentences	Num. of Words	Vocabulary Size
6	10	88	2.3 h	2,688	42,457	2,711

Table 19: Splits of the YouCook dataset.

Split	Videos
Training	49
Validation	-
Test	39

Table 20: Statistics of the YouCook II dataset.

Cooking Recipes	Total Videos	Total Video Length	Avg. Video Length	Procedure Seg. per Video	Total Clips	Num. of Sentences	Vocab. Size
89	2,000	175.6 h	316 s	3-16	15,400	15,400	2,600

Table 21: Splits of the YouCook II dataset.

Split	Videos
Training	1,340
Validation	460
Test	200

Table 22: The TACoS dataset statistics - I.

Total Videos	Total Clips	Descriptions per Video	Annotation Assignments	Annotations after filtering	Cooking Tasks/Dishes	Action Descriptions
127	7,206	20	2,540	2,206	26	17,334 (tokens)

Table 23: The TACoS dataset statistics - II.

Sentence Types	Total Words	Content Words (viz. nouns, verbs, adjectives)	Num. of Verbs (tokens)	Num. of Verbs (lemmas)
11,796	146,771	75,210	28,292	435

Table 24: Statistics of the TACoS-MultiLevel dataset.

Total Videos	Total Clips	Total Video Length	Avg. Length	Number of Sentences	Total Words
185	14,105	27.1 h	360 s	52,593	2,000

Table 25: Statistics of the MPII-MD dataset.

	Unique	Before alignment	After alignment
	Movies	Words	Words	Sentences	Clips	Avg. Length	Total
Audio Desc.	55	346,557	332,846	37,272	37,266	4.1 s	42.5 h
Movie script	50	398,072	320,621	31,103	31,071	3.6 s	31.1 h
Total	94	744,629	653,467	68,375	68,337	3.9 s	73.6 h

Table 26: Statistics of the M-VAD dataset.

Type	Movies	Words	Paragraphs	Sentences	Avg. Length	Total
Un-filtered	92	531,778	52,683	59,415	6.3 s	91 h
Filtered	92	510,933	48,986	55,904	6.2 s	84.6 h

Table 27: Splits of the M-VAD dataset.

Split	Video Clips
Training	38,949
Validation	4,888
Test	5,149

Table 28: Statistics of the MSR-VTT dataset.

Categories	Videos	Clips	Sentences per Clip	Sentences	Words	Vocab.	Duration
20	7,180	10,000	20	200,000	1,856,523	29,316	41.2 h

Table 29: Splits of the MSR-VTT dataset.

Split	Video Clips
Training	6,513
Validation	497
Test	2,990

Table 30: Statistics of the VTW dataset.

Dataset	Sentences	Vocab.	Sentences/Word	Nouns	Verbs	Adjective	Adverb
VTW-title	18,100	8,874	2.0	5,850	2,187	1,187	224
VTW-full	44,603	23,059	1.9	13,606	6,223	3,967	846

Table 31: Splits of the VTW dataset.

Split	Videos	Sentences/Titles
Training	14,100	14,100
Validation	2,000	2,000
Test	2,000	2,000

Table 32: Statistics of the ANetCap dataset.

Videos	Total Video Hours	Avg. Video Length	Sentences	Avg. Sentence Length
20,000	849	180 s	100,000	13.48 (words)

Table 33: Splits of the ANetCap dataset.

Split	Videos
Training	10,024
Validation	4,926
Test	5,044

Table 34: Statistics and splits of the ANetEntities dataset.

Split	Videos	Sentences	Objects	Bounding Boxes
Training	10,000	35,000	432	105,000
Validation	2,500	8,600	427	26,500
Test	2,500	8,500	421	26,100
Total	15,000	52,100	432	157,600

Table 35: Statistics of the COIN dataset.

Num. of Domains	Num. of Tasks	Total Videos	Total Segments	Total Duration	Avg. Video Length	Avg. Segment Length
12	180	11,827	46,354	476 h, 38 m	2.36 m	14.91 s

Table 36: Splits of the COIN dataset.

Split	Videos
Training	9,030
Validation	-
Test	2,797

Table 37: Statistics of the HowTo100M dataset.

Num. of Domains	Num. of Tasks	Total Videos	Total Clips	Total Duration	Total Captions	Avg. Video Length	Avg. Clip-Caption Pairs per Video
12	23,611	1.221M	136M	134,472 h	136M	6.5 m	110

Table 38: Statistics of the NYC-Storytelling dataset.

Images	Blog posts
78,467	11,863

Table 39: Statistics of the Disneyland-Storytelling dataset.

Images	Blog posts
60,545	7,717

Table 40: Statistics of the SIND dataset.

	Images	Flickr Albums	(Text, Image)	Vocab
DII	-	-	151,800	13,800
DIS	-	-	151,800	5,000
SIS	-	-	252,900	18,200
Total	210,819	10,117	-	-

Table 41: Statistics of the VIST (SIND v.2) dataset.

Images	Text Sequences
81,743	10,117

Table 42: Splits of the VIST dataset.

Split	Stories	Sentences
Training	40,155	200,775
Validation	4,990	24,950
Test	5,055	25,275

Table 43: Exemplar Image Storytelling architectures.

Approach	Image	Language	Combined	Optimizer	RL
(?)	AlexNet	LM	MLBL	-	✗
(?)	VGG	RNN	NeuralTalk	RMSprop	✗
(?)	GoogLeNet	LSTM	NIC	SGD	✗
(?)	VGG	RNN	CRCN	RMSprop	✗
(?)	VGG	GRU	Story-Flat	-	✗
(?)	VGG	LSTM	HierarchicalRNN	ADAM	✗
(?)	VGG	LSTM	BARNN	-	✗
(?)	VGG	LSTM	GAN	ADAM	✓
(?)	ResNet-152	GRU	AREL	ADAM	✓

Table 44: Results of different models on the NYC-Storytelling dataset.

Model	B-4	CIDEr	METEOR	R@1	R@5	MedRank
MLBL (?)	0.01	2.6	5.29	1.19	4.52	100.5
NeuralTalk (?)	0.00	0.5	1.34	0.48	2.86	120.5
NIC (?)	0.10	9.1	5.73	0.95	7.38	88.5
CRCN (?)	2.08	30.9	7.69	11.67	31.19	14.00
Story-Flat (?)	-	-	7.37	-	-	-
HierarchicalRNN (?)	-	-	6.07	-	-	-
BARNN (?)	-	41.6	-	29.37	45.43	8
AREL (?)	-	-	8.39	-	-	-

Table 45: Results of various models on the Disneyland-Storytelling dataset.

Model	B-4	CIDEr	METEOR	R@1	R@5	MedRank
MLBL (?)	0.01	3.4	4.99	1.02	4.08	62
NeuralTalk (?)	0.00	0.4	1.34	1.02	3.40	88
NIC (?)	0.07	10.0	4.51	2.83	10.38	61.5
CRCN (?)	3.49	52.7	8.78	14.29	31.29	16
Story-Flat (?)	-	-	7.61	-	-	-
HierarchicalRNN (?)	-	-	7.72	-	-	-
BARNN (?)	-	54.1	-	35.01	49.07	6
AREL (?)	-	-	9.90	-	-	-

Table 46: Results of different models on the SIND dataset.

Model	B-4	CIDEr	METEOR	R@1	R@5	MedRank
CRCN (?)	-	-	-	9.87	28.74	21
Story-Flat (?)	3.50	6.84	10.25	-	-	-
HierarchicalRNN (?)	3.7	6.51	9.97	-	-	-
AREL (?)	5.16	11.35	12.32	-	-	-

Table 47: Results of various models on the VIST dataset.

Model	B-4	CIDEr	METEOR	R@1	R@5	MedRank
enc-attn-dec (?)	-	4.96	32.98	-	-	-
h-attn-rank (?)	-	7.38	33.94	-	-	-
BARNN (?)	-	-	33.32	24.07	44.29	9
AREL-t-100 (?)	14.1	9.4	35.0	-	-	-

Table 48: Statistics of the VideoStory dataset.

Total Videos	Total Length	Total Clips	Avg. Video Duration	Total Sentences	Sentences per Video
20,000	396 h	123,000	70 s	123,000	4.67

Table 49: Splits of the VideoStory dataset.

Split	Videos	Clips	Paragraphs/video	Paragraphs	Words/paragraph
Training	17,098	80,598	1	17,098	61.76
Validation	999	13,796	3	2,997	59.88
Testing	1,011	14,093	3	3,033	59.77
Test (Blind)	1,039	14,139	3	3,117	69.45
Total	20,147	122,626	-	26,245	62.23

Table 50: Statistics of the VideoStory-NUS dataset.

Domain	Videos	Avg. Video Length	Avg. Story Length	Avg. Sentence Length	Vocab. Size
Open	105	12 m 35 s	162.6	12.1	4,045

Table 51: Splits of the VideoStory-NUS dataset.

Split	Percentage (%)	Videos
Training	70	73
Validation	15	16
Test	15	16

Table 52: Exemplar Video Storytelling architectures.

Approach	Video	Frame	Language	Combined	Optimizer	RL
(?)	C3D	VGG	GRU	H-RNN	RMSProp	✗
(?)	R3D	ResNet-101	GRU	seq-seq+context	ADAM	✗
(?)	-	ResNet-101	GRU	ResBRNN	ADAM	✓

Table 53: Results obtained with different models on the VideoStory dataset.

Model	B-4	CIDEr	METEOR	R@1	R@5	MedRank
seq-seq+context (?)	1.20	9.37	33.88	-	-	-

Table 54: Results obtained with different models on the VideoStory-NUS dataset.

Model	B-4	CIDEr	METEOR	R@1	R@5	MedRank
mRNN (?)	11.8	81.3	18.0	5.34	21.23	29
Deep Video-Text (?)	11.5	79.5	17.7	4.72	19.85	31
H-RNN (?)	16.1	64.6	15.5	-	-	-
ResBRNN (?)	14.7	94.3	19.6	7.44	25.77	22
ResBRNN-kNN (?)	15.6	103.6	20.1	-	-	-

Table 55: Statistics of the RefCLEF dataset.

Real Images	Distinct Objects	Referring Expressions	Train/Test Splits
19,894	96,654	130,525	Per-Image split

Table 56: Statistics of the RefCOCO dataset.

Images	Total Objects	Referring Expressions	Train/Test Splits
19,994	50,000	142,209	People vs. Object

Table 57: Statistics of the RefCOCO+ dataset.

Images	Total Objects	Referring Expressions	Train/Test Splits
19,992	49,856	141,564	People vs. Object

Table 58: Statistics of the RefCOCOg dataset.

Images	Total Objects	Referring Expressions	Train/Test Splits
26,711	54,822	85,474	Per-Object

Table 59: Statistics of the “GuessWhat” dataset. The row ‘Full’ means all the dialogues are included, ‘Finished’ means all finished dialogues (successful and unsuccessful) are included, and ‘Success’ means only successful dialogues are included.

Dataset Type	Images	Objects	Dialogues	Questions	Words	Vocab. Size
Full	66,537	134,073	155,280	821,889	3,986,192	11,465
Finished	65,112	125,349	144,434	732,081	3,540,497	10,985
Success	62,954	114,271	131,394	648,493	3,125,219	10,469

Table 60: Splits of the CLEVR-Ref+ dataset.

Split	Images	Referring Expressions
Training	70,000	700,000
Validation	15,000	150,000
Test	15,000	150,000

Table 61: Exemplar Image Referring Expression and Comprehension architectures.

Approach	Image	Language	Combined	Optimizer	RL
(?)	VGG	LSTM	MMI	SGD	✗
(?)	VGG	LSTM	Neg. Bag	SGD	✗
(?)	VGG	LSTM	Context	-	✗
(?)	VGG	BiLSTM	CG	ADAM	✗
(?)	VGG	LSTM	Combined	ADAM	✗
(?)	VGG	LSTM	CMN	-	✗
(?)	VGG	LSTM	Reinforcer	ADAM	✓
(?)	VGG	BiLSTM	VarContext	SGD	✓
(?)	VGG	LSTM	AccumulateAtt	SGD	✗
(?)	VGG	LSTM	ParallelAtt	ADAM	✗
(?)	ResNet-101	BiLSTM	MAttNet	-	✗
(?)	ResNet-101	BiLSTM	RVG-Tree	ADAM	✗
(?)	ResNet-101	BiLSTM	CMRIN	ADAM	✗

Table 62: Comparison of Precision@1 (%) scores of different methods on RefCOCO.

	RefCOCO
Model	val	testA	testB
MMI (?)	-	63.15	64.21
Neg. Bag (?)	76.90	75.60	78.00
Context (?)	76.18	74.39	77.30
CG (?)	-	74.04	73.43
Attributes (?)	-	78.85	78.07
CMN (?)	-	75.94	79.57
Reinforcer (?)	79.56	78.95	80.22
VarContext (?)	-	78.98	82.39
AccumulateAtt (?)	81.27	81.17	80.01
ParallelAtt (?)	81.67	80.81	81.32
MAttNet+ResNet-101 (?)	85.65	85.26	84.57
RVG-Tree+ResNet-101 (?)	83.48	82.52	82.90
CMRIN+ResNet-101 (?)	86.99	87.63	84.73

Table 63: Comparison of Precision@1 (%) scores of different methods on the RefCOCO+ and RefCOCOg datasets.

	RefCOCO+			RefCOCOg
Model	val	testA	testB	val	test
MMI (?)	-	48.73	42.13	-	-
Neg. Bag (?)	-	-	-	-	68.40
Context (?)	58.94	61.29	56.24	-	-
CG (?)	-	60.26	55.03	-	-
Attributes (?)	-	61.47	57.22	-	-
CMN (?)	-	59.29	59.34	-	-
Reinforcer (?)	62.26	64.60	59.62	71.65	71.92
VarContext (?)	-	62.56	62.90	-	-
AccumulateAtt (?)	65.56	68.76	60.63	-	-
ParallelAtt (?)	64.18	66.31	61.46	-	-
MAttNet+ResNet-101 (?)	71.01	75.13	66.17	78.10	78.12
RVG-Tree+ResNet-101 (?)	68.86	70.21	65.49	76.82	75.20
CMRIN+ResNet-101 (?)	75.52	80.93	68.99	80.45	80.66

Table 64: Statistics of the ORGaze dataset.

Videos	Objects	Condition	Lighting	Annotations
5,000	30,000	Urban	Daytime	Bounding Boxes, Gaze Recordings, Language Expression

Table 65: Exemplar Video Referring Expression and Comprehension architectures.

Approach	Video	Frame	Language	Combined	Optimizer	RL
(?)	-	VGG	LSTM	WithGaze	-	✗

Table 66: Comparison of Top-1 Accuracy (%) of different methods on the ORGaze dataset.

Methods	Edgebox	FRCNN ($\uparrow$)	LOP ($\uparrow$)
MNLM (?)	-	23.954	32.418
VSEM (?)	-	24.833	32.961
MCB (?)	-	26.445	33.366
SimModel (?)	4.5	18.431	35.556
WithGaze (?)	-	47.256	47.012

Table 67: Splits of the VQA v1.0 dataset with real scenes.

Dataset	Real	Questions	Answers	Textual Annotations
Split	Scenes	per Image	per Question	Questions	Answers
Training	82,783	3	10	248,349	2,483,490
Validation	40,504	3	10	121,512	1,215,120
Test	81,434	3	10	244,302	2,443,020

Table 68: Splits of the VQA v1.0 dataset with abstract scenes.

Dataset	Abstract	Questions	Answers	Textual Annotations
Split	Scenes	per Image	per Question	Questions	Answers
Training	20,000	3	10	60,000	600,000
Validation	10,000	3	10	30,000	300,000
Test	20,000	3	10	60,000	600,000

Table 69: Splits of the VQA v2.0 dataset with balanced real images.

Dataset	Real	Answers	Textual Annotations
Split	Images	per Question	Questions	Answers	Complementary Pairs
Training	82,783	10	443,757	4,437,570	200,394
Validation	40,504	10	214,354	2,143,540	95,144
Test	81,434	10	447,793	4,477,930	-

Table 70: Splits of VQA v2.0 with balanced binary abstract scenes.

Dataset	Binary Abstract	Answers	Textual Annotations
Split	Scenes	per Question	Questions	Answers
Training	20,629	10	22,055	220,550
Validation	10,696	10	11,328	113,280

Table 71: Statistics of the OK-VQA dataset.

Total Images	Total Questions	Answers per Question	Unique Questions	Unique Answers	Unique Ques. Words	Total Categories	Average Ans. Length
14,031	14,055	5	12,591	14,454	7,178	10 + 1	1.3

Table 72: Splits of the OK-VQA dataset.

Split	Percent (%)	Questions
Training	64	9,009
Test	36	5,046
Total	100	14,055

Table 73: Statistics of the KVQA dataset.

Total Images	Q&A Pairs	Unique Named Entities	Unique Answers	Avg. Ques. Len	Avg. Ans. Len	Avg. number of Questions per Image
24,602	183,007	18,880	19,571	10.14	1.64	7.44

Table 74: Splits of the KVQA dataset.

Split	Percent (%)	Images	Q&A pairs
Training	70	17k	130k
Validation	20	5k	34k
Test	10	2k	19k

Table 75: Statistics and splits of the MovieQA dataset. The column ‘Total’ represents mean counts with standard deviations.

Movies with Plots and Subtitles
	Training	Validation	Test	Total
Movies	269	56	83	408
QA pairs	9,848	1,958	3,138	14,944
Q words	9.3	9.3	9.5	9.3 $\pm$ 3.5
CA. words	5.7	5.4	5.4	5.6 $\pm$ 4.1
Movies with Video Clips
Movies	93	21	26	140
QA pairs	4,318	886	1,258	6,462
Video clips	4,385	1,098	1,288	6,771
Mean clip Length	201.0 s	198.5 s	211.4 s	202.7 $\pm$ 216.2 s
Mean QA shots	45.6	49.0	46.6	46.3 $\pm$ 57.1

Table 76: Statistics of the TVQA dataset.

Video Clips	Video Clip Length	Q&A Pairs	Total Duration	Questions per Video Clip	Answers per Video Clip
21,793	60 to 90 s	152,545	460 h	7	5

Table 77: Splits of the TVQA dataset.

Split	Percent (%)	Q&A pairs
Training	80	122,039
Validation	10	15,253
Test	10	15,253

Table 78: Splits of the TVQA+ dataset.

Split	Q&As	Clips	Avg. Span Length (s)	Avg. Video Length (s)	Annotated Images	Bound. Boxes	Categories
Training	23,545	3,364	7.20	61.49	118,930	249,236	2,281
Validation	3,017	431	7.26	61.48	15,350	32,682	769
Test	2,821	403	7.18	61.48	14,188	28,908	680
Total	29,383	4,198	7.20	61.49	148,468	310,826	2,527

Table 79: Exemplar Video Question Answering architectures.

Approach	Video	Frame	Language	Combined	Optimizer	RL
(?)	C3D	ResNet-152	LSTM	ST-VQA	ADAM	✗
(?)	-	R-CNN+ResNet-101	BiLSTM	Two-stream	-	✗
(?)	-	R-CNN+ResNet-101	BERT	STAGE	ADAM	✗

Table 80: Accuracy attained on the TVQA test (public) set. All models use timestamp annotations; without them, their scores are lower.

Model	Accuracy ($\uparrow$)
Random	20.00
Retrieval-SkipThought	24.77
Longest Answer	30.22
NNS-SkipThought (Subtitle)	38.29
NNS-TFIDF (Subtitle)	50.79
Two-stream (Subtitle+Videos) (?)	66.36
Three-stream (Subtitle+Videos+Questions) (?)	68.48

Table 81: Results obtained on the TVQA+ test set.

Model	Accuracy	Grd. mAP ($\uparrow$)	Temp. mIOU ($\uparrow$)	ASA ($\uparrow$)
ST-VQA (?)	48.28	-	-	-
Two-stream (?)	68.13	-	-	-
STAGE-LXMERT (?)	71.46	21.01	26.31	18.04
STAGE (?)	74.83	27.34	32.49	22.23
Human (?)	90.46	-	-	-

Table 82: Splits of the CLEVR dataset.

Split	Images	Questions	Unique Questions	Overlap with train
Training	70,000	699,989	608,607	-
Validation	15,000	149,991	140,448	17,338
Test	15,000	149,988	140,352	17,335
Total	100,000	999,968	853,554	-

Table 83: Splits of the NLVR dataset. Test-P and Test-U denote the public and unreleased test sets, respectively.

Split	Unique Sentences	Examples
Training	3,163	74,460
Validation	267	5,940
Test-P	266	5,934
Test-U	266	5,910
Total	3,962	92,244

Table 84: Splits of the NLVR2 dataset. Test-P denotes Test set Public, whereas Test-U means Test set Unreleased.

Split	Unique Sentences	Examples
Training	23,671	86,373
Validation	2,018	6,982
Test-P	1,995	6,967
Test-U	1,996	6,970
Total	29,680	107,292

Table 85: Conditions in the CLEVR-CoGenT dataset.

Geometrical Shape	Condition	Colors of Geometrical Shape
Cubes	A	gray, blue, brown, yellow
Cubes	B	red, green, purple, cyan
Cylinders	A	red, green, purple, cyan
Cylinders	B	gray, blue, brown, yellow
Spheres	A	any color
Spheres	B	any color
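
Table 85 can be read as a lookup from (condition, shape) to the set of permitted colors. The snippet below is a minimal sketch of that lookup, assuming lowercase singular shape names. `COGENT_COLORS` and `is_allowed` are hypothetical names introduced here for illustration and are not part of the dataset's tooling.

```python
# Color sets copied from Table 85; the two conditions swap them between
# cubes and cylinders, while spheres are unconstrained in both conditions.
_SET_1 = frozenset({"gray", "blue", "brown", "yellow"})
_SET_2 = frozenset({"red", "green", "purple", "cyan"})

COGENT_COLORS = {
    "A": {"cube": _SET_1, "cylinder": _SET_2},
    "B": {"cube": _SET_2, "cylinder": _SET_1},
}


def is_allowed(condition, shape, color):
    """Return True if an object of this shape and color may appear under the condition."""
    if condition not in COGENT_COLORS:
        raise ValueError(f"unknown condition: {condition!r}")
    if shape == "sphere":
        # Spheres may take any color under both conditions.
        return True
    return color in COGENT_COLORS[condition][shape]


if __name__ == "__main__":
    print(is_allowed("A", "cube", "gray"), is_allowed("B", "cube", "gray"))
```

Because the color sets are swapped between the two conditions, a cube or cylinder that is valid under condition A is never valid under condition B.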

Table 86: Splits of the CLEVR-CoGenT dataset.

Split	Condition	Images	Questions
Training	A	70,000	699,960
Validation	A	15,000	150,000
Validation	B	15,000	149,991
Test	A	15,000	149,980
Test	B	15,000	149,992

Table 87: Statistics and splits of the GQA dataset.

Images	Questions	Vocabulary Size	Training	Validation	Testing	Challenge
113,018	22,669,678	3,097	70%	10%	10%	10%

Table 88: Statistics of the RAVEN dataset.

Images	RPM Problems	Tree-structure per problem	Structural Labels	Rule Annotations	Avg. rules per problem
1,120,000	70,000	16	1,120,000	440,000	6.29

Table 89: High-level statistics of the VCR dataset. One fold of the dataset was held out for blind evaluation at a later date; hence, the statistics of that fold are not shown here.

Dataset Characteristic	Train	Validation	Test
Number of questions	212,923	26,534	25,263
Number of answers per question	4	4	4
Number of rationales per question	4	4	4
Number of images	80,418	9,929	9,557
Number of movies covered	1,945	244	189
Average question length	6.61	6.63	6.58
Average answer length	7.54	7.65	7.55
Average rationale length	16.16	16.19	16.07
Average num. of objects mentioned	1.84	1.85	1.82

Table 90: Statistics and splits of the Visual Commonsense Graph dataset.

Split	Images/Places	Events at Present	Inferences on Events Before	Inferences on Intents at Present	Inferences on Events After	Total Inferences
Train	47,595	111,796	467,025	237,608	469,430	1,174,063
Dev	5,973	13,768	58,773	28,904	58,665	146,332
Test	5,968	13,813	58,413	28,568	58,323	145,309
Total	59,356	139,377	584,211	295,080	586,418	1,465,704

Table 91: Exemplar Image Reasoning architectures. “Custom” - Own CNN architecture.

Approach	Image	Language	Combined	Optimizer	RL
(?)	ResNet-101	LSTM	SA+MLP	ADAM	✗
(?)	VGG	LSTM	N2NMN	ADAM	✓
(?)	ResNet-101	LSTM	PGEE	ADAM	✓
(?)	Custom	LSTM	RN	ADAM	✗
(?)	ResNet-101	BiLSTM	ACMN	ADAM	✗
(?)	ResNet-101	GRU	FiLM	ADAM	✗
(?)	ResNet-101	BiLSTM	MAC	ADAM	✗
(?)	ResNet-101	-	TbD	ADAM	✗
(?)	ResNet-152	LSTM	FinalDestGraph	ADAM	✗
(?)	ResNet-101	LSTM	LCGN	ADAM	✗
(?)	ResNet-34	BiGRU	NS-CL	-	✓

Table 92: Comparison of different models on the CLEVR dataset.

Model	Count	Exist	CN	QA	CA	Overall
CNN+LSTM+SA+MLP (?)	59.7	77.9	75.1	80.9	70.8	73.2
N2NMN+700KProgLabel (?)	68.5	85.7	84.9	90.0	88.7	83.7
PGEE+700KProgLabel (?)	92.7	97.1	98.7	98.1	98.9	96.9
CNN+LSTM+RN (?)	90.1	97.8	93.6	97.9	97.1	95.5
ACMN (?)	94.2	81.3	81.6	90.5	97.1	89.3
CNN+GRU+FiLM (?)	94.3	99.1	96.8	99.1	99.1	97.7
MAC (?)	97.2	99.5	99.4	99.3	99.5	98.9
TbD+700KProgLabel (?)	97.6	99.2	99.4	99.5	99.6	99.1
FinalDestGraph (?)	91.3	98.6	99.6	99.5	99.8	97.5
LCGN+single-hop (?)	-	-	-	-	-	97.9
NS-CL (?)	98.2	98.8	99.0	99.3	99.1	98.9

Table 93: Comparison of accuracy (%) scores of different methods on the validation (val), test-dev, and test splits of the GQA dataset.

Model	val	test-dev	test
CNN+LSTM (?)	49.2	-	46.6
Bottom-up (?)	52.2	-	49.7
MAC (?)	57.5	-	54.1
LCGN+single-hop (?)	63.8	55.6	56.0

Table 94: Comparison of accuracy (%) scores of different models on the validation (val) and test splits of the VCR dataset.

	(Q $\to$ A)		(QA $\to$ R)		(Q $\to$ AR)
Model	val	test	val	test	val	test
R2C (?)	63.8	65.1	67.2	67.3	43.1	44.0
ViLBERT (?)	72.4	73.3	74.5	74.6	54.0	54.8
B2T2 (?)	71.9	72.6	76.0	75.7	54.9	55.0
VL-BERT (?)	73.7	74.0	74.5	74.8	55.0	55.5
Unicoder-VL (?)	72.6	73.4	74.5	74.4	54.5	54.9
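
The three column groups in Table 94 correspond to answer selection (Q → A), rationale selection given the correct answer (QA → R), and the joint setting (Q → AR). The sketch below shows how the three accuracies relate. It assumes the usual VCR protocol, in which a Q → AR prediction counts as correct only when both the chosen answer and the chosen rationale are correct. The function name `vcr_accuracies` and the per-example boolean inputs are hypothetical and introduced only for illustration.

```python
def vcr_accuracies(answer_correct, rationale_correct):
    """Compute Q->A, QA->R, and Q->AR accuracies (%) from per-example correctness flags.

    answer_correct[i]    -- True if the model picked the right answer for example i.
    rationale_correct[i] -- True if the model picked the right rationale for example i
                            (assumed to be evaluated with the correct answer given).
    """
    if len(answer_correct) != len(rationale_correct):
        raise ValueError("both lists must have one entry per example")
    n = len(answer_correct)
    if n == 0:
        raise ValueError("at least one example is required")
    # Joint correctness: both the answer and the rationale must be right.
    joint = [a and r for a, r in zip(answer_correct, rationale_correct)]
    return {
        "Q->A": 100.0 * sum(answer_correct) / n,
        "QA->R": 100.0 * sum(rationale_correct) / n,
        "Q->AR": 100.0 * sum(joint) / n,
    }


if __name__ == "__main__":
    # Toy flags for four examples (not taken from any model in Table 94).
    print(vcr_accuracies([True, True, False, True], [True, False, True, True]))
```

Under this assumption the joint accuracy can never exceed either of the other two, which is consistent with the Q → AR columns being the lowest in every row of the table.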

Table 95: Comparison of accuracy (%) scores of different models on the RAVEN dataset.

Model	Acc	2x2 Grid	3x3 Grid	L-R	U-D	O-IC	O-IG
WReNDRT (?)	15.02	23.26	29.51	6.99	8.43	8.93	12.35
ResNetDRT (?)	59.56	46.53	50.40	65.82	67.11	69.09	60.11
Human (?)	84.41	81.82	79.55	86.36	81.81	86.36	81.81
PerfectSolver	100	100	100	100	100	100	100

Table 96: Splits of the COG dataset.

Split	Total Examples	Examples per Task Family
Training	10,000,320	227,280
Validation	500,016	11,364
Test	500,016	11,364

Table 97: Exemplar Video Reasoning architectures.

Approach	Video	Frame	Language	Combined	RL
(?)	-	Custom	LSTM	WorkMemory	✗
(?)	-	ResNet-152	LSTM	FinalDestGraph	✗

Table 98: Comparison of measures using different methods on the COG dataset.

Model	Atts	Condit	Point	Yes/No	All
WorkMemory (?)	-	-	-	-	93.7
QuestionNodes (?)	73.7	63.5	92.5	57.9	63.3
FinalDestGraph (?)	99.2	98.4	100.0	95.0	97.2

Table 99: Splits of the V-SNLI dataset.

Split	Entailment	Neutral	Contradiction
Training	182,167	181,515	181,938
Validation	3,329	3,235	3,278
Test	3,368	3,219	3,237
V-SNLI $_{hard}$ Test	1,058	1,068	1,135

Table 100: Splits of the SNLI-VE dataset.

Split	Images	Entailment	Neutral	Contradiction	Vocab
Training	29,783	176,932	176,045	176,550	29,550
Validation	1,000	5,959	5,960	5,939	6,576
Test	1,000	5,973	5,964	5,964	6,592

Table 101: Exemplar Image Entailment architectures.

Approach	Image	Language	Combined	Optimizer	RL
(?)	VGG	BiLSTM	V-BiMPM	ADAM	✗
(?)	ResNet-101	GRU	EVE-Image	ADAM	✗

Table 102: Comparison of accuracies (%) of different models on the SNLI-VE dataset.

Model	Contradiction	Neutral	Entailment	Overall
Relation Network (?)	67.29	68.86	66.50	67.55
Bottom-up (?)	70.52	70.96	65.23	68.90
Top-Down (?)	69.72	69.33	71.86	70.3
Hypothesis Only (?)	67.60	67.71	64.83	66.71
EVE-ROI (?)	67.69	69.45	74.25	70.47
EVE-Image (?)	71.56	70.52	71.39	71.16

Table 103: Comparison of accuracies (%) of different models on the V-SNLI dataset.

Model	Contradiction	Neutral	Entailment	Overall
Hypothesis Only (?)	66.29	66.36	72.65	68.49
LSTM (blind) (?)	79.7	76.79	87.71	81.49
V-LSTM (?)	71.39	68.06	87.14	75.70
BiMPM (?)	86.25	82.79	90.03	86.41
V-BiMPM (?)	87.53	82.91	90.38	86.99

Table 104: Comparison of accuracy (%) scores of various models on V-SNLI $_{hard}$.

Model	Contradiction	Neutral	Entailment	Overall
Hypothesis Only (?)	25.29	20.22	31.28	25.57
LSTM (blind) (?)	60.79	50.19	72.12	60.99
V-LSTM (?)	46.34	32.02	69.09	49.03
BiMPM (?)	77.62	59.36	80.43	72.55
V-BiMPM (?)	76.12	63.67	81.38	73.75

Table 105: Statistics of different video sources in the VIOLIN dataset.

Video Source (TV Show/Movie Clips)	Num. of Episodes	Num. of Clips	Avg. Clip Len	Avg. Pos. Stmnt Len	Avg. Neg. Stmnt Len	Avg. Subtitle Len
Friends	234	2,676	32.89s	17.94	17.85	72.80
Desperate Housewives	180	3,466	32.56s	17.79	17.81	69.19
How I Met Your Mother	207	1,944	31.64s	18.08	18.06	76.78
Modern Family	210	1,917	32.04s	18.52	18.20	98.50
MovieClips	5,885	5,885	40.00s	17.79	17.81	69.20
All	6,716	15,887	35.20s	18.10	18.04	76.40

Table 106: Splits of the VIOLIN dataset.

Split	Number of Videos (V)	Number of Hypotheses (H)	Number of Triplets (V, S, H)
Training	12,687	76,122	76,122
Validation	1,600	9,600	9,600
Testing	1,600	9,600	9,600
Total	15,887	95,322	95,322

Table 107: Exemplar Video Entailment architectures. SSV - Statement+Subtitles+Visual.

Approach	Video	Frame	Language	Combined	Optimizer	RL
(?)	-	Detection Feat	BERT	SSV	ADAM	✗

Table 108: Comparison of accuracies (%) of different methods on the VIOLIN dataset.

Model	Visual	Text	Accuracy
Statement (?)	-	BERT	54.20
Statement+Visual (?)	Detection Feat	BERT	59.45
Statement+Subtitles (?)	-	BERT	66.05
SSV (?)	LXMERT	LXMERT	66.25
SSV (?)	Detection Feat	BERT	67.84

Table 109: Splits of the VisDial v0.9 dataset.

Split	Images	Questions	Answers	Dialog Turns
Training	82,783	827,830	827,830	10
Validation	40,504	405,040	405,040	10
Test	-	-	-	-

Table 110: Splits of the VisDial v1.0 dataset.

Split	Images	Questions	Answers	Dialog Turns
Training	123,287	1,232,870	1,232,870	10
Validation	2,064	20,640	20,640	10
Test	8,000	80,000	80,000	1

Table 111: Statistics of the CLEVR-Dialog dataset.

CLEVR Images	Total Dialogs	Total Questions	Unique Questions	Unique Answers	Vocabulary Size	Dialog Turns	Mean Ques. Length
85k	425k	4.25M	73k	29	125	10	10.6

Table 112: Splits of the CLEVR-Dialog dataset.

Split	Images	Q&A Pairs	Instances	Dialog Rounds
Training	70,000	3.5M	5	10
Validation	15,000	0.75M	5	10
Test	-	-	-	-

Table 113: Exemplar Image Dialog Architectures (Discriminative and Generative).

Approach	Image	Language	Combined	RL
(?)	VGG	LSTM	MemoryNetwork	✗
(?)	VGG	LSTM	HCIAE-NP-ATT	✗
(?)	VGG	LSTM	AMEM	✗
(?)	VGG	LSTM	SF	✗
(?)	ResNet-152	LSTM	CorefNMN	✗
(?)	VGG	LSTM	CoAtt-GAN	✓
(?)	ResNet-152	LSTM	RvA	✗
(?)	VGG	LSTM	GNN	✗
(?)	ResNet-101	LSTM	Synergistic	✗

Table 114: Results of different discriminative models on the validation split of the VisDial v0.9 dataset.

Model	MRR	R@1	R@5	R@10	Mean
LF (?)	0.5807	43.82	74.68	84.07	5.78
HRE (?)	0.5846	44.67	74.50	84.22	5.72
HREA (?)	0.5868	44.82	74.81	84.36	5.66
MN (?)	0.5965	45.55	76.22	85.37	5.46
HCIAE-NP-ATT (?)	0.6222	48.48	78.75	87.59	4.81
AMEM (?)	0.6227	48.53	78.66	87.43	4.86
CoAtt (?)	0.6398	50.29	80.71	88.81	4.47
SF (?)	0.6242	48.55	78.96	87.75	4.70
SCA (?)	0.6398	50.29	80.71	88.81	4.47
CorefNMN (?)	0.641	50.92	80.18	88.81	4.45
GNN (?)	0.6285	48.95	79.65	88.36	4.57
RvA (?)	0.6634	52.71	82.97	90.73	3.93
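
The columns of Table 114 are standard ranking measures computed from the position of the ground-truth answer in each ranked candidate list: mean reciprocal rank (MRR, higher is better), recall at k (R@k, the percentage of questions whose ground-truth answer is ranked in the top k, higher is better), and the mean rank (lower is better). A minimal sketch of these definitions follows. The function name `ranking_metrics` and the toy ranks are hypothetical and are not taken from any of the evaluated models.

```python
def ranking_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute MRR, R@k (%), and mean rank from 1-based ground-truth ranks.

    gt_ranks -- for each question, the rank (1 = best) that the model assigned
                to the ground-truth answer among the candidate answers.
    """
    if not gt_ranks:
        raise ValueError("at least one rank is required")
    if any(r < 1 for r in gt_ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    n = len(gt_ranks)
    metrics = {
        # Mean reciprocal rank: average of 1/rank over all questions.
        "MRR": sum(1.0 / r for r in gt_ranks) / n,
        # Mean rank of the ground-truth answer.
        "Mean": sum(gt_ranks) / n,
    }
    for k in ks:
        # Recall@k: share of questions with the ground truth in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in gt_ranks) / n
    return metrics


if __name__ == "__main__":
    # Toy ranks for four questions, chosen only to illustrate the arithmetic.
    print(ranking_metrics([1, 2, 4, 10]))
```

The median rank (MedRank) reported in the storytelling tables could be obtained in the same way by taking the median of `gt_ranks` instead of the mean.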

Table 115: Results of different generative models on the validation split of the VisDial v0.9 dataset.

Model	MRR	R@1	R@5	R@10	Mean
LF (?)	0.5199	41.83	61.78	67.59	17.07
HRE (?)	0.5237	42.29	62.18	67.92	17.07
HREA (?)	0.5242	42.28	62.33	68.17	16.79
MN (?)	0.5259	42.29	62.85	68.88	17.06
HCIAE-NP-ATT (?)	0.5386	44.06	63.55	69.24	16.01
CorefNMN (?)	0.535	43.66	63.54	69.93	15.69
CoAtt (?)	0.5411	44.32	63.82	69.75	16.47
CoAtt-RL (?)	0.5578	46.10	65.69	71.74	14.43
RvA (?)	0.5543	45.37	65.27	72.97	10.71

Table 116. Table 116: Results of different discriminative models on the test-standard split of the VisDial v1.0 dataset.

Model	MRR	R@1	R@5	R@10	Mean	NDCG
LF (?)	0.5542	40.95	72.45	82.83	5.95	0.4531
LF-att (?)	0.5707	42.08	74.83	85.05	5.59	0.4976
HRE (?)	0.5416	39.93	70.45	81.50	6.41	0.4546
MN (?)	0.5549	40.98	72.30	83.30	5.92	0.4750
MN-att (?)	0.5690	42.43	74.00	84.35	5.59	0.4958
CorefNMN (?)	0.615	47.55	78.10	88.80	4.40	0.547
GNN (?)	0.6137	47.33	77.98	87.83	4.57	0.5282
RvA (?)	0.6303	49.03	80.40	89.83	4.18	0.5559
Synergistic-ensemble (?)	0.6342	49.30	80.77	90.68	3.97	0.5788

Table 117. Table 117: Splits of the AVSD dataset.

Split	Dialogs	Turns	Words
Training	7,985	123,480	1,163,969
Validation	1,863	14,680	138,314
Test	1,968	14,660	138,790

Table 118. Table 118: Exemplar Video Dialog architectures.

Approach	Video	Frame	Language	Combined	Optimizer	RL
(?)	I3D	VGG	LSTM	MultimodalAtt	ADAM	✗
(?)	I3D	VGG	LSTM	i3d-rgb-spatial-10	ADAM	✗

Table 119. Table 119: Results of different models on the “AVSD” dataset.

Model	B-1	B-2	B-3	B-4	METEOR	CIDEr
Att-base (?)	0.273	0.173	0.117	0.084	0.117	0.766
Att-weightshare (?)	0.293	0.191	0.133	0.097	0.127	0.923
i3d-rgb-spatial-10 (?)	0.290	0.190	0.133	0.097	0.127	0.928
Att-base-beam (?)	0.285	0.187	0.131	0.096	0.128	0.941

Table 120. Table 120: Splits of Multi30k-MMT for English, German, French, and Czech.

Split	Images	Captions
Training	29,000	29,000
Validation	1,014	1,014
Test	1,000	1,000

Table 121. Table 121: Exemplar Machine Translation with Image architectures. * - compares with ResNet-50 and VGG also.

Approach	Image	Language	Combined	Optimizer	RL
(?)	ResNet-50	BiGRU	DoubleAtt	Adadelta	✗
(?)	VGG	BiGRU	GVF	Adadelta	✗
(?)	Inception-V3*	BiGRU	Imagination	ADAM	✗
(?)	ResNet-50	BiGRU	Lium-cvc-ensemble	ADAM	✗
(?)	ResNet-50	BiGRU	VMMT $_{F}$	ADAM	✗
(?)	ResNet-50	LSTM	CUNI-ensemble	ADAM	✗

Table 122. Table 122: Machine Translation with Image on the Multi30k test set [2016 (en $\to$ de), 2017 (en $\to$ fr), 2018 (en $\to$ cs)].

Results of Different Methods
Model	Metric	en $\to$ de	en $\to$ fr	en $\to$ cs
DoubleAtt (?)	BLEU	36.5	-	-
DoubleAtt (?)	METEOR	55.0	-	-
GVF (?)	BLEU	37.3	-	-
GVF (?)	METEOR	55.1	-	-
Imagination (?)	BLEU	36.8	-	-
Imagination (?)	METEOR	55.8	-	-
Lium-cvc-ensemble (?)	BLEU	41.0	56.7	-
Lium-cvc-ensemble (?)	METEOR	60.5	73.0	-
VMMT $_{F}$ (?)	BLEU	37.6	-	-
VMMT $_{F}$ (?)	METEOR	56.0	-	-
CUNI-ensemble (?)	BLEU	42.6	62.8	35.9
CUNI-ensemble (?)	METEOR	59.4	77.0	32.7

Table 123. Table 123: Machine Translation with Image on Multi30k test set [2018 (en $\to$ de, en $\to$ fr, en $\to$ cs)].

Results of Different Methods
Model	Metric	en $\to$ de	en $\to$ fr	en $\to$ cs
CUNI-single (?)	BLEU	32.5	40.6	31.8
CUNI-single (?)	METEOR	52.3	61.0	30.6
MeMAD (?)	BLEU	38.5	44.1	-
MeMAD (?)	METEOR	56.6	64.3	-

Table 124. Table 124: Splits of the VATEX dataset. Secret Test denotes human-annotated captions held out for organizing challenges; hence, this split is unavailable to the public.

Split	Videos	Action Label
Training	25,991	✓
Validation	3,000	✓
Public Test	6,000	-
Secret Test	6,278	-

Table 125. Table 125: Exemplar Machine Translation with Video architectures.

Approach	Video	Frame	Language	Combined	Optimizer	RL
(?)	I3D	-	LSTM	NMT+LSTM VI	ADAM	✗

Table 126. Table 126: Comparison of different methods on the VATEX dataset.

Model	B-4	METEOR
NMT+LSTM VI (?) [English $\to$ Chinese]	30.20	-
NMT+LSTM VI (?) [Chinese $\to$ English]	27.18	-

Table 127. Table 127: Splits of the Oxford-102 dataset with image descriptions.

Split	Images	Captions per Image	Total Captions
Training	5,878	10	58,780
Validation	1,156	10	11,560
Test	1,155	10	11,550
Total	8,189	10	81,890

Table 128. Table 128: Splits of the CUB dataset with image descriptions.

Split	Images	Captions per Image	Total Captions
Training	8,855	10	88,550
Validation	-	-	-
Test	2,933	10	29,330
Total	11,788	10	117,880

Table 129. Table 129: Splits of the MSCOCO-Gen dataset.

Split	Images	Captions per Image	Total Captions
Training	82,783	5	413,915
Validation	-	-	-
Test	40,504	5	202,520
Total	123,287	5	616,435

Table 130. Table 130: Exemplar Language-to-Image Generation architectures.

Approach	Image	Language	Combined	Optimizer	RL
(?)	-	char-CNN-RNN	GAN-INT-CLS	ADAM	✗
(?)	-	char-CNN-GRU	GAWWN	ADAM	✗
(?)	-	-	StackGAN	ADAM	✗
(?)	Inception-v3	BiLSTM	AttGAN	-	✗
(?)	-	BiLSTM	MirrorGAN	-	✗

Table 131. Table 131: Comparison of different methods using generated images of different resolutions on the “CUB” dataset. R-precision (%) for 256x256 with AttGAN (53.31) and MirrorGAN (57.67). HR - Human Ranking.

Model	Resolution	IS	FID	HR
GAN-INT-CLS (?)	64x64	2.88 $\pm$ .04	68.79	2.76 $\pm$ .01
GAWWN (?)	64x64	3.10 $\pm$ .03	53.51	-
GAWWN (?)	128x128	3.62 $\pm$ .07	72.65	1.95 $\pm$ .02
StackGAN (?)	64x64	3.02 $\pm$ .03	35.11	-
StackGAN (?)	256x256	3.70 $\pm$ .04	51.89	1.29 $\pm$ .02
StackGAN++ (?)	256x256	4.04 $\pm$ .05	15.30	1.19 $\pm$ .02
AttGAN (?)	256x256	4.36 $\pm$ .03	-	-
MirrorGAN (?)	256x256	4.56 $\pm$ .05	-	-

Table 132. Table 132: Comparison of different methods using generated images of different resolutions on the “Oxford-102” dataset.

Model	Resolution	IS	FID	HR
GAN-INT-CLS (?)	64x64	2.66 $\pm$ .03	79.55	1.84 $\pm$ .02
StackGAN (?)	64x64	2.73 $\pm$ .03	43.02	-
StackGAN (?)	256x256	3.20 $\pm$ .01	55.28	1.16 $\pm$ .02
StackGAN++ (?)	256x256	3.26 $\pm$ .01	48.68	1.30 $\pm$ .03

Table 133. Table 133: Comparison of different methods using generated images of different resolutions on the “COCO” dataset. R-precision (%) for 256x256 with AttGAN (72.13) and MirrorGAN (74.52).

Model	Resolution	IS	FID	HR
GAN-INT-CLS (?)	64x64	7.88 $\pm$ .07	60.62	1.82 $\pm$ .03
StackGAN (?)	64x64	8.35 $\pm$ .11	33.88	-
StackGAN (?)	256x256	8.45 $\pm$ .03	74.05	1.18 $\pm$ .03
StackGAN++ (?)	256x256	8.30 $\pm$ .10	81.59	1.55 $\pm$ .05
PPGN (?)	256x256	9.58 $\pm$ .21	-	-
AttGAN (?)	256x256	25.89 $\pm$ .47	-	-
MirrorGAN (?)	256x256	26.47 $\pm$ .41	-	-

Table 134. Table 134: Splits of Text2Video (Combines all categories).

Split	Videos
Training	2800
Validation	400
Test	800

Table 135. Table 135: Exemplar Language-to-Video Generation architectures.

Approach	Video	Frame	Language	Combined	Optimizer	RL
(?)	MotionFeatures	-	LSTM	T2V	ADAM	✗

Table 136. Table 136: Comparison of accuracy (%) scores of different models on Text2Video.

Model	Accuracy
DT2V-baseline (?)	0.101
PT2V (?)	0.134
GT2V (?)	0.192
T2V (?)	0.426

Table 137. Table 137: Splits of the R2R dataset.

Split	Scenes	Navigation Instructions
Training	61	14,025
Validation (seen)	11	1,020
Validation (unseen)	11	2,349
Test	18	4,173

Table 138. Table 138: Splits of the ASKNAV dataset.

Split	Data points	Goals
Training	94,798	139,757
Validation (seen)	4,874	7,768
Validation (unseen)	5,005	8,245
Test (seen)	4,917	7,470
Test (unseen)	5,001	7,537

Table 139. Table 139: Statistics of the TOUCHDOWN dataset. Vocabulary Size and Text Length are computed by combining the training and validation sets.

Dataset	Dataset Size	Vocab. Size	Mean Text Length
TOUCHDOWN (Complete task)	9,326	5,625	108.0
Navigation Only	9,326	4,999	89.6
SDR Only	25,575	3,419	29.7

Table 140. Table 140: Splits of the TOUCHDOWN dataset.

Task	Split	Examples
Complete & Navigation Only	Training	6,526
Complete & Navigation Only	Validation	1,391
Complete & Navigation Only	Test	1,409
SDR Only	Training	17,880
SDR Only	Validation	3,836
SDR Only	Test	3,859

Table 141. Table 141: Statistics of the CVDN dataset.

Navigation Dialogs (Human-Human)	Navigation Trajectories	Total Scenes (MatterPort houses)
2,050	7,000	83

Table 142. Table 142: Splits of the ALFRED dataset.

Data Split	Fold	Number of Scenes	Number of Annotations
Training	-	108	21,023
Validation	Seen	88	820
Validation	Unseen	4	821
Testing	Seen	107	1,533
Testing	Unseen	8	1,529

Table 143. Table 143: Exemplar Image-and-Language Navigation architectures.

Approach	Image	Language	Combined	Optimizer	RL
(?)	ResNet-152	LSTM	Seq-to-Seq	ADAM	✗
(?)	ResNet-152	LSTM	RPA	-	✓
(?)	ResNet-152	LSTM	Speaker-Follower	-	✓
(?)	ResNet-152	LSTM	RCM	ADAM	✓
(?)	ResNet-152	LSTM	Self-Monitoring	ADAM	✗
(?)	ResNet-152	LSTM	BackTranslation	RMSprop	✓
(?)	-	LSTM	FAST	-	✗

Table 144. Table 144: Comparison of different methods on the R2R test set.

Model	PL	NE	OSR	SR	SPL
Random	9.89	9.79	18.3	13.2	12
Seq-to-Seq (?)	8.13	7.85	26.6	20.4	18
RPA (?)	9.15	7.53	32.5	25.3	23
Speaker-Follower (?)	14.82	6.62	44.0	35.0	28
Self-Monitoring (?)	18.0	-	-	48.0	35
RCM (?)	15.22	6.01	50.8	43.1	35
BackTranslation-Single (?)	11.7	-	-	51.5	47
TacticalRewind-Greedy (?)	22.08	5.14	-	54	41
BackTranslation-PreExplore (?)	9.79	-	-	63.9	61
BackTranslation-Beam (?)	687	-	-	68.9	1
FAST-Beam (?)	196.53	4.29	-	61.0	3

Table 145. Table 145: Comparison of different methods on the seen validation set of R2R.

Model	PL	NE	OSR	SR	SPL
Speaker-Follower (?)	-	3.36	73.8	66.4	-
RCM+SIL (?)	10.13	2.78	79.7	73.0	-
BackTranslation-Single (?)	11.0	3.99	-	62.1	59
TacticalRewind-Greedy (?)	-	-	-	-	-
BackTranslation-PreExplore (?)	9.92	4.84	-	54.7	52
BackTranslation-Beam (?)	703	2.52	-	75.7	1
FAST-Beam (?)	188.6	3.13	-	70.0	4

Table 146. Table 146: Comparison of different methods on the unseen validation set of R2R.

Model	PL	NE	OSR	SR	SPL
Speaker-Follower (?)	-	3.36	73.8	66.4	-
RCM+SIL (?)	10.13	2.78	79.7	73.0	-
BackTranslation-Single (?)	10.7	5.22	-	52.2	48
TacticalRewind-Greedy (?)	21.17	4.97	-	56.0	43
BackTranslation-PreExplore (?)	9.57	3.78	-	64.5	61
BackTranslation-Beam (?)	663	3.08	-	69.0	1
FAST-Beam (?)	224.42	4.03	-	63.0	2

Table 147. Table 147: Major Vision-and-Language Pretraining Architectures and their support for various Vision and Language Tasks. VDG - Visual Description Generation, VS - Visual Storytelling, VRE - Visual Referring Expression, VQA - Visual Question Answering, VR - Visual Reasoning, VE - Visual Entailment, VDiag - Visual Dialog, MMT - Multimodal Machine Translation, LVG - Language-to-Vision Generation, VLN - Vision-and-Language Navigation.

Approach	VDG	VS	VRE	VQA	VR	VE	VDiag	MMT	LVG	VLN
					Single-stream
Unicoder-VL	✗	✗	✗	✗	✓	✗	✗	✗	✗	✗
VL-BERT	✗	✗	✓	✓	✓	✗	✗	✗	✗	✗
VideoBERT	✓	✗	✗	✗	✗	✗	✗	✗	✗	✗
VLP	✓	✗	✗	✓	✗	✗	✗	✗	✗	✗
OSCAR	✓	✗	✗	✓	✓	✗	✗	✗	✗	✗
B2T2	✗	✗	✗	✗	✓	✗	✗	✗	✗	✗
UNITER	✗	✗	✓	✓	✓	✓	✗	✗	✗	✗
VinVL	✓	✗	✗	✓	✓	✗	✗	✗	✗	✗
					Two-stream
ViLBERT	✗	✗	✓	✓	✓	✗	✗	✗	✗	✗
LXMERT	✗	✗	✗	✓	✓	✗	✗	✗	✗	✗


Full text

Trends in Integration of Vision and Language Research:

A Survey of Tasks, Datasets, and Methods

Aditya Mogadala

Marimuthu Kalimuthu

Dietrich Klakow

Spoken Language Systems (LSV)

Saarland Informatics Campus

Saarland University

66123 Saarbrücken, Germany

Abstract

Interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing. Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks. This has created significant interest in the integration of vision and language. In this survey, we focus on ten prominent tasks that integrate language and vision by discussing their problem formulation, methods, existing datasets, evaluation measures, and compare the results obtained with corresponding state-of-the-art methods. Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video. Furthermore, we also provide some potential future directions in this field of research with an anticipation that this survey stimulates innovative thoughts and ideas to address the existing challenges and build new applications.

1 Introduction

The recent wave of unprecedented progress in deep learning methods has advanced the fields of Computer Vision (CV) and Natural Language Processing (NLP) to an extent that they are now making significant progress across several challenging tasks. Independent of NLP, computer vision has achieved prominent improvements in tasks such as visual content classification (?), object detection (?), semantic segmentation (?), etc., using large annotated datasets or by employing self-supervision (?) on large-scale unlabeled data. Similarly, independent of computer vision, NLP has seen a surge of interest in solving multiple tasks at once with unsupervised pretraining of language models (?, ?, ?, ?) using large unlabeled corpora. However, there is a growing interest in solving challenges that combine linguistic and visual information from these traditionally independent fields. The methods which address the challenge of integration should provide a complete understanding of visual and/or textual content, and are expected to (1) generate comprehensible but concise and grammatically well-formed descriptions of the visual content, or vice versa by generating the visual content for a given textual description in a natural language of choice, (2) identify objects in the visual content and infer their relationships to reason about, or answer arbitrary questions about them, (3) navigate through an environment by leveraging input from both vision and natural language instructions, (4) translate textual content from one language to another while leveraging the visual content for sense disambiguation, (5) generate stories about the visual content, and so on. Designing methods which can process and relate information from multiple modalities (i.e., linguistic and visual information) is usually considered to be a sub-part of multimodal learning models (?).

Efficiently solving the above mentioned and other related challenges can result in many potential real-world applications. For example, visually impaired individuals can be assisted by visual scene understanding, where they can get information about a scene from generated descriptions and by being able to ask questions about it. Other applications include automatic surveillance (?), autonomous driving (?), human-computer interaction (?), city navigation (?), and so on. Also, solving such challenges can serve as an excellent test bed for computer vision and NLP systems, one that is much more intelligent and comprehensive than independent computer vision and NLP evaluations.

Given such a broad scope for fundamental and applied research, there have been several surveys in recent years aiming to provide a comprehensive overview of the integration of vision and language tasks. These surveys, however, have restricted themselves to covering specific vision and language integration tasks such as image description (?, ?, ?) or video description generation (?), visual question answering (?, ?), action recognition (?) and visual semantics (?). The surveys which went beyond these specific tasks have summarized dataset statistics (?), or provided a comprehensive overview of only NLP tasks such as natural language generation (NLG) (?, ?) and commonsense reasoning (?). There was also an attempt to cover multiple modalities (including sound) (?), but it was structured in a bottom-up manner, giving more importance to the underlying fusion technologies than to the tasks themselves. Also, there was some interest in understanding the limitations of the integration of vision and language research (?); however, that work is limited to the task of language-grounded image understanding. Furthermore, there were ideas to develop theories on the complementarity of language and visual data in human-machine communication from a theoretical point of view (?).

With our efforts in this survey, we go beyond these and present a comprehensive overview of ten different tasks that are prominent in the current integration of vision and language research. We first begin with a background on the traditional tasks in computer vision and NLP separately, and then show how they facilitate the design of the ten prominent tasks for the integration of vision and language modalities in Section 2. Following that, we provide an in-depth exploration of each of the ten tasks and present details about the datasets, methods, results, and open challenges in separate sections beginning from Section 3 and ending at Section 9. In Section 10, we provide details about the joint pretraining of vision and language, which has been gaining momentum in recent years and aims to solve multiple tasks at once using learned representations. This is followed in Section 11 by potential future research directions. Finally, in Section 12, we conclude our survey and offer some insights.

2 Background

In this section, we first briefly introduce some of the standard tasks that are studied in computer vision and NLP separately. We then present how these tasks are modified such that they facilitate the design of ten prominent tasks for the integration of vision and language.

2.1 Computer Vision (CV) Tasks

An array of different tasks is studied in computer vision. Keeping in mind that the underlying goal of computer vision is to describe and explain visual information, we divide these tasks based on where the visual data arises. In this survey, we mainly focus on images and videos as the visual information, although RGB-D and point cloud data are becoming prevalent.

2.1.1 Image as Visual Information

We describe two different aspects of the use of images in computer vision: (1) the tasks where images are used as input, and (2) the representation of images. In the following, we discuss various computer vision tasks that use images as input and present the recent progress and improvements made for representing image data.

Tasks.

The following tasks use images as input: (1) Image Classification (2) Object Localization (3) Object Detection (4) Object Segmentation (5) Object Identification (6) Instance Segmentation and (7) Panoptic Segmentation.

The fundamental difference between the aforementioned tasks is that the majority of them focus on carving out the exact position of a visual object in an image, while the rest provide a predefined class label for an image. There are also advanced tasks that use images as visual information and assist in the integration of computer vision and NLP. These tasks include (1) Image Style Transfer (2) Image Colorization and (3) Image Synthesis.

Representation.

The advent of deep learning (?, ?) has tremendously changed the field of computer vision. Convolutional Neural Networks (CNNs) (?) have become the de facto standard for generating representations of images using end-to-end trainable models.

There are several variations of CNNs that learn image features with supervised or self-supervised techniques (?). Most of these techniques are designed to learn transferable general image features by leveraging tasks presented earlier.

Commonly, transferable global image representations are learned with deep CNN architectures such as AlexNet (?), VGGNet (?), GoogLeNet (?), Inception-v3 (?), Residual Networks (ResNets) (?), DenseNets (?), and EfficientNet (?) using large datasets, viz. ImageNet (https://www.image-net.org) (?), MSCOCO (http://cocodataset.org/#home) (?), and Visual Genome (https://visualgenome.org) (?). However, for some vision and language integration tasks, it is preferred to learn global image features during task-specific training as opposed to independently learning generic, pretrained representations.

For learning local features of objects that are typically represented with bounding boxes in images, the preferred choice is to utilize some region specific CNN architectures such as Region-based CNNs (R-CNN) (?). More recently, there is an interest in using self-attention based approaches, namely Transformers (?) for achieving end-to-end object detection (?).

2.1.2 Video as Visual Information

Similar to images, when a video is used as visual data, we need to consider two crucial aspects: (1) knowing the tasks where videos are used as inputs, and (2) the representation of a video. In the following, we discuss different tasks in computer vision that use video as input and further present the recent progress made in video representations.

Tasks.

Recently, tasks on videos have also been gaining importance, such as (1) Object Tracking (2) Action Classification (3) Emotion Detection (4) Scene Detection and (5) Automated Editing. The core difference between these tasks is that the majority of them focus on tracking a visual object present in a scene of a video, while the rest identify what is happening in a video, such as an action.

Representation.

To account for the temporal nature of videos, RGB images are stacked as frames to form a 4D representation (i.e., video). Usually, visual data observed in videos is extracted in the form of screenshots that are amenable to the same techniques for image local and global representation. However, in addition, spatio-temporal features are also developed with general video analysis such as C3D (?), or from action recognition datasets i.e., Kinetics action recognition (?) to build RGB-D or Inflated 3D ConvNet (I3D) features (?) using different CNN architectures.
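As a rough illustration of the stacking described above, the following sketch builds a toy 4D video from RGB frames using plain nested lists. The axis order (frames, height, width, channels) and the tiny sizes are assumptions made for the example; the survey does not fix a layout, and real pipelines typically use tensor libraries instead.

```python
# Hypothetical toy sizes; real videos are far larger.
T, H, W, C = 4, 2, 3, 3  # frames, height, width, RGB channels


def make_frame(value):
    """Return one H x W x C RGB frame filled with a constant intensity."""
    return [[[value] * C for _ in range(W)] for _ in range(H)]


# Stacking T frames along a new leading (time) axis gives the 4D video.
video = [make_frame(t) for t in range(T)]


def shape(nested):
    """Recover the shape of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


video_shape = shape(video)  # one entry per axis: time, height, width, channel
```

Indexing `video[t]` returns a single frame, which is why per-frame image techniques remain applicable to video input.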

2.2 NLP Tasks

Like in Section 2.1, the fundamental goals of most NLP tasks are to comprehend or generate language. In this section, we describe a few of the popular tasks that drive NLP research. We also discuss current approaches used to represent language.

Tasks.

The aim of NLP tasks is two-fold: (i) understanding language, and (ii) generating language. Some of the classical NLP tasks used to comprehend language are shallow parsing, syntax parsing, semantic role labeling, named entity recognition, entity linking, co-reference resolution, etc. Tasks that generate language in a conditional or unconditional manner include machine translation, text summarization, etc.

Representation.

In deep learning based approaches, language is usually represented either as a bag-of-words or as distributed representations. For words in a sentence, initializations are commonly done with pretrained word embeddings (?, ?). Additionally, to represent variable-length text inputs, sequence learning techniques such as recurrent neural network variations like unidirectional Long Short-Term Memory (LSTM) (?), or bidirectional LSTM (BiLSTM) and unidirectional Gated Recurrent Units (GRUs) (?), or bidirectional GRUs (BiGRUs) are applied. Recently, to provide parallelization in sequential training, self-attention based approaches, viz. Transformers (?), have been employed to build architectures such as BERT (?) and its variations.
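To make the bag-of-words representation mentioned above concrete, here is a minimal sketch. The function name `bag_of_words`, the whitespace tokenization, and the toy vocabulary are assumptions for illustration only.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Map a sentence to a count vector over a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    # Counter returns 0 for words that never occur in the sentence.
    return [counts[word] for word in vocabulary]


# Hypothetical toy vocabulary.
vocabulary = ["a", "dog", "cat", "chases", "the"]
vector = bag_of_words("The dog chases the cat", vocabulary)
reordered = bag_of_words("The cat chases the dog", vocabulary)
```

Both sentences map to the same vector, which shows that a bag-of-words discards word order. This is one reason sequence models such as LSTMs or Transformers are used for variable-length text.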

2.3 CV and NLP Integration Tasks

Over the past few years, significant progress has been made in the research concerning the integration of language and vision. Several tasks exist which combine language, observed at different levels (i.e., words, phrases, sentences, paragraphs, and documents), with visual information, typically represented as images or videos. Initially, most works concentrated on combining low-level linguistic units, such as words with images or videos for building visual-semantic embeddings (?, ?, ?, ?, ?, ?, ?, ?, ?, ?), which are beneficial for downstream applications, as well as understanding adversarial attacks (?) to improve model robustness.

However, it is appealing to look into tasks that go beyond words and consider variable-length texts larger than words as language input. Most of these tasks can be seen as an extension to either CV, NLP, or both. Figure 1 provides an illustration of different tasks and their groupings.

To get a grasp on how these tasks are perceived as a natural extension of tasks in computer vision, NLP, or both, we briefly describe their relation with similar tasks addressed in the respective fields.

Extension of NLP Tasks

• Visual Description Generation is closely related to conditional language modeling (?) or the Natural Language Generation (NLG) (?) tasks in NLP. Given non-linguistic information (e.g., image or video), the goal is to generate a human-readable text snippet that describes the input.

• The task of Visual Storytelling solves a similar problem to visual description generation. However, instead of dealing with a single visual input, a sequence of visual inputs is used to generate a narrative summary based on the text aligned with them. It can be seen that the task is closely aligned to text summarization (?, ?), mostly generating abstractive summaries.

• Visual Question Answering draws its inspiration from text-based question answering (?, ?), which is one of the long-standing topics in NLP research. Here, answering questions about a visual input is perceived as its natural extension.

• The task of Visual Dialog aims at creating a meaningful dialog in natural, conversational language about visual content. It is perceived as the visual analogue of the text-based dialog and conversation systems (?, ?, ?) that have been explored in NLP over many years.

• Visual Referring Expression is an extension of referring expression (?) in natural language generation systems. Also, the sub-problem in visual referring expression (i.e., comprehension) is perceived as an analogue of pragmatics in linguistics (?) due to its usage of context.

• Visual Entailment is an inference task for predicting whether the image semantically entails the text. The task has been proposed as a natural extension of natural language inference (?, ?), with the premise given as visual content instead of text.

• The goal in Multimodal Machine Translation is to achieve translation from source language(s) to target language(s) by leveraging the visual information as an auxiliary modality along with the natural language text in the source language(s). It is influenced by the well-known NLP task of machine translation, which aims to automatically translate textual content between any two natural languages (?, ?).

Extension of CV Tasks

• Visual Generation deals with the generation of visual content by conditioning on input text from a chosen natural language. It can be perceived as a multimodal extension of the popular computer vision tasks of image-to-image translation (?) and neural style transfer (?).

• The task of Visual Reasoning is a direct extension of visual perception where standard computer vision tasks such as image classification (?), object detection (?), or semantic segmentation (?) are performed. Instead of providing only class labels (in case of classification), bounding boxes (in case of detection), or segments (in case of segmentation), visual reasoning is expected to output a relationship between detected objects by generating an entire visual scene graph. Furthermore, the scene graph is leveraged to reason and answer questions about visual content. It can also be used to reason about whether a natural language statement is true or not regarding a visual input (?).
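The idea of a scene graph used for question answering and statement verification can be sketched with a simple triple store. The triples, the helper names `objects_related_to` and `statement_holds`, and the exact-match lookup are all hypothetical simplifications; actual visual reasoning models predict the graph from pixels and reason over it with learned components.

```python
# Hypothetical scene graph for one image: (subject, relation, object) triples.
scene_graph = [
    ("man", "riding", "horse"),
    ("horse", "standing on", "grass"),
    ("man", "wearing", "hat"),
]


def objects_related_to(graph, subject, relation):
    """Answer 'what is <subject> <relation>?' by matching triples."""
    return [o for s, r, o in graph if s == subject and r == relation]


def statement_holds(graph, triple):
    """Check whether a (subject, relation, object) statement is in the graph."""
    return triple in graph


answer = objects_related_to(scene_graph, "man", "riding")
is_true = statement_holds(scene_graph, ("man", "wearing", "hat"))
is_false = statement_holds(scene_graph, ("horse", "riding", "man"))
```

The first helper mirrors question answering over the graph, while the second mirrors verifying a natural language statement once it has been parsed into a triple.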

Extension of both NLP and CV Tasks

• Vision-and-Language Navigation is one task that can be seen as a transition from standard vision-based navigation using only visual input (?, ?) or natural language instruction based navigation (?, ?). The expectation here is that the natural language navigation instruction should be interpreted based on visual input. Hence, it combines both vision and language.

Representation.

In earlier sections, we discussed different architectures used to represent both vision and language separately. Combining representations of language and vision is essential to address vision and language integration tasks in an effective manner. Various models have been proposed for each task to build representations that integrate vision and language. We discuss these in greater detail in forthcoming sections where each of the tasks is introduced.

2.4 Summary

In this background section, we have reviewed a variety of tasks that integrate vision and NLP. Additionally, we explored diverse methods that are used for the representation of vision and language modalities. Furthermore, we noted the training procedures used by different methods that rely on supervised learning. For example, models built using those methods leverage first-order optimization algorithms such as Stochastic Gradient Descent (SGD) (?), ADAM (?) or RMSProp (?). Some methods also utilize Reinforcement Learning (RL) (?), in contrast with using only supervised learning.
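For readers unfamiliar with first-order optimization, the sketch below shows plain SGD on a one-parameter least-squares problem. The toy data, learning rate, epoch count, and the function name `sgd` are assumptions chosen for illustration; ADAM and RMSProp differ by rescaling the step with running gradient statistics, which is not shown here.

```python
import random

# Hypothetical toy data generated from y = 2 * x; the model is y_hat = w * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def sgd(data, lr=0.05, epochs=20, seed=0):
    """Fit w by stochastic gradient descent on the squared error (w*x - y)**2."""
    rng = random.Random(seed)
    w = 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the samples in a random order each epoch
        for x, y in samples:
            grad = 2.0 * x * (w * x - y)  # derivative of (w*x - y)**2 w.r.t. w
            w -= lr * grad  # first-order update: w <- w - lr * grad
    return w


w = sgd(data)  # converges towards the true slope of 2
```

Each update uses the gradient of a single sample rather than of the full dataset, which is what distinguishes the stochastic variant from batch gradient descent.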

We will see that many of the models developed for these tasks use similar architectures for the representation of vision and language modalities and depend on standard gradient-based optimization algorithms for training. This shows that, although the aims of each task are different, the underlying principles to extract meaning from unstructured data remain the same.

3 Visual Description Generation and Storytelling

In this section, we explore two different tasks, Visual Description Generation and Visual Storytelling. Although the goals of these tasks do not perfectly line up, they share the common intention of generating a textual description when conditioned on visual input. In the following, we present more details about each of these tasks separately.

3.1 Visual Description Generation

The aim in description generation is to generate either a global description or dense captions for a given visual input. Depending on the type of visual input, i.e., either an image or a video, there are various ways to explore the problem.

3.1.1 Image Description Generation - Introduction

There are many subareas of image description generation where the underlying goal of generating global or dense descriptions remains the same, but the way those descriptions appear is different. In the following section, we explore some of the popular categories observed in image description generation.

Standard Image Description Generation.

The goal in standard image description generation is to generate a sentence-level description of the scene in a given image. Here, methods leverage only the vocabulary of the dataset to generate the best description that depicts the scene in the image. Figure 2 provides a conceptual representation of the task.

Initially, several methods were developed based on templates, n-grams, and dependency parsing (?, ?, ?, ?, ?, ?, ?). Recently, however, image description generation models based on the encoder-decoder, a.k.a. Sequence-to-Sequence (Seq2Seq), frameworks (?, ?) have become popular. Moreover, these frameworks have been extended with attention mechanisms (?) to support the selection of local image features that are useful for the generation of words at each time step (?). Table 1 summarizes different setups for generating image descriptions using neural network based non-attention, attention, and reinforcement learning techniques. Other variations include cross-lingual image captioning (?) and multi-language image description generation (?).
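The role of attention in selecting local image features at a decoding step can be sketched as follows. This is a minimal soft-attention example that assumes dot-product scoring between a decoder state and region features; the cited captioning models may use other scoring functions (for instance a small learned network), and the names `attend`, `regions`, and `decoder_state` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, region_features):
    """Dot-product soft attention: weight each region, then average them."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical 2-D features for three image regions and one decoder state.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, regions)
```

The resulting `context` vector is a weighted average of the region features and would be fed to the decoder when it predicts the next word; regions that score higher against the current decoder state contribute more.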

In the following, we explore some of the related ideas that expand the scope of image description generation.

Dense Image Description Generation.

The dense image description generation task aims to create descriptions at the local object level in a given image. The outputs are referred to as dense captions since the commonly used image datasets have images containing multiple objects. Several approaches (?, ?, ?, ?) have been proposed to generate dense captions for images. Usually, they use representations of phrases and their relationships to generate descriptions (?).

Image Paragraph Generation.

The aim in image paragraph generation is to create paragraphs instead of generating a single simple description, or dense descriptions for an image. Generated paragraphs are expected to be coherent and contain fine-grained natural language descriptions (?, ?, ?).

Spoken Language Image Description Generation.

Spoken language image description generation expands the description generation task to work with spoken language, instead of limiting it to only the written forms of language. Investigations such as visually grounded speech signals (?) address the standard image description generation task from the perspective of spoken language.

Stylistic Image Description Generation.

Stylistic image description generation adds styles to standard image description generation, so that the generated descriptions adhere to a specific style. For example, ? (?) generated captions which capture the sentiments of an image, while ? (?) attempted to generate humorous and romantic captions. In addition, this task has been extended by leveraging unpaired textual corpora (?) to generate story-like captions. Furthermore, to make the generated captions more human-like, personality traits have been used to generate captions (?). Recently, multi-style image description generation (?) has been explored, in which a single model using unpaired data is built to generate different stylized captions.

Unseen Objects Image Description Generation.

Unseen objects image description generation leverages images which lack paired descriptions. Most of the paired image-description datasets have few visual objects to represent. Hence, methods such as Deep Compositional Captioning (DCC) (?), Novel Object Captioner (NOC) (?), Constrained Beam Search (CBS) (?), and LSTM-C (?) address the challenge of generating descriptions for these images. They generate descriptions for visual object categories that are previously unseen in image-description corpora, either by transferring information between seen and unseen objects before inference (i.e., before test time), or by keeping constraints on the generation of description words during inference (i.e., during test time). A few approaches (?, ?) have transferred information both before and during inference. Recently, pointing LSTM was designed to point to the novel objects (?) by balancing generation and copying of words. Nevertheless, earlier approaches work only with a limited set of objects. To address this issue, a large-scale nocaps dataset (?) was created.

Diverse Image Description Generation.

The diverse image description generation task aims to incorporate variety and diversity in the generated captions. A few approaches (?, ?) have leveraged adversarial training, while ? (?) used diverse beam search to decode diverse image captions in English. Approaches have also been proposed to describe cross-domain images (?).

Controllable Image Description Generation.

The controllable image description generation task focuses on selecting specific objects in an image, defined by a control signal, to generate descriptions. Initially, ? (?) generated layouts from images, while ? (?) counted image objects to produce multiple captions for a given image. Additionally, a control signal has been used to make the image captioning process more controllable, and also to generate diverse captions. ? (?) used either a sequence or a set of image regions as this signal. Also, chunks of the generated sentences were explicitly grounded on regions. Moreover, instead of only making captions diverse, there were also attempts to make the generated descriptions more accurate (?).

Image Caption Emendation as Generation.

The caption emendation task is a variant of caption generation where the aim is to build a model to emend (a.k.a. edit or correct) both syntactic and semantic errors in captions. There has been a lot of interest in this emerging research topic in recent years. ? (?) proposed the Show, Tell, and Polish framework to better mimic how humans construct sentences, that is, by coming up with a first version and then polishing it until it feels right. The core idea of this architecture is to perform two-pass decoding instead of the typical single-pass decoding. Thus, the model contains two decoder modules, viz., a base decoder and a ruminant decoder, whereby the base decoder generates a first version of the caption, which is then fed into the ruminant decoder for refinement (a.k.a. polishing). Along the same lines, ? (?) introduced fusion models for caption emendation, a generic framework containing a standard encoder-decoder image captioning model, a pretrained auxiliary language model (AuxLM - BERT MLM), and a fusion module that fuses the language-only representations of the AuxLM with the visual-linguistic representations of the decoder using different fusion techniques. The intuition behind introducing an external language model trained on large-scale language corpora is to capture world knowledge and rich linguistic features, which are both scarce in annotated captions data, in an attempt to generate fluent and accurate descriptions. In both of the above approaches, emendation is achieved by generating a caption while utilizing the baseline caption as a reference. That is, the model is trained to correct any errors and incongruities in the baseline caption. Likewise, ? (?) proposed the Show, Edit, and Tell framework, an iterative adaptive refinement approach that utilizes attention LSTMs and denoising autoencoders for correcting captions.

3.1.2 Image Description Generation - Datasets

A wide range of datasets is available for conducting research in the integration of vision and language. In fact, they are one of the main driving forces behind the recent accelerated advancements that we are witnessing in this field (?). The visual information associated with textual content in these datasets differs in many aspects, such as size, quality, and the way in which it is collected. In our survey, we summarize the characteristics of these datasets and provide basic statistics about them. However, we do not furnish a deeper analysis of them, as this was already done by ? (?).

An array of diverse datasets, both small- and large-scale, were created and made publicly available in the past decade to address the challenge of image description generation. Some of the early large-scale datasets focus on image captions, while the others are only of small or medium scale. In the following sections, we cover only those datasets that are extensively used in the image captioning literature.

SBU Captioned Photo Dataset (SBU1M).

SBU1M (http://vision.cs.stonybrook.edu/~vicente/sbucaptions) (?) is an automatically collected image description dataset that uses query terms to retrieve images and associated text from Flickr (https://www.flickr.com). This web-scale dataset is distributed as a single plain text file containing 1 million URLs of Flickr images and their corresponding captions. Although it is one of the older datasets in image description research, it has rarely been used in recent years. Table 2 provides basic statistics about this dataset.

Flickr8k.

As with SBU1M, images in the Flickr8k (http://hockenmaier.cs.illinois.edu/8k-pictures.html) (?) dataset are also retrieved from Flickr. However, unlike the automated collection of SBU1M, the images in Flickr8k are selected through user queries for specific objects and actions using the Amazon Mechanical Turk (AMT) platform. The images are then captioned by annotators on AMT such that each image has five independently created captions. Table 3 presents the so-called *karpathy split* (https://cs.stanford.edu/people/karpathy/deepimagesent) of the dataset.

Flickr30k.

Flickr30k (http://hockenmaier.cs.illinois.edu/Denotation.html) (?) is an extended version of the previously published Flickr8k dataset, containing images collected from Flickr and captions obtained via crowdsourcing using the AMT platform, following the same strategies employed in Flickr8k. Table 4 presents the previously mentioned karpathy split of the dataset.

Flickr30k-Entities.

Flickr30k-Entities (http://bryanplummer.com/Flickr30kEntities) (?) extends Flickr30k with manually annotated bounding boxes for images and entity mentions in the captions in order to accomplish the task of language grounding in images, viz. phrase localization, while performing captioning. Specifically, there are 275,775 bounding boxes for the images of Flickr30k and 513,644 entity mentions in the 158k captions of Flickr30k. One peculiarity of this dataset is that it comes with 244k co-reference chains, in which each chain is a link between the mentions of the same entities across the five different captions of a given image. Some statistics and the karpathy split of this dataset are presented in Table 5.

MSCOCO.

MSCOCO (?) is a widely used and considerably larger dataset than the image captioning datasets discussed so far. It contains natural images that are collected from Flickr. The AMT platform is then used to curate and collect descriptions for the images. This dataset does not have an official split, hence the karpathy split from the above datasets is commonly used in the vision and language research community. The statistics and splits of the dataset can be found in Table 6.

MSCOCO-Entities.

MSCOCO-Entities (https://github.com/aimagelab/show-control-and-tell) (?) is a recently introduced dataset based on the original MSCOCO (?) dataset, with the goal of addressing the twin challenges of grounding and controllability in generated image captions. Unlike Flickr30k-Entities, the grounding annotations in this dataset are obtained in a semi-automated way. Table 7 presents some statistics about the dataset as well as its split.

STAIR Captions.

STAIR Captions (http://captions.stair.center) (?) is a large-scale Japanese image captioning dataset that provides Japanese language descriptions for the 164,062 images of MSCOCO, while retaining the same dataset splits, viz. the karpathy split, as with MSCOCO (see Table 6). The captions are annotated manually using crowdsourcing. Original statistics from the authors of the dataset are provided in Table 8.

Multi30k-CLID.

The Multi30k-CLID (https://www.statmt.org/wmt16/multimodal-task.html) (?) dataset was designed for the task of Cross-Lingual Image Description (CLID) generation with the ultimate goal of pushing existing vision and language research towards multilingual multimodal language processing. In the first edition of the task in 2016, the Flickr30k-Entities dataset (?) was extended to the German language by crowdsourcing the descriptions independently from their English language counterparts with the help of professional translators. As with the original Flickr30k, each image comes with five descriptions in German. Hence, the English-German pairs are considered as comparable, though not parallel, corpora. The splits of this dataset for English and German languages can be found in Table 9.

In the second version (https://www.statmt.org/wmt17/multimodal-task.html) of the task in 2017, the Flickr30k-Entities dataset was further extended to support French language captions (?). The annotations were again obtained via crowdsourcing following the same principles as with the previous version. Table 10 presents the number of instances in each language and the splits of the dataset.

Similar to the earlier editions of the task, in the 2018 version (http://www.statmt.org/wmt18/multimodal-task.html), Czech language translations of the captions were added (?). Following the same strategy as the prior versions of this dataset for obtaining annotations, human translators were employed to produce Czech translations for the captions of Flickr30k-Entities. Table 11 presents splits and statistics of all four languages of the dataset.

Conceptual Captions (CC).

Conceptual Captions (https://ai.google.com/research/ConceptualCaptions/download) (?) is a recently introduced web-scale dataset containing more than 3.3M images paired with English language captions. The dataset was harvested from the web in an automatic manner in which the captions were extracted from the alt text of retrieved HTML webpages. As a consequence, contrary to other curated image captioning datasets in which each image is paired with five captions, the images in CC have only one description, a fact that is evident in Table 12, which also presents the dataset splits.

Although it is a large-scale dataset with a wide variety and style in captions, continued availability of the dataset for downloading by future users is a major issue, primarily due to the fact that the dataset has been distributed as a CSV file containing URLs of images. Thus, it inherently suffers from the problem of URLs becoming stale (for instance due to contents being removed, unresponsive HTTP requests, etc.), which puts the dataset at a disadvantage.
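
The staleness problem can be made concrete with a short sketch. The snippet below is only an illustration under stated assumptions, not the dataset's official tooling: it assumes a two-column, tab-delimited file of (caption, image URL) rows, and it takes the liveness check as a caller-supplied function. The names `read_pairs`, `filter_live`, and `is_alive`, as well as the column order and delimiter, are hypothetical.

```python
import csv
import io
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (caption, image URL); this column order is an assumption


def read_pairs(text: str, delimiter: str = "\t") -> List[Pair]:
    """Parse delimited text into (caption, url) pairs, skipping short rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def filter_live(pairs: List[Pair],
                is_alive: Callable[[str], bool]) -> Tuple[List[Pair], int]:
    """Keep pairs whose URL passes the check; also return the stale count."""
    live = [(caption, url) for caption, url in pairs if is_alive(url)]
    return live, len(pairs) - len(live)


# Toy data with a fake liveness check, so the sketch runs without network access.
sample = "a dog on grass\thttp://example.org/1.jpg\na cat indoors\thttp://example.org/2.jpg\n"
pairs = read_pairs(sample)
live, stale = filter_live(pairs, lambda url: url.endswith("1.jpg"))
```

In practice, `is_alive` would typically wrap an HTTP request with a timeout. Reporting the stale count alongside the surviving pairs makes it explicit how far a local copy has drifted from the published dataset size.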

Personality Captions (PC).

Personality Captions (https://parl.ai/projects/personality_captions) (?) is a large-scale image caption dataset that comes with so-called personality traits that are useful for controllable and style-based image captioning. Thus, the samples in the PC dataset are provided as triplets (image, personality trait, caption). Basic statistics such as vocabulary size, as well as the dataset splits, are provided in Table 13.

3.1.3 Image Description Generation - Evaluation Measures, Models, and Results

In this section, we describe only the evaluation measures which are used for the task of Image Description Generation, as Models, Results, and some Discussion have been broadly presented in recent surveys (?).

Evaluation Measures.

We divide the evaluation measures into three different categories. The first set of measures is “Language Metrics”, the second category is about “Retrieval Metrics”, and the third category denotes “Human Evaluation”.

“Language Metrics” evaluate machine-generated text based on reference text by computing similarity scores using simple n-gram statistics and word overlaps.

•

Bilingual Evaluation Understudy (BLEU) (?) was originally developed for machine translation to compare machine generated output with human Ground Truth (GT). BLEU calculates the overlap between predicted unigrams (BLEU-1 (B-1)), or, more generally, n-grams (BLEU-2 (B-2) with bigrams, BLEU-3 (B-3) with trigrams, BLEU-4 (B-4) with quadrigrams, and so on) from the set of candidate and reference sentences. To achieve a high BLEU score, generated descriptions should match the human GT words as well as their order. The maximum achievable BLEU score is 1.0 (or sometimes, equivalently, 100), which is obtained when an exact match occurs between the generated and reference sentences.

•

Metric for Evaluation of Translation with Explicit Ordering, popularly known as METEOR (?), overcomes some issues of BLEU, such as the need for exact word matching. Instead of literal token matching, METEOR performs semantic matching by leveraging WordNet to match words at various levels, using synonymy and paraphrase matching. The METEOR score is then computed using the alignment between the machine generated output and the corresponding reference sentences. To be more specific, the set of unigrams from the generated and reference sentences is first used to perform an alignment. If multiple alignments between the generated and reference sentences are possible, the alignment with the fewest crossings between unigram mappings is preferred. After finalizing the alignment process, the METEOR score is calculated.

•

Recall Oriented Understudy for Gisting Evaluation (ROUGE) (?) was designed to evaluate textual summaries. As opposed to BLEU, which concentrates on n-gram precision, ROUGE instead calculates the recall score of the generated sentences corresponding to the reference sentences. The most prominent ROUGE variant used is ROUGE-L, which is based on the longest common subsequence. Other variants include ROUGE-W (Weighted Longest Common Sub-sequence) and ROUGE-S (Skip-Bigram Co-Occurrences Statistics). One advantage of ROUGE-L over BLEU and METEOR is that it checks for subsequences within a sentence. Moreover, specifying the n-gram length (as required in BLEU) is not necessary as it is automatically incorporated.

•

Consensus-based Image Description Evaluation (CIDEr) (?) evaluates the consensus between a generated sentence and a set of reference sentences by performing different language pruning techniques, such as stemming and building a set of n-grams. N-grams that are common among the reference sentences of all visual data are given lower weight, as they are less informative about the visual content, and biased towards the textual content of the sentences. The weight for each n-gram is computed using Term Frequency (TF) - Inverse Document Frequency (IDF) (TF-IDF), where TF puts higher weight on frequently occurring n-grams in the reference sentence of the visual content, whereas IDF puts lower weight on commonly appearing n-grams across the whole dataset. To remove the mismatch between human evaluation and CIDEr scores, a variant of CIDEr, CIDEr-D, is used. It adds small variations, such as not performing stemming and ensuring that the words with high confidence are not repeated in a sentence by introducing a Gaussian penalty over length differences between the generated and reference sentences. As in the case of vanilla CIDEr, it produces high scores even if the sentences do not make sense.

•

Semantic Propositional Image Captioning Evaluation (SPICE) (?) measures the similarity between the scene graph tuples parsed from generated sentences and human created GT sentences. The scene graph encodes objects and their relationships through dependency parsing. Hence, it makes SPICE heavily dependent on parsing, which can be prone to errors. Similar to METEOR, SPICE uses WordNet to find and treat synonyms as positive matches when computing the F1 score between the tuples of generated sentences and the ground truth.
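
To make the n-gram statistics behind these language metrics concrete, the following minimal sketch computes the clipped n-gram precision that BLEU is built on and an LCS-based recall in the spirit of ROUGE-L. It is a simplification under stated assumptions: captions are taken to be pre-tokenized lists of words, full BLEU additionally combines precisions across n-gram orders and applies a brevity penalty (both omitted here), and ROUGE-L is reduced to its recall component. All function names are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: Sequence[str],
                      references: List[Sequence[str]], n: int) -> float:
    """BLEU-style precision: each candidate n-gram count is clipped to the
    maximum count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of the reference covered by the longest common subsequence."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)


# A degenerate candidate that repeats one word is penalized by clipping.
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
p1 = clipped_precision(["the"] * 7, refs, 1)  # 2 of 7 unigrams survive clipping
r_l = rouge_l_recall("a dog runs on grass".split(),
                     "a dog is running on the grass".split())  # LCS of 4 over 7
```

The repeated-word candidate shows why clipping is needed: without it, every "the" would count as a match. The LCS example shows that ROUGE-L rewards in-order matches without requiring them to be contiguous.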

“Retrieval Metrics” evaluate the machine generated text based on standard information retrieval measures (?) and are presented in the following paragraphs.

•

Recall@k (R@k) measures the number of relevant ground truth sentences retrieved among the Top-k (e.g., Top-1, Top-5, etc.) candidates. A higher R@k indicates better performance.

•

Median Rank (MedRank) finds the median rank value of the retrieved ground truth. A lower MedRank value indicates better performance.

•

Mean Reciprocal Rank (MRR) is a binary measure, where the rank of the highest ranking relevant document for a query is used to calculate the reciprocal rank averaged over all queries. A higher MRR indicates better performance.

•

Mean Rank (Mean) refers to the mean rank achieved in retrieving the relevant sentence. A lower Mean value is better.

•

Normalized Discounted Cumulative Gain (NDCG) is a variant of Discounted Cumulative Gain (DCG) (?). NDCG is a cumulative, multilevel measure of ranking quality that is usually truncated at a particular rank level.
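
The retrieval metrics above can be sketched in a few lines. The snippet below assumes the common setup in which each query has a single ground truth item and `ranks` holds its 1-based position in the returned list. For NDCG, it assumes graded relevance gains listed in ranked order and a base-2 logarithmic discount, which is one widely used formulation rather than the only one. The function names are illustrative.

```python
import math
from statistics import mean, median
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose ground truth appears within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def median_rank(ranks: Sequence[int]) -> float:
    """Median position of the ground truth (lower is better)."""
    return median(ranks)


def mean_rank(ranks: Sequence[int]) -> float:
    """Average position of the ground truth (lower is better)."""
    return mean(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries (higher is better)."""
    return mean(1.0 / r for r in ranks)


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain, truncated at rank k."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg(gains: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Four toy queries; the ground truth was returned at these positions.
ranks = [1, 3, 2, 10]
r1, r5 = recall_at_k(ranks, 1), recall_at_k(ranks, 5)
mrr = mean_reciprocal_rank(ranks)
```

Working from ranks rather than raw score matrices keeps the sketch small; in a real evaluation, the ranks would first be derived from the similarity scores between each query and all candidates.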

“Human Evaluation” employs crowd-workers to evaluate the quality of the generated content and is described in the following paragraph.

•

Human Evaluation. The metrics discussed earlier provide only quantitative measures for evaluating different tasks. However, because machine-generated textual or visual data often does not correlate highly with the human-provided GT, most tasks require human evaluation to judge the quality of the generated content. Therefore, depending on the task, various kinds of instructions are given to the humans who act as evaluators in the evaluation study. In most tasks, we are interested only in the relevance of the output to the input.

3.1.4 Video Description Generation - Introduction

Going beyond images, the goal in video captioning is to comprehend the spatio-temporal information in a video for the purpose of generating either a single or multiple textual descriptions. As with image description generation (Section 3.1.1), we explore some of the popular types and categories of video description generation tasks in the following.

Global Video Description Generation.

Global video description generation approaches (?, ?) initially started by grounding sentences that describe actions in the visual information extracted from videos. It was further expanded into generating global natural language descriptions for videos with various approaches, for example, leveraging latent topics (?), corpora knowledge (?), graphical models (?), and sequence-to-sequence learning (?, ?, ?, ?, ?, ?, ?). Figure 3 depicts the description generation task for a complete video.

The aforementioned approaches leverage only those training datasets with a limited set of visual objects. However, the recognition and description of entities and activities in real-world videos is more difficult. Nevertheless, generating natural language descriptions for such videos has been addressed with a factor graph that combines visual detection with language statistics (?).

Additionally, sequence-to-sequence (seq2seq) based approaches have been improved with external corpora (?) and also using attention with various techniques such as soft-attention (?), multimodal fusion (?), temporal attention (?), semantic consistency (?), and residual connections (?). Apart from attention-based methods, novel architectures have also been explored, such as incorporation of semantic attributes learned from videos (?), ensemble-based description generator networks (?) and encoder-decoder-reconstructors which leverage both the forward and backward flows, i.e., video-to-description and description-to-video, for video captioning (?). Multi-faceted attention has also been used to select the most salient visual features or semantic attributes, with which an overall sentence is generated (?).

Apart from architecture improvements, different machine learning approaches have also been explored. Video captioning has been tackled in a multi-task learning scenario by sharing knowledge between two related tasks (such as temporal- and context-aware video) combined with an entailment generation task (?). Other approaches have leveraged reinforcement learning, either by providing entailment rewards (?), or to address the description generation for multiple fine-grained actions (?). Further, ? (?) proposed a deep network designed to detect inaccuracies in a sentence, and fix them by replacing the inaccurate word(s) with the help of a Visual Text Correction system. Recently, Zhang et al. (?) introduced an object relational graph (ORG) based encoder, which encapsulates the relations among visual objects to build a richer representation, and a decoder that integrates an external language model to capture abundant linguistic knowledge for efficient video description generation.

In the following, we discuss some related ideas which expand the scope of video description generation.

Dense Video Description Generation.

The aim of dense video description generation is to achieve fine-grained video understanding by addressing two sub-problems: (1) localizing events in a video, and (2) generating captions for these localized events (?, ?). Further, extending earlier research, some approaches (?) have explicitly linked the sentence to a corresponding bounding box in one of the frames of a video by annotating each of the noun phrases observed in the sentence. Incorporating background knowledge for video description generation is also another line of research (?). However, the core challenge, namely the automatic evaluation of video captioning, is still unsolved. It is currently being studied from the perspective of direct assessment with the help of human assessors (?).

Movie Description Generation.

Movie description generation approaches the video description generation task from a different perspective, in which movie clips are used as inputs. Initially, aligning books to movies (?, ?) was used to generate story-like explanations. Later, movie descriptions (?) were created directly by transcribing audio descriptions, which concentrate on precisely describing what is shown in the movie scenes.

3.1.5 Video Description Generation - Datasets

Similar to the image description generation task, several datasets have been created to address the task of video description generation. In the following, we cover those datasets that are popular and extensively used. For the sake of brevity, we denote hours $\rightarrow$ h, minutes $\rightarrow$ m, and seconds $\rightarrow$ s.

Microsoft Video Description (MSVD).

MSVD (https://www.cs.utexas.edu/users/ml/clamp/videoDescription) (?) is an open domain dataset collected from YouTube clips and annotated using AMT. The dataset is multilingual and contains human generated descriptions in languages such as German, English, Chinese, etc. On average, there are forty-one single sentence descriptions per clip. More statistics about the dataset are presented in Table 14, whereas Table 15 presents its split.

MPII Cooking Activities.

The MPII Cooking (https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/human-activity-recognition/mpii-cooking-activities-dataset) (?) dataset consists of 65 different cooking activities such as “wash hands”, “put in bowl”, etc., performed while participants prepare one of 14 dishes such as fruit salad, casserole, etc. The dish preparation time ranges between 3 and 41 minutes. The videos are recorded in high resolution (1624x1224), following which the activity annotations are manually created by 6 people. Table 16 presents more statistics about the dataset, while its splits can be found in Table 17.

YouCook.

YouCook (http://web.eecs.umich.edu/~jjcorso/r/youcook) (?) is a more complex real-world cooking dataset than MPII Cooking, with the complexity arising from dynamic scene and camera changes. The videos are all downloaded from YouTube and are broadly categorized into 6 different cooking styles, viz. baking, grilling, etc. Video descriptions are obtained via crowdsourcing using AMT. On average, eight descriptions are collected per video. Frames are annotated with objects belonging to categories (such as bowls, utensils, etc.) and actions. More details and splits of the dataset can be found in Table 18 and Table 19, respectively.

YouCook II.

Similar to the YouCook dataset, YouCook II (http://youcook2.eecs.umich.edu) (?) also consists of instructional cooking videos that are all collected from YouTube. The videos include 89 cooking recipes from four regions: South Asia, East Asia, Europe/Middle East, and America. One unique aspect of this dataset when compared to the previously discussed video description datasets is that the videos are annotated with procedure segments that contain rich semantic information. Table 20 presents the statistics about the dataset.

For each recipe, the videos are randomly split into training, validation, and testing in ratios of 67%, 23%, and 10% respectively. The actual numbers are presented in Table 21.
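
A per-recipe random split of this kind can be sketched as follows. This is an illustration only, not the procedure used by the dataset authors: the input format (a list of `(video_id, recipe)` pairs), the rounding rule, the fixed seed, and the name `split_per_recipe` are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def split_per_recipe(videos: Sequence[Tuple[str, str]],
                     ratios: Tuple[float, float, float] = (0.67, 0.23, 0.10),
                     seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle the videos of each recipe and cut them into train/val/test,
    so that every recipe is represented in each split where possible."""
    by_recipe: Dict[str, List[str]] = defaultdict(list)
    for video_id, recipe in videos:
        by_recipe[recipe].append(video_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for recipe in sorted(by_recipe):
        ids = sorted(by_recipe[recipe])
        rng.shuffle(ids)
        n_train = round(ratios[0] * len(ids))
        n_val = round(ratios[1] * len(ids))
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits


# Toy input: 2 hypothetical recipes with 100 videos each.
videos = [(f"recipe{r}_video{i}", f"recipe{r}") for r in range(2) for i in range(100)]
splits = split_per_recipe(videos)
```

Splitting within each recipe, rather than over the pooled videos, keeps the recipe distribution comparable across the three subsets.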

Textually Annotated Cooking Scenes (TACoS).

The TACoS (https://www.coli.uni-saarland.de/projects/smile/page.php?id=tacos) (?) dataset is an extended version of a subset of MPII Composites (?) which contains cooking videos that are each annotated with multiple textual descriptions. It contains only those videos that include activities such as manipulation of cooking ingredients. Around 26 cooking activities are collected with 127 videos. More statistics on the dataset are presented in Table 22 and Table 23. For building and evaluating models, the dataset is split into 50% for training, 25% for validation, and 25% for testing.

TACoS-MultiLevel.

The above discussed TACoS dataset was extended into TACoS-MultiLevel (https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus) (?) by collecting three levels of descriptions constituting (i) 15 detailed descriptions per video, (ii) 3-5 short descriptions, and (iii) a single sentence description, using the AMT platform. Overall, the dataset comes with 2,600 triplets of descriptions. Further statistics on the dataset can be found in Table 24.

MPII Movie Description (MPII-MD).

The MPII-MD (https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/mpii-movie-description-dataset) (?) dataset contains clips extracted from Hollywood movies and their transcribed audio descriptions. In addition, each clip is paired with a single sentence that is extracted from the script of the movie. Furthermore, transcribed audio is associated with spoken sentences by using timestamps. Misalignment between the audio and visual content is handled by leveraging manual annotation. Additional statistics on the dataset are presented in Table 25.

For the task of video description, the MPII-MD dataset is split as follows: 11 movies with associated scripts and audio descriptions (in total 22 alignments, 2 per movie) are used as validation (8) and test sets (14). The remaining 83 movies are used for training purposes.

Montreal Video Annotation Dataset (M-VAD).

M-VAD (https://mila.quebec/en/publications-archive/public-datasets/m-vad/) (?) is a large Descriptive Video Service (DVS)-derived video dataset that is created using 92 movies, covering a wide variety of genres. It is collected in a semi-automatic manner with minimal human intervention. The words in the descriptions are annotated with Part-Of-Speech (POS) tags using the Stanford POS tagger. Around 500 proper names are removed from the corpus, since learning proper names is not interesting for a video description model.

Table 26 presents some statistics about the dataset, while Table 27 presents the official dataset split that balances the genre within each split.

MSR Video to Text (MSR-VTT).

MSR-VTT (http://ms-multimedia-challenge.com/2017/dataset) (?), also known as MSR-VTT-10k, is a large-scale video dataset containing automatically crawled videos belonging to 20 categories for the task of video description generation. The sentence annotations are obtained via crowdsourcing using AMT. In addition to the video content, the dataset also contains audio information. Table 28 presents more statistics about the dataset.

Out of 7.2k videos, 30k video clips have been created. However, only a random subset of 10k clips has been released. The dataset is split in the ratio of 65%:30%:5% for training, testing, and validation. Specific numbers are presented in Table 29.

Video Titles in the Wild (VTW).

VTW (http://aliensunmin.github.io/project/video-language/index.html#VTW) (?) is a large-scale dataset of automatically crawled user-generated YouTube videos paired with titles and descriptions. The video clips are on average 90 seconds in duration and are described with one sentence per clip to enable video title generation. It also comes with augmented sentences that contain information that may not be present in the video clip. More statistics of the dataset can be found in Table 30.

Similar to M-VAD, the dataset is randomly split into 80% for training and 10% each for validation and testing. Specific numbers are presented in Table 31.

ActivityNet Captions (ANetCap).

ANetCap (http://activity-net.org/challenges/2017/captioning.html) (?) is a large-scale video dataset (https://cs.stanford.edu/people/ranjaykrishna/densevid) that extends a subset of videos from ActivityNet with dense descriptions. There are multiple descriptions for every video and the videos contain multiple events occurring at the same time. Another notable aspect of this dataset is that the descriptions focus more on actions happening in videos. As a result, this dataset is more action-centric than object-centric.

Table 32 presents more statistics on the dataset, while Table 33 presents its split.

ActivityNet Entities (ANetEntities).

The ANetEntities (https://github.com/facebookresearch/ActivityNet-Entities) (?) dataset augments ANetCap (?) with manually annotated bounding boxes, and was created for the task of grounding language in videos while generating descriptions. It adds around 158k bounding box annotations on ANetCap, each grounded to a Noun Phrase (NP) in the sentence description. More statistics and the dataset splits can be found in Table 34.

COmprehensive INstructional video analysis (COIN).

COIN (https://coin-dataset.github.io) (?) is a large-scale dataset of instructional YouTube videos from 12 domains such as vehicles, gadgets, sports, etc., that are common in our daily lives. It is aimed at overcoming two limitations of current instructional video datasets, namely diversity and scale. It covers over 180 tasks in 12k videos.

One unique aspect of this dataset is that it introduces a three-level hierarchy, viz. domain, task, and step, for organizing videos. Table 35 shows some statistics of the dataset whereas Table 36 presents training and validation splits of COIN.
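The domain, task, and step hierarchy can be pictured as a nested data structure. The following is a minimal sketch only: the class names, field names, and the example task and step strings are hypothetical and are not taken from the COIN annotation files; only the domain name "vehicles" appears in the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """Second level: a task, described by an ordered list of steps (third level)."""
    name: str
    steps: List[str] = field(default_factory=list)


@dataclass
class Domain:
    """First level: a domain that groups related tasks by name."""
    name: str
    tasks: Dict[str, Task] = field(default_factory=dict)

    def add_task(self, task: Task) -> None:
        self.tasks[task.name] = task


def count_steps(domains: List[Domain]) -> int:
    """Total number of steps across all domains and tasks."""
    return sum(len(task.steps) for d in domains for task in d.tasks.values())


# Hypothetical entries, for illustration only.
vehicles = Domain("vehicles")
vehicles.add_task(
    Task("change a car tire",
         ["loosen the lug nuts", "jack up the car", "swap the wheel"])
)
```

With such a layout, a video would be attached to one task, and each annotated segment of the video to one step of that task.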

HowTo100M.

HowTo100M (https://www.di.ens.fr/willow/research/howto100m) (?) is a large-scale dataset of narrated videos with emphasis on instructional YouTube videos where the video creators teach complex tasks with an explicit intention of explaining the visual content on screen. The dataset includes a wide variety of 23k activities from domains such as gardening, personal care, fitness, hand crafting, cooking, etc. and is three orders of magnitude larger than the previously discussed video description datasets. Table 37 presents more statistics about the dataset.

This dataset has not yet been used for the task of video description generation. Hence, an official dataset split is not available for evaluation purposes.

3.1.6 Video Description Generation - Evaluation Measures, Models, and Results

In this section, we describe only the evaluation measures used for the task of Video Description Generation, since models, results, and some discussion have been covered broadly in recent surveys (?).

Evaluation Measures.

The measures used for Video Description Generation are the same as the Language metrics and Retrieval metrics used in Image Description Generation and are presented in Section 3.1.3.

3.2 Visual Storytelling

The task of visual storytelling aims to encode a sequence of images or frames (in the video) to generate a paragraph which is story-like. This is usually considered more beneficial than generating a paragraph from a single image or video.

3.2.1 Image Storytelling - Introduction

The aim of image storytelling is to generate stories from a sequence of images. Although a sequence of images can be perceived as a video, consecutive images in the streams can have sharp changes of visual content, which can cause an abrupt discontinuity between consecutive sentences (?). Hence, it is seen as a sequential vision-to-language task (?) where images are not considered in isolation. Figure 4 shows a schematic representation of image storytelling where a story in a sequence is generated.

Initially, semantic coherence in a photo stream was captured by reducing the visual variance. Further, the semantic space was acquired by jointly embedding each photo with its corresponding contextual sentence such that their correlations are discovered (?). This was then improved by exploiting a hierarchical architecture (?) and further optimized by incorporating reinforcement learning with rewards (?) for generating relevant and expressive narrative paragraphs. Instead of flat deep reinforcement learning, hierarchically structured reinforced training has also been studied (?) and has been shown to achieve significantly better performance than a flat structure. Similarly, ? (?) used adversarial reward learning to learn an implicit reward function from human demonstrations and then optimized policy search with the learned reward function.

Nevertheless, the standard form of narration suffers from repetitiveness, with the same objects or events serving to undermine a good story structure. Hence, inter-sentence diversity was explored with diverse beam search to generate more expressive stories (?). The task has also been approached from a different perspective, in which, given a jumbled set of aligned image-description pairs that belong to a story, the task is to sort them such that the output sequence forms a coherent story (?).

While earlier research addresses only natural images, some approaches (?) also incorporated medical domain knowledge to generate realistic and accurate descriptions for medical images.

3.2.2 Image Storytelling - Datasets

There are not many datasets created to address the creative task of image storytelling. In the following, we cover all datasets that have been used to advance this artistically interesting and challenging problem.

New York City Storytelling (NYC-Storytelling).

The NYC-Storytelling (https://github.com/cesc-park/CRCN) (?) dataset was created from blogs in which users post their travelogues. The dataset is collected in a semi-automatic manner: automatic crawling followed by manual selection of travelogues and finally preprocessing using the NLTK (https://www.nltk.org) library. For evaluation purposes, the dataset is split in a ratio of 8:1:1 for training, validation, and testing respectively. Table 38 presents minimal statistics of the dataset.

Disneyland Storytelling.

Similar to NYC-Storytelling, Disneyland Storytelling is also based on blogs documenting travelogues but specifically about Disneyland Park. This dataset was originally created by (?) but has been reused for visual storytelling tasks. The same ratio of data splits as with the NYC-Storytelling dataset is used for evaluation purposes. The minimal statistics of the dataset can be found in Table 39.

Sequential Image Narrative Dataset (SIND).

SIND (?) is the first large-scale dataset created for the task of image storytelling. Natural language descriptions of the dataset are divided into three types: (i) Descriptions of Images-in-Isolation (DII), (ii) Descriptions of Images-in-Sequence (DIS), and (iii) Stories for Images-in-Sequence (SIS). The stories are collected via crowdsourcing using AMT. Similar to other image storytelling datasets, this dataset is split into 80%, 10%, and 10% for training, validation, and testing purposes respectively. Table 40 presents the statistics of the dataset.

Visual Storytelling Dataset (VIST).

VIST (http://visionandlanguage.net/VIST) is the second version (v.2) of SIND (see Section 3.2.2) and is aimed at modeling the social language of humans for evolving AI to be more human-like in understanding. Basic statistics of the dataset are shown in Table 41, while its splits can be found in Table 42.

3.2.3 Image Storytelling - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different Image Storytelling models and the results obtained by them.

Evaluation Measures.

To evaluate Image Storytelling models, the Language metrics and Retrieval metrics presented in Section 3.1.3 are used.

Models.

Many models have been created in attempts to solve the Image Storytelling task. In Table 43, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both image and language inputs. We also include a column that showcases the optimization techniques used to train those models.

Results.

In Table 44, Table 45, Table 46, and Table 47 we present the results obtained with a subset of models which use the datasets presented earlier in Section 3.2.2.

3.2.4 Image Storytelling - Discussion

We observe that for Image Storytelling, the adversarial approach, i.e., Adversarial REward Learning (AREL) proposed by ? (?), achieves the best results on both retrieval and language metrics for different datasets. This attests to AREL’s ability to clone expert behaviors while still generating more human-like stories.

3.2.5 Video Storytelling - Introduction

In comparison to image storytelling, which only deals with a small sequence of images, the aim of video storytelling is to generate coherent and succinct stories for long videos. However, video storytelling is less explored. The video storytelling task was pioneered by ? (?) to address challenges such as diversity in the story and the inherent complexity of video. They introduced residual Bidirectional RNNs (BiRNNs) for leveraging context and a narrator model with reinforcement learning. Further, ? (?) created a multi-sentence video description dataset (VideoStory) to resemble stories from social media videos. The goal of social media-specific video description generation was to offer support to people with visual disabilities or other technical issues such as internet bandwidth limitations. Figure 5 illustrates the task of video storytelling where a story in a sequence is generated based on a video as the sole input.

It is worth noting that this task bears close resemblance to the well-researched area of video summarization using only videos (?).

3.2.6 Video Storytelling - Datasets

Similar to image storytelling datasets, currently two different datasets are available to address the task of video storytelling. In the following, we elaborate on these two datasets.

VideoStory.

VideoStory (?) is a multi-sentence description dataset created from social media videos that are selected to be highly diverse and engaging. Table 48 shows more statistics on the dataset.

Models can be evaluated locally on the earmarked test set, whereas the blind test set is reserved for online evaluation purposes. However, the dataset including annotations has not been made public yet. Table 49 presents the actual number of videos, clips, and sentence annotations for each of the splits.

VideoStory-NUS.

The VideoStory-NUS (https://zenodo.org/record/2383739) (?) dataset contains social event videos that were collected from YouTube by querying for common and complex events, namely Birthday, Camping, Christmas, and Wedding. Specifically, it comes with 105 manually chosen videos with sufficient inter-event and intra-event variations which are annotated with descriptive stories obtained through AMT. Each video is annotated by at least 5 different AMT workers, thus resulting in 529 stories in total. More statistics of the dataset can be found in Table 50.

For experimental purposes, the dataset is randomly split in a ratio of 14:3:3 for training, validation, and testing respectively. Actual numbers are presented in Table 51.
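A random split by an integer ratio such as 14:3:3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name, the placeholder video IDs, the seed, and the rounding rule (floor for training and validation, remainder to testing) are all assumptions, so the resulting split sizes need not match the numbers reported in Table 51.

```python
import random
from typing import List, Sequence, Tuple


def split_by_ratio(
    items: Sequence[str],
    ratio: Tuple[int, int, int],
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items and split them into train/val/test by an integer ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    total = sum(ratio)
    n = len(shuffled)
    n_train = n * ratio[0] // total  # assumed rounding rule: floor
    n_val = n * ratio[1] // total

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Placeholder IDs standing in for the 105 videos.
video_ids = [f"video_{i:03d}" for i in range(105)]
train, val, test = split_by_ratio(video_ids, (14, 3, 3))
```

Because 105 is not divisible by 20, some rounding choice is unavoidable; a different choice shifts a video or two between splits.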

3.2.7 Video Storytelling - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different Video Storytelling models and the results obtained by them.

Evaluation Measures.

To evaluate Video Storytelling models, the Language metrics and Retrieval metrics presented in Section 3.1.6 are used.

Models.

There are a number of different models available for the task of Video Storytelling. These models combine representations of video and language in an efficient manner to address the task. In Table 52, we present some exemplar architectures (refer to Combined column) created to accomplish the task by integrating both video and language inputs. To understand the optimization techniques used, we also include a column that showcases the optimization method used to train the models.

Results.

The Video Storytelling results showcase the efficacy of the proposed models. In Table 53 and Table 54 we present results obtained with a subset of models built using the datasets presented earlier in Section 3.2.6.

3.2.8 Video Storytelling - Discussion

For Video Storytelling, different sets of methods are evaluated on the two datasets. In Table 53, we observe that only one method, utilizing the sequence-to-sequence paradigm with contextual information (i.e., seq2seq+context), is evaluated on the “VideoStory” dataset. A separate set of methods is compared on the “VideoStory-NUS” dataset in Table 54. It shows that the approach proposed by ? (?) using Residual BRNN with k-Nearest Neighbours (i.e., ResBRNN-kNN) outperforms most of the baseline methods.

4 Visual Referring Expression Comprehension and Generation

In this section, we explore the task of Visual Referring Expression Comprehension and Generation. The objective of the task is to ground a natural language expression (e.g. a noun phrase or a longer piece of text) to objects in a visual input.

4.1 Image Referring Expression Comprehension and Generation

In the following, we provide a detailed description of the Visual Referring Expression Comprehension and Generation by using an image as the visual input.

4.1.1 Image Referring Expression Comprehension and Generation - Intro

In a natural environment, people use referring expressions to unambiguously identify, indicate, or point to particular objects. This is usually done with a simple phrase or within a larger context (e.g. a sentence). Having a larger context provides better scope for avoiding ambiguity and allows the referential expression to easily map to the target object. However, there can also be other possibilities in which people are asked to describe a target object based on its surrounding objects.

This provides us with two different possibilities for the visual referring expression task. In the first scenario, referring expressions deal with generation, in which an algorithm generates a referring expression for a given target object that is present in a visual scene. In the second scenario, the referring expression is used to perform comprehension, in which an algorithm locates in an image the object described by a given referring expression. Figure 6 shows an example for the task of referring expression comprehension.

Given these tasks, different approaches have been proposed for referring expression generation (?, ?), comprehension (?), and both combined (?, ?). Note that there is a difference between referring expression tasks and grounding of free-form textual phrases (?) in an image.

Image Referring Expression Generation.

An initial approach (?) tackled the problem from the perspective of density estimation, in which the goal was to learn distributions over logical expressions identifying sets of objects in the world. Other research designed a comprehension-guided referring expression generator (?) by using a comprehension module trained on human-generated expressions to generate referring expressions.

Image Referring Expression Comprehension.

? (?) investigated referring expression comprehension to integrate contexts between objects. Later on, techniques such as Multiple Instance Learning (MIL) were used to explore context regions and max-margin based MIL objective functions for training. Further, ? (?) leveraged a natural language query of the object to localize a target object using a Spatial Context Recurrent Convnet (SCRC) model. It operates as a scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information. This explicit modeling of the referent and context region pairs has proven useful. Approaches such as compositional modular networks (?) analyzed referential expressions by identifying entities and relationships mentioned in the input expression and grounding them all in the scene. Such an approach has been shown to effectively inspect local regions and pairwise interactions between them. A modular approach was also explored in which three modular components, related to subject appearance, location, and relationship to other objects, were modeled with a Modular Attention Network (?). It has proven effective at focusing on the subjects and their relationships. Approaches such as GroundNet (?) have leveraged syntactic analysis of the input referring expression to build a dynamic computation graph of neural modules that defines an architecture for performing localization. Variational models have also been used for referential expression comprehension, where variational Bayesian methods called variational context (?) were used to solve the problem of complex context modeling. These methods have proven capable of exploiting the relation between the referent and context, thereby reducing the search space of context. Furthermore, an accumulated attention mechanism (?) has been proposed to accumulate the attention for useful information in image, query, and objects. It has demonstrated the ability to reduce the redundancy and noise issues present in other approaches.

Recently, a Cross-Modal Relationship Extractor (CMRE) and a Gated Graph Convolutional Network (GGCN) were combined into a cross-modal relationship inference network (?). CMRE has been shown to highlight objects and relationships which have connections with a given referring expression, while GGCN computes multimodal semantic contexts by fusing information from different modes and propagating multimodal information through the structured relation graph. Coming from a perspective of natural language understanding, a Recursive Grounding Tree (?) sought to automatically compose a binary tree structure by parsing the referring expression, in order to perform visual reasoning along the tree in a bottom-up fashion. It has been shown to allow gradients from continuous score functions with a discrete tree construction. There has also been interest in combining visual reasoning with referential expressions through the creation of a new dataset (?). Most of the above approaches use bounding box localization, but object segmentation (?) has also been explored for referring expression comprehension.

Image Referring Expression Generation and Comprehension.

Few approaches have performed both generation and comprehension tasks. Visual context (?, ?) was initially used in referring expression models to find visual comparison to other objects within an image. It has shown significant improvements. Further, a unified framework (?) was designed using a speaker, a listener, and a reinforcer. The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions. Feedback from the discriminative reinforcer has proven capable of benefiting the tasks. The role of attributes (?) was also studied to show that they help in disambiguation when referring to a particular object.

4.1.2 Image Referring Expression Comprehension and Generation - Datasets

For the task of image referring expression, both real and synthetic image datasets have been designed. In the following, we present the details of the datasets in separate sections.

Real Images.

In the real and natural images category, the ImageCLEF (https://www.imageclef.org/SIAPRdata) and MSCOCO (see Section 3.1.2) datasets are commonly used for creating referring expression annotations. From a subset of ImageCLEF’s IAPR dataset, referring expressions are collected in a game-based setting, namely ReferItGame (http://tamaraberg.com/referitgame) (?). The resulting dataset is called RefCLEF (https://github.com/lichengunc/refer) and its statistics can be found in Table 55.

The RefCOCO (https://github.com/lichengunc/refer), RefCOCO+ (?), and RefCOCOg (?) datasets were all created using MSCOCO images. For RefCOCO and RefCOCO+, the “People vs. Object” split evaluates images containing multiple people (Test A) and images containing multiple instances of all other objects (Test B). Both RefCOCO and RefCOCO+ were collected in the same interactive setting as above, ReferItGame (http://tamaraberg.com/referitgame) (?). Table 56 presents the statistics of the RefCOCO dataset whereas Table 57 shows the statistics of the RefCOCO+ dataset.

One important distinction between the RefCOCO and RefCOCO+ datasets is that the latter was collected in a comparatively restrictive setting when compared to the former. Specifically, the usage of location words was not permitted in the referring expressions in case of RefCOCO+ whereas there was no such restriction on the language for RefCOCO.

To overcome some of the limitations of RefCLEF, a dataset based on MSCOCO was created. This dataset, known as RefCOCOg (https://github.com/mjhucla/Google_Refexp_toolbox) (?), contains much longer sentences and was collected in a non-interactive setting using AMT, in contrast to the interactive setting used with RefCLEF, RefCOCO, and RefCOCO+. The statistics of this dataset are presented in Table 58.

The referring expression datasets mentioned so far use single sentences for image referring expression. In contrast, the GuessWhat (https://github.com/GuessWhatGame/guesswhat) (?) dataset was created with a cooperative two-player guessing game, the goal of which was to locate an unknown object in an image (collected from MSCOCO) by asking a sequence of questions. Hence, it creates multiple sentences (i.e., a dialog) for a given image in order to perform referring expression. Another notable aspect of this dataset is that only images containing between 3 and 20 objects are chosen from MSCOCO. The dialogue collection was achieved via crowdsourcing using AMT. For evaluation, the dataset is randomly split into 70% for training, 15% for validation, and 15% for testing. Table 59 presents more details about the dataset.

Synthetic Images.

In the synthetic category, the CLEVR-Ref+ (https://cs.jhu.edu/~cxliu/2019/clevr-ref+.html) (?) dataset was introduced to address issues such as bias in datasets with real images, since it has recently been shown that referring expression models suffer from unintended biases (?). CLEVR-Ref+ reuses the images from the CLEVR dataset (see Section 5.2.2), while replacing the questions in CLEVR with referring expressions and answers with referred objects. The main purpose of CLEVR-Ref+ is to diagnose image reasoning with referring expressions by exercising the desired control over the nature of samples. Table 60 presents the splits of the dataset.

4.1.3 Image Referring Expression Comprehension and Generation - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different Image Referring Expression models and the results achieved by them.

Evaluation Measures.

The measure that is usually used for the evaluation of Image Referring Expression models is Precision@1, i.e., precision calculated with the Intersection over Union (IoU) ratio between the true and predicted bounding box.
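To make the measure concrete, the sketch below computes IoU for axis-aligned boxes and counts a prediction as correct when its IoU with the ground-truth box exceeds a threshold. The threshold of 0.5 is an assumption (a common convention, not stated in the text above), and the function names and the `(x_min, y_min, x_max, y_max)` box layout are likewise illustrative.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_1(
    predicted: List[Box], truth: List[Box], threshold: float = 0.5
) -> float:
    """Fraction of expressions whose top-ranked box overlaps the true box.

    The default threshold of 0.5 is an assumed convention.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    hits = sum(1 for p, t in zip(predicted, truth) if iou(p, t) > threshold)
    return hits / len(truth)


# Toy boxes: the first prediction matches exactly, the second misses entirely.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
score = precision_at_1(preds, truths)  # 0.5
```

Only the single top-ranked box per expression is scored, which is what distinguishes Precision@1 from measures computed over a ranked list of candidates.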

Models.

The models designed to approach the task of Image Referring Expression provide an effective way to optimize the Precision@1 measure by identifying the right object in a visual input which matches the textual phrase. In Table 61, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both image and language inputs. We also include a column that showcases the optimization techniques used to train those models.

Results.

Several models and datasets have been created to address the task of Image Referring Expression. These datasets provide variety in the content so that they enhance the generalization ability of the models. In this section, we cover the results obtained by the models on some representative datasets. Table 62 and Table 63 present results obtained with a subset of models built using datasets such as RefCOCO, RefCOCO+, and RefCOCOg presented in Section 4.1.2.

4.1.4 Image Referring Expression Comprehension and Generation - Discussion

For Image Referring Expression, on all MSCOCO based datasets (i.e., RefCOCO, RefCOCO+, and RefCOCOg) the technique proposed by ? (?) outperforms existing baselines. This approach builds a Cross-Modal Relationship Extractor (CMRE) to highlight objects and their relationships. Furthermore, a Gated Graph Convolutional Network (GGCN) is used to compute multimodal semantic contexts by fusing information from different modes and propagating multimodal information. This Cross-Modal Relationship Inference Network (CMRIN), along with ResNet-101 visual features, has been shown to achieve the best results.

4.2 Video Referring Expression Comprehension and Generation

In the following, we describe the setting of Visual Referring Expression Comprehension and Generation task when a video is used as the visual input.

4.2.1 Video Referring Expression Comprehension and Generation - Intro

Compared to image referring expression comprehension and generation, the task of video referring expression comprehension and generation is less explored at the time of publication of this survey. Although there has been a surge of interest in tackling the spatio-temporal contexts and motion features that are inherent to videos, most works thus far have concentrated on only one variant of the task, namely comprehension. ? (?) used stereo videos to exploit richer and more realistic temporal-spatial contextual information along with gaze cues for referring expression comprehension. Figure 7 shows an example of video referring expression comprehension.

Another approach by ? (?) explored Language Referring Expressions to point to the objects in the video to achieve object segmentation. Slightly different from the described task, ? (?) proposed an end-to-end boundary-aware model for video grounding. The model uses a lightweight branch to predict semantic boundaries corresponding to the given linguistic information. It aggregates contextual information by explicitly modeling the relationship between the current element and its neighbours.

4.2.2 Video Referring Expression Comprehension and Generation - Datasets

In this section, we present the datasets used to evaluate the task of Video Referring Expression Comprehension.

Object Referring in videos with Gaze (ORGaze).

For performing Video Referring Expression, the Cityscapes (https://www.cityscapes-dataset.com) dataset containing a diverse set of stereo video sequences recorded in street scenes has been modified to have gaze information. The resulting ORGaze (https://people.ee.ethz.ch/~arunv/ORGaze.html) (?) dataset supports object referring in videos with language and human gaze. More details of the dataset are presented in Table 64.

The authors split the cities in the training set of Cityscapes for training and validation, while using all the cities in the validation set of Cityscapes for testing. More concretely, the validation set is constructed by selecting one city (e.g., Zürich) from the Cityscapes training set, leaving the rest of the cities as part of the training set. The test set consists of the videos from all the cities in the Cityscapes validation set (e.g., Frankfurt, Lindau, Münster). Of the total 30,000 annotated objects, 80% were used for training and the remaining 20% were reserved for model evaluation.
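The city-level hold-out described above can be sketched as a simple assignment of videos to splits based on their recording city. This is only an illustration: the function name, the `(video_id, city)` record layout, the video IDs, and the two training cities in the example are assumptions, not the ORGaze release format.

```python
from typing import Dict, Iterable, List, Set, Tuple


def split_by_city(
    videos: Iterable[Tuple[str, str]],
    val_cities: Set[str],
    test_cities: Set[str],
) -> Dict[str, List[str]]:
    """Assign each (video_id, city) record to train, val, or test by its city."""
    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for video_id, city in videos:
        if city in test_cities:
            splits["test"].append(video_id)
        elif city in val_cities:
            splits["val"].append(video_id)
        else:
            splits["train"].append(video_id)  # all remaining cities
    return splits


# Hypothetical records; IDs and training cities are placeholders.
records = [
    ("vid_0", "Aachen"),
    ("vid_1", "Bremen"),
    ("vid_2", "Zürich"),
    ("vid_3", "Frankfurt"),
    ("vid_4", "Lindau"),
    ("vid_5", "Münster"),
]
splits = split_by_city(
    records,
    val_cities={"Zürich"},
    test_cities={"Frankfurt", "Lindau", "Münster"},
)
```

Splitting by city rather than by individual video keeps footage of the same streets out of both the training and the evaluation data.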

4.2.3 Video Referring Expression Comprehension and Generation - Evaluation Measures, Models, and Results

In this section, we review the evaluation measures used to benchmark different Video Referring Expression Comprehension models and the results achieved by them.

Evaluation Measures.

The measures used for the evaluation of Video Referring Expression Comprehension models are “Top-1 Accuracy” and object proposal accuracy, the latter reported for Language-based Object Proposals (LOP), Faster R-CNN (FRCNN), and EdgeBox (?).

Models.

Many models have been created to solve the task of Video Referring Expression Comprehension. In Table 65, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both video and language. We also include a column that showcases the optimization techniques used to train those models.

Results.

As discussed earlier, several models have been created to approach the task of Video Referring Expression Comprehension. In Table 66 we present results obtained with a subset of models built using the ORGaze dataset presented earlier in Section 4.2.2.

4.2.4 Video Referring Expression Comprehension and Generation - Discussion

The Video Referring Expression Comprehension task is benchmarked using a single dataset. Evaluated using different task-specific metrics, the approach proposed by ? (?), which uses the gaze information, produces the best results.

5 Visual Question Answering, Reasoning, and Entailment

In this section, we explore three different tasks, namely, Visual Question Answering, Visual Reasoning, and Visual Entailment. The goals of these tasks differ. However, they share the common intention of answering questions when conditioned on a visual input. In the following sections, we elaborate on each of these three tasks separately.

5.1 Visual Question Answering

The goal of Visual Question Answering (VQA) is to learn a model that comprehends visual content at both the global and local level for finding an association with pairs of questions and answers in the natural language form. The visual information for VQA includes both images and videos.

5.1.1 Image Question Answering - Introduction

The aim of Image Question Answering (Image Q&A) is to answer natural language questions about the contents of images. Earlier research efforts have focused on designing different algorithms and constructing datasets to address this challenge. The first approaches (?, ?, ?) considered Image Q&A as a Visual Turing Test, where the expectation was to incorporate human-level abilities for semantically accessing the visual information to answer different questions. These were then improved as fill-in-the-blank tasks (?), where the goal of the system was focused on multiple-choice question-answering for images. The task was also expanded to address both multilingual settings (?) and automatic question generation, in which descriptions of sentences are converted into questions (?). However, these formulations lacked the natural language questioning ability of humans. Hence, a broader task was proposed with the aim of addressing open-ended Image Q&A (?, ?), where the challenge was to ask a free-form natural language question about an image and have the system answer it. Figure 8 provides a schematic representation of the task where a free-form question about the contents of an image is asked to obtain an answer.

However, designing such a system involves several other challenges, such as coming up with strong baselines (?). To address these, binary image Q&A (?) was explored by providing complementary images for abstract scenes. These complementary images were used to provide visual verification of concepts contained in the questions. Some of the questions were understood as a loose, global association between Q&A sentences and images. Hence, more confined and dedicated tasks were created for relating local regions in the images (?) by addressing object-level grounding. Some approaches (?) concentrated only on counting objects in natural images. Many methods have been proposed to address the challenging image Q&A task, and their details are already covered in earlier surveys (?, ?). Therefore, we briefly present new methods that were introduced after the publication of these surveys.

Recent works aim at interpretability or explainability by overcoming priors (?), concentrating better on the image to extract relevant information (?), generating human-interpretable rules that provide better insights (?), and cycle-consistency (?), while other works try to understand the text inside an image to answer and reason about it (?). More recent works sought to incorporate outside knowledge (?) in the image Q&A framework to support real-world knowledge-aware question answering (?).

Different kinds of learning approaches are used for image Q&A, such as multi-task learning and federated learning. A multi-task learning approach (?) learns a vision-language representation that is shared by many tasks from their diverse datasets to address image Q&A. In contrast, federated learning is used with the aimNet (?), which is validated in federated learning settings that include both horizontal and vertical federated learning. To address language priors, a modular language attention mechanism is used by ? (?) to parse a question into three phrase representations, namely type representation, object representation, and concept representation. This prevents language priors from dominating the answering process.

5.1.2 Image Question Answering - Datasets

Several datasets were created in the past decade to address the challenge of image question answering. In the following, we cover the datasets that are extensively used for this Human-Computer Interaction (HCI) themed task.

VQA v1.0.

VQA v1.0 (https://visualqa.org) (?) contains open-ended questions about images. These questions target different areas of an image, including background details and the underlying contexts. The answers are also open-ended and contain either a few words or a closed set of answers that can be provided in a multiple-choice format. Table 67 and Table 68 present the dataset splits of images with real and abstract scenes observed in the dataset, respectively.

VQA v2.0.

VQA v2.0 extends VQA v1.0 and has three parts: Balanced Real Images, Balanced Binary Abstract Scenes, and Abstract Scenes. Table 69 and Table 70 present the dataset splits of the images with balanced real and binary abstract scenes observed in the dataset, respectively. However, the abstract scenes in VQA v2.0 are the same as those in VQA v1.0.

The term complementary pairs in Table 69 means that a given question is associated with a pair of similar images such that the answer differs depending on the image (i.e., two different answers).

Outside Knowledge VQA (OK-VQA).

OK-VQA (https://okvqa.allenai.org) (?) uses a subset of MSCOCO (see Section 3.1.2) and is constructed with additional annotations such as questions, answers, knowledge category, etc. Table 71 presents more details about the dataset, while Table 72 shows its splits.

Knowledge-aware VQA (KVQA).

The KVQA (http://malllabiisc.github.io/resources/kvqa) (?) dataset was designed to emphasize questions that require access to external knowledge. Table 73 presents more details about the dataset, while Table 74 shows its splits. The KVQA dataset provides five such splits so that a mean score can be computed.

5.1.3 Image Question Answering - Evaluation Measures, Models, and Results

In this section, we describe only the evaluation measures used for Image Question Answering, as models, results, and discussion are extensively presented in recent surveys (?).

Evaluation Measures

Image Q&A models are evaluated based on the Accuracy measure.
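As an illustration, the following minimal Python sketch computes plain exact-match accuracy over a set of predictions. It also includes the consensus-based variant commonly reported on the VQA benchmark, in which an answer scores min(number of matching human answers / 3, 1). The consensus variant, the lowercase and whitespace normalization, and all function names are assumptions made for illustration; they are not specified in this section.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase and trim an answer string (a simplifying assumption)."""
    return answer.strip().lower()


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their single reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(predictions)


def consensus_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """Consensus score for one question: min(#matching human answers / 3, 1)."""
    matches = sum(normalize(a) == normalize(prediction) for a in human_answers)
    return min(matches / 3.0, 1.0)
```

With ten human answers per question, a prediction that agrees with at least three annotators receives full credit under the consensus variant, while agreement with one or two annotators receives partial credit.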

5.1.4 Video Question Answering - Introduction

The goal of Video Question Answering (Video Q&A) is to answer natural language questions about videos. Compared to Image Q&A, Video Q&A is less explored. Nevertheless, a few works have explored this spatio-temporal domain. One of the early attempts in this domain jointly parsed videos with corresponding text to answer queries (?). Further, an open-ended Movie Q&A (?) with multiple-choice question pairs was designed to solve challenging questions that require semantic reasoning over a long temporal domain. Additionally, to limit the involvement of crowdworkers, the task was modified to use fill-in-the-blank questions (?, ?) that were automatically generated from different manually created video description datasets (Section 3.1.5). Other works (?) modified this dataset to support answering free-form natural language questions. Beyond this, open-ended video question answering has also been addressed with methods such as a spatio-temporal attentional encoder-decoder learning framework (?). There has also been interest in jointly addressing multiple tasks that handle video and language. High-level concept words (?) are detected so that they can be integrated with any video and language model addressing fill-in-the-blank and multiple-choice tests. Spatio-temporal reasoning from videos to answer questions has also been addressed by designing a spatial and temporal attention mechanism (?).

Recently, owing to the large interest in Video Q&A, six popular TV shows were used to create a dataset, similar to Movie Q&A, in which the questions are compositional (?). The TV Q&A dataset requires the proposed multi-stream models to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. Furthermore, spatio-temporal grounding (?) is employed to link depicted objects to visual concepts in questions and answers. Figure 9 shows an example of this task, in which the model is given a video and a question and is asked to choose an answer from multiple choices.

5.1.5 Video Question Answering - Datasets

Similar to image question answering, several datasets were created to address the challenge of video question answering. In the following, we cover those datasets that are popular and extensively used.

MovieQA.

The MovieQA (http://movieqa.cs.toronto.edu/home) (?) dataset is used to evaluate story comprehension of both video and text in an automatic manner. The dataset consists of almost 15,000 multiple-choice questions and answers obtained from over 400 highly diverse movies. Table 75 reports the statistics and splits of the dataset.

TVQA.

The TVQA (http://tvqa.cs.unc.edu) (?) dataset was created from videos of six different English TV shows, viz. Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey’s Anatomy, and Castle. It consists of 460 hours of video, and the questions are designed to be compositional, expecting the models to comprehend subtitle-based dialogue and to recognize relevant visual concepts. Table 76 presents the statistics of the dataset, while Table 77 shows the splits.

The testing data of TVQA is further split into two subsets named “test-public”, containing 7,623 Q&A pairs, and “test-reserved”, consisting of 7,630 Q&A pairs. The test-public set is available for the TVQA leaderboard (http://tvqa.cs.unc.edu/leaderboard.html), whereas test-reserved is preserved for future use.

TVQA+ (http://tvqa.cs.unc.edu/download_tvqa_plus.html) (?) is an augmented subset of the original TVQA dataset, where the augmentation comes in the form of bounding boxes linking depicted objects to visual concepts in both questions and answers. Table 78 presents the splits of the TVQA+ dataset.

5.1.6 Video Question Answering - Evaluation Measures, Models and Results

In this section, we present the evaluation measures, models, and results achieved with various architectures of Video Q&A.

Evaluation Measures.

Video Q&A models are evaluated based on Accuracy. In addition, other measures are used, such as Temporal mean Intersection-over-Union (Temp. mIoU) (?), Answer-Span joint Accuracy (ASA), which jointly evaluates both answer prediction and span prediction, and object grounding performance calculated with mean Average Precision (Grd. mAP) (?).
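To make the span-based measures concrete, the following Python sketch computes the temporal IoU between a predicted and a ground-truth time span, its mean over a dataset, and a joint answer-span accuracy. The representation of spans as (start, end) pairs in seconds, the IoU threshold of 0.5 used for the joint measure, and the function names are assumptions made for illustration, not definitions given in this section.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds, with start <= end


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time intervals."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_temporal_iou(preds: Sequence[Span], golds: Sequence[Span]) -> float:
    """Average temporal IoU over all questions."""
    if len(preds) != len(golds) or not preds:
        raise ValueError("need equally sized, non-empty span lists")
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)


def answer_span_joint_accuracy(
    answer_correct: Sequence[bool],
    preds: Sequence[Span],
    golds: Sequence[Span],
    iou_threshold: float = 0.5,  # assumed threshold
) -> float:
    """A question counts only if the answer is right AND the span overlaps enough."""
    if not (len(answer_correct) == len(preds) == len(golds)) or not preds:
        raise ValueError("need equally sized, non-empty inputs")
    hits = sum(
        ok and temporal_iou(p, g) >= iou_threshold
        for ok, p, g in zip(answer_correct, preds, golds)
    )
    return hits / len(preds)
```

The joint measure is never higher than plain answer accuracy, since a correct answer with a poorly localized span is counted as an error.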

Models.

The models which are created to address the task of Video Question Answering aim to provide an overall understanding of the visual and the aligned textual content such as subtitles. In Table 79, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both video and language. We also include a column that showcases the optimization techniques used to train those models.

Results.

Several models have been created to approach the task of Video Question Answering. At the same time, many datasets have been created to provide diversity in the content and thereby boost the generalization ability of the models. In this section, we cover the results achieved by the models on some representative datasets. Table 80 and Table 81 present results obtained with a subset of models built using the TVQA and TVQA+ datasets presented in Section 5.1.5. Results for TVQA (http://tvqa.cs.unc.edu/leaderboard.html) and TVQA+ (https://competitions.codalab.org/competitions/22705#results) can also be found on the respective leaderboards.

5.1.7 Video Question Answering - Discussion

It has been observed from STAGE (?) that aligned fusion is essential for improving Video Q&A performance. STAGE uses all of the existing information, such as subtitles, video, and questions, to build an efficient model. Giving models access to timestamp information has also proven effective, as shown in Table 80.

5.2 Visual Reasoning

The goal in visual reasoning is to learn a model that comprehends the visual content by reasoning about it. Both images and videos are used as visual inputs for visual reasoning. In the following, we provide a detailed description of this complex and challenging task.

5.2.1 Image Reasoning - Introduction

The goal of image reasoning is to answer sophisticated queries by reasoning about the visual world. Initial efforts (?) aimed at designing diagnostic tests going beyond benchmarks such as VQA. They reduced the biases by having detailed annotations describing the kind of reasoning each question requires. It has also been observed that VQA models struggle when comparing the attributes of objects, or when novel attribute combinations need to be recognized (such as in compositional reasoning). A novel approach (?) used a program generator to construct an explicit representation of the reasoning process, and an execution engine to execute the resulting program, producing an answer. Then, end-to-end module networks (?) were proposed, which learn to reason by directly predicting instance-specific network layouts without the aid of a parser as used in neural module networks. ? (?) went further and proposed Relation Networks (RNs) as a simple plug-and-play module to solve the problem of visual reasoning. RNs have further been used to learn relation-aware visual features for content-based image retrieval (?) and also in Multi-Relational Networks (?). Furthermore, global context reasoning (?) has been explored for better aligning image and language domains in diverse and unrestricted cases.
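The idea behind Relation Networks can be sketched as applying a pairwise function g to every ordered pair of object representations, summing the results, and passing the sum through a second function f, i.e., RN(O) = f(Σ_{i,j} g(o_i, o_j)). The Python sketch below follows this form; in practice g and f are learned networks and g is usually also conditioned on the question, whereas the toy functions used here are purely hypothetical stand-ins.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def relation_network(
    objects: Sequence[Vector],
    g: Callable[[Vector, Vector], Vector],
    f: Callable[[Vector], float],
) -> float:
    """Compute f(sum over all ordered pairs (i, j) of g(o_i, o_j))."""
    if not objects:
        raise ValueError("at least one object is required")
    pair_sum = None
    for o_i in objects:
        for o_j in objects:
            relation = g(o_i, o_j)
            if pair_sum is None:
                pair_sum = list(relation)
            else:
                pair_sum = [a + b for a, b in zip(pair_sum, relation)]
    return f(pair_sum)


# Hypothetical stand-ins for the learned networks g and f.
def toy_g(a: Vector, b: Vector) -> Vector:
    return [abs(a[0] - b[0]), a[1] * b[1]]


def toy_f(v: Vector) -> float:
    return sum(v)
```

Because the pairwise outputs are summed, the result does not depend on the order in which the objects are listed, which is what makes the module easy to plug into different architectures.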

A recent approach (?) introduced a general-purpose conditioning method called Feature-wise Linear Modulation (FiLM) layers which influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. FiLM was modified by ? (?) to generate parameters of FiLM layers going up the hierarchy of a convolutional network in a multi-hop fashion rather than all at once. Cascaded Mutual Modulation (CMM) (?) is an end-to-end visual reasoning model that also uses the FiLM technique to enable the textual/visual pipeline to mutually control each other. Another approach modified neural modular networks (?) such that it performs compositional reasoning by automatically inducing a desired sub-task decomposition without relying on strong supervision. ? (?) proposed a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly interpretable manner. Also, in the context of interpretable learning frameworks, Learning-By-Asking (LBA) (?) attempted to closely mimic natural learning with the goal to make it more data efficient than the traditional VQA setting. Further, compositional attention networks (?) were designed as fully differentiable neural network architectures to facilitate explicit and expressive reasoning. The goal of this architecture is to provide a strong prior for iterative reasoning, allowing it to support structured learning, as well as to generalize from a modest amount of data.
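The feature-wise affine transformation at the core of FiLM can be sketched as follows: a conditioning vector (e.g., an encoded question) is mapped to one scale γ and one shift β per feature map, and each feature map x is transformed to γ·x + β. In this Python sketch the generator is a single linear layer with hand-set weights, and each feature map is a flat list of activations; both choices, like the function names, are simplifying assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Sequence[float]) -> Vector:
    """Plain linear layer: weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def film_parameters(
    condition: Sequence[float],
    w_gamma: Matrix, b_gamma: Vector,
    w_beta: Matrix, b_beta: Vector,
) -> Tuple[Vector, Vector]:
    """Map the conditioning vector to per-feature-map scales and shifts."""
    return linear(w_gamma, b_gamma, condition), linear(w_beta, b_beta, condition)


def film(feature_maps: Sequence[Sequence[float]], gamma: Vector, beta: Vector) -> Matrix:
    """Apply gamma_c * x + beta_c to every activation of feature map c."""
    if not (len(feature_maps) == len(gamma) == len(beta)):
        raise ValueError("need one (gamma, beta) pair per feature map")
    return [[g * v + b for v in fmap] for fmap, g, b in zip(feature_maps, gamma, beta)]
```

A scale of zero switches a feature map off entirely, and a negative scale inverts it, so the conditioning input can select which visual features influence later layers.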

Recently, neural-symbolic visual question answering (?) attempted to combine deep representation learning with symbolic program execution. It first recovers structural scene representation from the image and a program trace from the question. This was extended with a Neuro-Symbolic Concept Learner (NS-CL) (?) that learns visual concepts, words, and semantic parsing of sentences without explicit supervision. It learns by simply looking at images and reading paired questions and answers. Further, a multimodal relational network (MuRel) (?) was proposed to learn end-to-end reasoning over real images. Additionally, ? (?) used spatial knowledge to aid visual reasoning. Their framework combined knowledge distillation, relational reasoning, and probabilistic logical languages. Existing diagnostic tests have been further modified with referring expressions to handle bias (?) and with structural, relational, and analogical reasoning in a hierarchical representation (?). Explainable and explicit neural modules (?) have also been explored with scene graphs. Objects as nodes and pairwise relationships as edges were used for explainable and explicit reasoning with structured knowledge.
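The combination of a structural scene representation with a program trace can be illustrated with a toy symbolic executor. In the Python sketch below, the scene stores objects as nodes and pairwise relationships as (subject, relation, object) edges, and a program is a list of (operation, argument) steps. The operation names, the scene format, and the example question are hypothetical; in the surveyed approaches both the scene and the program are predicted by neural networks rather than written by hand.

```python
from typing import Any, Dict, List, Tuple

Scene = Dict[str, Any]
Program = List[Tuple[str, Any]]

# Hand-written scene standing in for the output of a scene parser.
SCENE: Scene = {
    "objects": [
        {"id": 0, "shape": "cube", "color": "red", "size": "large"},
        {"id": 1, "shape": "sphere", "color": "blue", "size": "small"},
        {"id": 2, "shape": "cube", "color": "blue", "size": "small"},
    ],
    # (subject id, relation, object id): "object 1 is left of object 0"
    "relations": [(1, "left_of", 0), (2, "right_of", 0)],
}


def execute(program: Program, scene: Scene) -> Any:
    """Run a program step by step over the current set of selected objects."""
    selected = list(scene["objects"])
    for op, arg in program:
        if op == "filter":  # keep objects whose attribute equals a value
            attribute, value = arg
            selected = [o for o in selected if o[attribute] == value]
        elif op == "relate":  # move to objects standing in relation `arg` to the selection
            anchors = {o["id"] for o in selected}
            related = {s for s, rel, t in scene["relations"] if rel == arg and t in anchors}
            selected = [o for o in scene["objects"] if o["id"] in related]
        elif op == "count":
            return len(selected)
        elif op == "exist":
            return len(selected) > 0
        elif op == "query":  # read an attribute of a uniquely selected object
            if len(selected) != 1:
                raise ValueError("query needs exactly one selected object")
            return selected[0][arg]
        else:
            raise ValueError(f"unknown operation: {op}")
    return selected


# "What color is the object to the left of the red cube?"
PROGRAM: Program = [
    ("filter", ("color", "red")),
    ("filter", ("shape", "cube")),
    ("relate", "left_of"),
    ("query", "color"),
]
```

Since every intermediate selection is an explicit set of objects, each step of the reasoning can be inspected, which is the main appeal of this family of methods with respect to interpretability.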

Further expanding the scope of inquiry on this subject, ? (?, ?) exploit the compositional linguistic structure of complex questions by forming neural module networks which query about the abstract shapes observed in an image. Improvement is further seen in how images are interpreted. For example, compositional question answering (?) was addressed with scene graph structures on real-world images going beyond abstract shapes. Figure 10 demonstrates the task of reasoning about real-world images.

? (?) introduced a reasoning task that requires commonsense knowledge, while the goal of NLVR (?) and NLVR2 (?) tasks is to determine whether a sentence is true about a visual input or not.

5.2.2 Image Reasoning - Datasets

For image reasoning, both real and synthetic image datasets have been developed. In the following, we present the datasets belonging to both of these categories.

Compositional Language and Elementary Visual Reasoning (CLEVR).

CLEVR (https://cs.stanford.edu/people/jcjohns/clevr) (?) is a diagnostic dataset created using a 3D computer graphics toolkit known as Blender (https://www.blender.org). It consists of synthetic images of simple 3D objects that vary in their attributes, viz. size, color, shape, and material. Images contain three to ten different combinations of these objects and attributes, arranged in different spatial positions. Such complex configurations require good visual reasoning capabilities from VQA models to produce correct answers. Table 82 presents the splits of the dataset.

Natural Language Visual Reasoning (NLVR).

The Cornell Natural Language for Visual Reasoning dataset, dubbed NLVR (http://lil.nlp.cornell.edu/nlvr) (?), is a multimodal dataset that comes with natural language sentences grounded in synthetic images. The images are rendered and encapsulate different objects such as triangles, circles, and squares. These objects come in various sizes and are placed at different positions within images. The descriptions of the images were manually written by crowdworkers. Table 83 presents the official splits of the dataset for evaluation purposes.

Natural Language Visual Reasoning for Real (NLVR2).

Limitations such as restricted expressivity and semantic diversity, which arose from the synthetic nature of the NLVR dataset, have been addressed in the next incarnation of NLVR, named Natural Language for Visual Reasoning for Real, NLVR2 (http://lil.nlp.cornell.edu/nlvr) (?). Similar to NLVR, the images in NLVR2 also come as a pair along with a grounded natural language description. Table 84 presents the official splits of the dataset.

CLEVR-CoGenT.

A modified version of CLEVR is the Compositional Generalization Test (CLEVR-CoGenT) (https://cs.stanford.edu/people/jcjohns/clevr) (?) dataset. It is used to test models’ ability to handle novel combinations of attributes at test time. There are two types of conditions in this dataset, viz. Condition A and Condition B, where, based on the condition, the color of the geometrical shape can vary, as shown in Table 85. Based on these conditions, the CLEVR-CoGenT dataset is divided for evaluation purposes as shown in Table 86.

GQA.

The GQA (https://cs.stanford.edu/people/dorarad/gqa) (?) dataset was created to address the shortcomings in earlier VQA datasets. GQA consists of compositional questions over real-world images. Each image is associated with a scene graph of the image’s objects, attributes, and relations. Also, each question is associated with a structured representation of its semantics. Table 87 presents the statistics and splits of the dataset.

Relational and Analogical Visual rEasoNing (RAVEN).

The RAVEN (http://wellyzhang.github.io/project/raven.html) (?) dataset was designed to perform relational and analogical visual reasoning. It is built with Raven’s Progressive Matrices (RPM) (?) in mind. Furthermore, it associates vision with structural, relational, and analogical reasoning in a hierarchical representation. The dataset is split into training, validation, and testing sets in the ratio 6:2:2, respectively. Table 88 presents the statistics of the dataset.

Visual Commonsense Reasoning (VCR).

VCR (https://visualcommonsense.com) (?) is a large-scale dataset for achieving cognition-level visual understanding. It contains about 110k images, 290k multiple-choice questions, and correspondingly 290k correct answers and rationales. This dataset is very diverse and, consequently, challenging. Table 89 presents the official splits and some high-level statistics of the dataset.

Visual COMmonsense rEasoning in Time (Visual COMET).

Visual COMET (https://visualcomet.xyz) (?) is a large-scale dataset of Visual Commonsense Graphs for reasoning about the dynamic context of static images in order to achieve cognitive visual scene understanding. Visual COMET contains images with person grounding (i.e., multimodal co-reference chains), and the images are connected with inference sentences. Table 90 presents the official splits and more statistics about the dataset.

5.2.3 Image Reasoning - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Image Reasoning and the results obtained by them.

Evaluation Measures.

The standard evaluation measures such as Accuracy are used for benchmarking purposes. However, there are evaluation measures that are explicitly used for Image Reasoning (e.g., CLEVR), viz. Querying Attribute (QA) that uses questions to ask about an attribute of a particular object, Compare Attribute (CA) which uses comparison questions for asking whether two objects have the same value for some attribute, Compare Numbers (CN) which uses comparison questions to ask which of two object sets is larger, Count which asks counting questions to find the number of objects fulfilling some conditions, and Exist which asks existence questions to check whether a certain type of object is present or not.
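Reporting accuracy separately for each of these question types, alongside the overall accuracy, amounts to grouping predictions by type before averaging. The Python sketch below shows one way to do this; the record format (question type, predicted answer, gold answer) and the type labels used in the example are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (question type, predicted answer, gold answer)


def accuracy_by_type(records: Iterable[Record]) -> Tuple[Dict[str, float], float]:
    """Return per-question-type accuracy and overall accuracy."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for question_type, predicted, gold in records:
        totals[question_type] += 1
        correct[question_type] += int(predicted == gold)
    if not totals:
        raise ValueError("at least one record is required")
    per_type = {t: correct[t] / totals[t] for t in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_type, overall


EXAMPLE = [
    ("Count", "2", "2"),
    ("Count", "3", "4"),
    ("Exist", "yes", "yes"),
    ("QA", "red", "red"),
]
```

The overall score is weighted by how many questions each type contributes, so it can hide a weak category; this is why the per-type breakdown is reported for diagnostic datasets.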

Models.

The models that are designed to approach the task of Image Reasoning are built such that they provide an effective way of reasoning about vision with language as additional input. In Table 91, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both image and language. We also include a column that showcases the optimization techniques used to train the Image Reasoning models.

Results.

The models designed on different Image Reasoning datasets aim to achieve generalization. In this section, we cover the results achieved by the models on some representative datasets. Table 92, Table 93, Table 94, and Table 95 present results obtained with a subset of models built using datasets such as CLEVR, GQA, VCR, and RAVEN that were presented in Section 5.2.2. Results for the NLVR and NLVR2 tasks can be found on the respective leaderboards (http://lil.nlp.cornell.edu/nlvr).

5.2.4 Image Reasoning - Discussion

The task of Image Reasoning has been studied using different types of datasets. Initially, a synthetic dataset, viz. CLEVR, was used. Later, real-world datasets like GQA were created for developing more complex vision and language integration models. Table 92 shows the results for the CLEVR dataset. The recently introduced Neuro-Symbolic Concept Learner (NS-CL) (?) reaches state-of-the-art results without explicit supervision on visual concepts, words, and semantic parsing of sentences. However, for real-world image datasets like GQA, the approach by ? (?), which creates Language-Conditioned Graph Networks (LCGN) providing different hops to effectively support relational reasoning, achieves the best results. Most of the best-performing works on the VCR task are pretrained and then fine-tuned, as shown in Table 94.

The RAVEN dataset differs from both CLEVR and GQA as it depends only on the image input. We can observe from Table 95 that a perfect solver achieves 100% accuracy, while the approach introduced by ? (?) achieves reasonable system performance.

5.2.5 Video Reasoning - Introduction

Compared to image reasoning, the video reasoning task is in its nascent stages and hence has no clearly defined goal. However, for video reasoning, there exists a task of configurable visual question and answer (COG) designed by ? (?). The goal of COG is to address problems related to visual and logical reasoning and memory. More concretely, the task is aimed at deducing the correct answer by pointing to the right object while taking into account the changes of the scene, i.e., from both the spatial and the temporal perspective. Figure 11 demonstrates the task of temporal reasoning about synthetic 2D scenes resembling video input.

Further, ? (?) addressed both image and video reasoning by introducing the concept of a question-based visual guide to constrain the potential solution space by learning an optimal traversal scheme. In their approach, the final destination nodes alone are used to produce the answers.

5.2.6 Video Reasoning - Datasets

There are not many datasets for video reasoning. One of the few examples is listed below.

Configurable Visual Question and Answer (COG).

COG (https://github.com/google/cog#datasets) (?) was created to parallel experiments in humans and animals. Table 96 presents the splits of the dataset.

5.2.7 Video Reasoning - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Video Reasoning and the results obtained by them.

Evaluation Measures.

For Video Reasoning (e.g., COG) the evaluation measures used are based on changes of the scene in three different query types.

• Pointing (Point) queries ask the system to point to a certain object.

• Yes/No queries seek a binary decision, while Conditional (Condit) queries are composed of questions based on objects that need to fulfill certain conditions.

• Attribute-related (Atts) queries are composed of questions about certain attributes.

Models.

Many models have been created to approach the task of Video Reasoning. In Table 97, we present some exemplar architectures (refer to Combined column) created to address the task by integrating video and language.

Results.

As discussed earlier, several models have been created to approach the task of Video Reasoning. In Table 98, we present the results obtained with a subset of models built using the COG dataset presented in Section 5.2.6.

5.2.8 Video Reasoning - Discussion

The results presented in Table 98 show that the recently proposed approach by ? (?) achieves the best result on different task-specific measures. This approach proposes a question-based visual guide, which constrains the potential solution space by learning an optimal traversal scheme of a graph.

5.3 Visual Entailment

The goal of the Visual Entailment task is to learn a model that predicts whether the visual content, together with any text that augments it, entails a given hypothesis. Both images and videos are used as visual inputs. In the following, we describe the task, the datasets used, and the approaches that have been proposed to tackle the problem.

5.3.1 Image Entailment - Introduction

To address the perceived drawbacks of VQA and visual reasoning, i.e. that they deal with similar objects and sentence structures, ? (?) initially proposed a visually-grounded version of the Textual Entailment task where an image is augmented with textual premise and hypothesis. However, this task was refined by ? (?) to predict whether the image semantically entails the text, given image-sentence pairs, where the premise is defined by an image instead of a natural language sentence. Figure 12 illustrates the task, where the image as a premise and a piece of text as hypothesis are used by the Image Entailment model to predict whether the hypothesis is an entailment, contradiction, or neutral.

5.3.2 Image Entailment - Datasets

The image entailment task is studied using two different datasets. One dataset extends Natural Language Inference with Visually-grounded Natural Language Inference (V-SNLI) (?), while the other extends the Flickr30K dataset (see Section 3.1.2) into a visual entailment dataset (SNLI-VE) (https://github.com/necla-ml/SNLI-VE) (?). Table 99 and Table 100 present the statistics and splits of these two datasets, respectively.

5.3.3 Image Entailment - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Image Entailment and the results obtained by them.

Evaluation Measures.

The Image Entailment task is evaluated using the Accuracy measure.

Models.

Two different models are created to approach the task of Image Entailment. In Table 101, we present some exemplar architectures (refer to Combined column) created to address the task. We also include a column that showcases the optimization techniques used to train those models.

Results.

The Image Entailment models leverage both image and textual input representations to build an entailment pipeline. In Table 102, Table 103, and Table 104 we present results obtained with a subset of models that were built using the datasets presented in Section 5.3.2.

5.3.4 Image Entailment - Discussion

The task of Image Entailment was evaluated using two different datasets. Table 103 and Table 104 show results obtained on V-SNLI in different settings. The approach proposed by ? (?), which creates a visually grounded Bilateral Multi-Perspective Matching (BiMPM) model, achieves the best result for the entailment task.

Similarly, evaluations conducted with SNLI-VE dataset (cf. Table 102) show that the Explainable Visual Entailment (EVE) approach proposed by ? (?) achieves the best overall result.

5.3.5 Video Entailment - Introduction

Video entailment (?) aims to infer whether the natural language hypothesis is entailed or contradicted when given a video clip aligned with the subtitles information. The video contains diverse temporal dynamics, event shifts, and social interactions. Figure 13 illustrates the task: given a video clip with aligned subtitles as premise and a natural language hypothesis based on the video content, a video entailment model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.

5.3.6 Video Entailment - Datasets

The Video Entailment task was proposed by ? (?), together with the introduction of a large-scale dataset called VIdeO-and-Language INference (VIOLIN) (https://github.com/jimmy646/violin). Detailed statistics of the dataset are presented in Table 105.

For training and model evaluation purposes, the VIOLIN dataset is split into training, validation, and test splits in the ratio of 8:1:1. The exact number of triplet instances in each of the splits is shown in Table 106.
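As a generic illustration of such a ratio-based partition, the Python sketch below shuffles a collection with a fixed seed and cuts it into three parts according to integer ratios. It is not the procedure used to build the official splits, whose exact sizes are those reported in the corresponding table; the function name, the seed, and the rounding rule (remainders go to the last part) are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_ratio(
    items: Sequence[T],
    ratios: Tuple[int, int, int] = (8, 1, 1),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle deterministically, then cut into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # receives any rounding remainder
    return train, val, test
```

Passing ratios=(6, 2, 2) yields a 6:2:2 partition in the same way. For video data, splitting at the level of whole clips rather than individual statements avoids the same clip appearing in more than one part.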

5.3.7 Video Entailment - Evaluation Measures, Models, and Results

In this section, we present the evaluation measures, models, and results achieved with various architectures introduced for solving the Video Entailment task.

Evaluation Measures.

The Video Entailment models are evaluated using Accuracy.

Models.

Very few models have been created to approach the task of Video Entailment. The Video Entailment models vary in the types of textual content they use, such as subtitles, statements, etc. In Table 107, we present some exemplar architectures (refer to Combined column) created to address the task by integrating both video and language inputs. We also include a column that showcases the optimization techniques used to train those models.

Results.

The few models that have been designed to approach the task of Video Entailment use different types of textual content aligned with video. In Table 108, we present results obtained with a subset of models built using the VIOLIN dataset presented in Section 5.3.6. For building textual or visual representations, models such as SSV have used pretrained vision and language integration models such as LXMERT (?).

5.3.8 Video Entailment - Discussion

The task of Video Entailment was evaluated using the VIOLIN dataset, and the recently proposed method by ? (?) has shown that using multi-source information arising from different types of data, such as statements, subtitles, and visual features, is useful for building a robust model. In addition, textual features generated using contextualized word embedding models are effective as well.

6 Visual Dialog

In this section, we explore the task of Visual Dialog. The objective of visual dialog is different from the previously discussed tasks and involves a complex interaction between a human and an artificial agent.

6.1 Image Dialog

In the following, we describe the setting of Visual Dialog where an image is used as the visual input.

6.1.1 Image Dialog - Introduction

The goal of the image dialog task is to create AI agents that can hold a dialog with humans, in a natural language of choice, about visual content (?) represented by an image. More specifically, given an image, a dialog history, and a question about the image, the goal of an AI agent is to ground the question in the image, infer the context from the history, and then answer the question accurately. However, this problem can also be construed as a task where the goal of the AI system is to locate an unknown object in the image by asking a sequence of questions (?) or to hold natural-sounding conversations about a shared image (?). Figure 14 provides a visual depiction of the task.

Further, a standard agent can be extended to have a question and answer bot cooperating with each other for guessing images (?). To counter generic responses in dialog generation, knowledge transfer from dialog generation was explored with a discriminative dialog module trained to rank a list of candidate human responses (?). However, other approaches constrained themselves to specific domains and proposed end-to-end optimization schemes (?). ? (?) introduced attentive memory that exploits visual attention in the past to resolve the current reference. Recently, reinforcement learning and Generative Adversarial Networks (GANs) were also used to generate more human-like responses to questions in the image-based dialog (?). Dialog can also be seen from the perspective of a system which asks questions, and demonstrates how a visual dialog can be generated from discriminative question generation and answering (?). Furthermore, co-reference resolution was also investigated (?) to bridge the gap between nouns and pronouns with the usage of modules that form explicit and grounded co-reference resolution at word-level.

Recently, a novel attention mechanism called recursive visual attention (?) was proposed to resolve visual co-reference for visual dialog by browsing the dialog history. Another approach (?) formalized the task as inference in a graphical model with partially observed nodes and unknown graph structures, i.e., relations in dialog. Further, ? (?) extended the one-stage solution to a two-stage solution by building an image-question-answer synergistic network that values the role of the answer for precise visual dialog. Other novel approaches (?) were also designed in which a visually-grounded encoder was employed to synergize between guessing and asking questions. Further, a cooperative learning regime was followed to improve the accuracy.

6.1.2 Image Dialog - Datasets

Several datasets have been created for the image dialog task. In the following, we elaborate on each of them separately.

VisDial.

For Image Dialog, there exist two versions of this dataset, VisDial v0.9 and VisDial v1.0 (https://visualdialog.org/data) (?). VisDial was created using the MSCOCO dataset. For VisDial v0.9, the data is divided only into training and validation sets. Table 109 and Table 110 present details about the splits of VisDial v0.9 and VisDial v1.0 respectively.

CLEVR-Dialog.

The CLEVR-Dialog (https://github.com/satwikkottur/clevr-dialog) (?) dataset was developed for studying multi-round reasoning in visual dialog. The dialog grammar is grounded in the scene graphs of the CLEVR dataset (Section 5.2.2), originally developed for reasoning about images. Table 111 provides statistics of the dataset, while Table 112 shows the dataset splits.

6.1.3 Image Dialog - Evaluation Measures, Models and Results

In this section, we review the measures used to evaluate different models of Image Dialog and the results achieved by these models.

Evaluation Measures.

The Image Dialog models are evaluated using the Retrieval metrics that have been discussed in Section 3.1.3.
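As an illustration of how such retrieval-style evaluation is typically computed, the following is a minimal sketch. It assumes that each dialog round comes with a list of candidate answers scored by the model, and that the reported statistics are recall@k, mean reciprocal rank (MRR), and mean rank of the ground-truth answer; these specific statistics and the function names are assumptions made here for illustration, not definitions taken from Section 3.1.3.

```python
import numpy as np


def rank_of_ground_truth(scores, gt_index):
    """Return the 1-based rank of the ground-truth candidate (1 = best)."""
    scores = np.asarray(scores, dtype=float)
    # Count the candidates that the model scored strictly higher.
    return int(np.sum(scores > scores[gt_index])) + 1


def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Summarise 1-based ground-truth ranks, one entry per dialog round."""
    ranks = np.asarray(gt_ranks, dtype=float)
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["MRR"] = float(np.mean(1.0 / ranks))   # mean reciprocal rank
    metrics["Mean"] = float(np.mean(ranks))        # mean rank, lower is better
    return metrics


# Hypothetical ranks of the human answer in four dialog rounds.
print(retrieval_metrics([1, 3, 12, 2]))
```

Higher recall@k and MRR and a lower mean rank indicate that the model places the human response closer to the top of the candidate list.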

Models.

The models created to approach the Image Dialog task continuously process a stream of images and textual dialog information. In Table 113, we present some exemplar architectures (refer to Combined column) designed to integrate image and textual dialog to address the task.

Results.

Models created to solve the Image Dialog task must cope with the complexity of the task. Several approaches are used to build the models with different versions of the same dataset. However, a few approaches share some commonalities, such as the usage of Memory Networks (?). Table 114 and Table 115 present the results obtained with a subset of both discriminative and generative models built using the VisDial v0.9 dataset, while Table 116 presents the results obtained only with a subset of generative models built using the VisDial v1.0 dataset presented earlier in Section 6.1.2.

6.1.4 Image Dialog - Discussion

For the Image Dialog task, two versions of the same dataset were used for evaluation. Both versions were evaluated in a similar manner using retrieval metrics. Nevertheless, the methods that achieve state-of-the-art performance on the two versions differ. Among the generative and discriminative methods on the VisDial v0.9 dataset, the Recursive Visual Attention (RvA) approach proposed by ? (?) achieves the best result. RvA refines the visual attention recursively by browsing through the dialog history until the agent has sufficient confidence in its visual co-reference resolution. This has also been shown to generate interpretable attention maps without additional annotations.

For the VisDial v1.0 dataset, the results presented in Table 116 show that Synergistic-ensemble by ? (?) outperforms RvA.

6.2 Video Dialog

In this part, we present details about the Visual Dialog task in which a video is used as the visual input and a conversational chat with humans about the visual content is expected.

6.2.1 Video Dialog - Introduction

The aim of video dialog is to leverage scene information containing both audio (which can be transcribed as subtitles) and visual frames to hold a dialog (i.e., an exchange) with humans in a natural language of choice about the multimedia content (?, ?). A successful system is expected to ground concepts from the question in the video while leveraging contextual cues from the dialog history. Figure 15 illustrates the video dialog task.

Several approaches have been proposed to address the task, where initially multimodal attention-based video description features were used to improve dialog (?). Further, a novel baseline (?) analyzed components such as data representation, extraction, attention, and answer generation in order to show that there can be relative improvements as compared to other approaches.

6.2.2 Video Dialog - Datasets

Audio Visual Scene-Aware Dialog (AVSD) (https://video-dialog.com) (?) was created for the Scene-Aware Dialog Challenge, in which the agent grounds its responses on the dynamic scene, the audio, and the history (previous rounds) of the dialog. Table 117 presents some statistics and the splits of the AVSD dataset.

6.2.3 Video Dialog - Evaluation Measures, Models, and Results

In this section, we review the evaluation measures used to benchmark different models of Video Dialog and the results obtained by these models.

Evaluation Measures.

The Video Dialog models are evaluated using the “Retrieval metrics” discussed in Section 3.1.3.

Models.

Only a couple of models have been proposed so far to approach the task of Video Dialog. These models aim to capture the temporal aspect of a video and incorporate it in the textual dialog. In Table 118, we present some exemplar architectures (refer to Combined column) designed to address the task by integrating both video and language inputs. We also include a column that showcases the optimization techniques used to train those models.

Results.

As discussed earlier, only a few models have been created to address the task of Video Dialog. In Table 119, we present the results obtained with those models built using the AVSD dataset presented earlier in Section 6.2.2.

6.2.4 Video Dialog - Discussion

The Video Dialog task is evaluated with the AVSD dataset. Different strategies have been explored to fuse the language and video features to create a strong baseline. In particular, the approach proposed by ? (?), which uses beam search and the attention mechanism (i.e., Att-base-beam) over different modalities, outperforms other baseline methods.

7 Multimodal Machine Translation

In this section, we explore the task of Multimodal Machine Translation (MMT). The goal of this task is to translate natural language sentences that describe visual content (e.g., an image) in a source language into a target language by taking the visual content as an additional input to the source language sentences.

7.1 Machine Translation with Image

In the following, we elaborate on the Multimodal Machine Translation task by considering image as the only visual input.

7.1.1 Machine Translation with Image - Introduction

The aim of MMT (?, ?, ?, ?) is to translate sentences, that describe an image, in a source language into equivalent sentences in a target language. However, for any given image the description can be written in different source languages, resulting in multiple source language descriptions. This situation opens up the possibility to propose different variants of the MMT task. The first variant is a single source translation task, in which the image description in a single source language is translated to the target language with additional cues from the corresponding image. Figure 16 depicts this variant where an image is accompanied with its description in English and needs to be translated by the model into a description in German.

The second variant is a target language description generation task with additional source language cues, i.e., multiple source language descriptions of the same image termed as multisource MMT. Figure 17 illustrates this variant, where an image is accompanied with its descriptions in English (en), French (fr), and Czech (cs), that are all used to generate the German (de) translation.

Different approaches have been proposed to handle single source MMT by associating visual and textual features with multimodal attention (?). Further, a novel approach where a doubly-attentive decoder incorporated visual features to bridge the gap between image description and translation was proposed (?). In a similar vein, global visual features were incorporated in an attention-based multimodal NMT (?). This is achieved by attending to source-language words and parts of an image independently by means of two separate attention mechanisms.
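The idea of attending to source-language words and image parts independently can be sketched as follows. This is a minimal illustration under several assumptions: dot-product scoring is used for simplicity (the cited models may use a different scoring function), the feature dimensions and the 7x7 region grid are hypothetical, and the two context vectors are simply concatenated with the decoder state.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def attend(query, keys):
    """Dot-product attention: return (context vector, attention weights)."""
    weights = softmax(keys @ query)
    return weights @ keys, weights


rng = np.random.default_rng(0)
d = 8                                      # hypothetical feature size
decoder_state = rng.normal(size=d)         # current target-side decoder state
source_words = rng.normal(size=(5, d))     # annotations of 5 source words
image_regions = rng.normal(size=(49, d))   # e.g. a 7x7 grid of region features

# Two separate attention mechanisms, each with its own weights.
text_ctx, text_w = attend(decoder_state, source_words)
image_ctx, image_w = attend(decoder_state, image_regions)

# Both contexts condition the prediction of the next target word.
decoder_input = np.concatenate([decoder_state, text_ctx, image_ctx])
print(decoder_input.shape)
```

Keeping the two attention distributions separate lets the decoder weight words and image regions on their own scales instead of forcing them to compete within one softmax.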

The MMT task can also be decomposed into two sub-tasks, learning to translate and learning visually grounded representations (?), which are combined in a multi-task learning framework. Further, an advanced multimodal compact bilinear pooling method (?, ?) has also been used for MMT, in which the outer product of two vectors combines the attention features of the two modalities. Another model (?) used a shared visual-language embedding and a translator for learning. This joint model leverages a visual attention grounding mechanism that links the visual semantics with the corresponding textual semantics. Due to the availability of large amounts of multimodal data on the web, noisy image captions have also been tried for MMT (?). A latent variable model (?) has also been attempted, in which the latent variable can be seen as a multimodal stochastic embedding of an image and its description in a foreign language.
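To make the compact bilinear pooling idea concrete, the sketch below shows the standard trick of approximating the outer product of two vectors with count sketches that are combined by circular convolution (computed with the FFT). The hash functions, dimensions, and function names are hypothetical and chosen only for illustration; the code also checks that the FFT route equals the count sketch of the explicit outer product under the combined hash.

```python
import numpy as np


def count_sketch(x, h, s, d):
    """Project x to d dimensions: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)
    return out


def mcb_pool(x, y, hx, sx, hy, sy, d):
    """Circular convolution of the two count sketches, via the FFT."""
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)


def sketch_outer_product(x, y, hx, sx, hy, sy, d):
    """Count sketch of the explicit outer product, for comparison only."""
    outer = np.outer(sx * x, sy * y)
    index = (hx[:, None] + hy[None, :]) % d   # combined hash of entry (i, j)
    out = np.zeros(d)
    np.add.at(out, index.ravel(), outer.ravel())
    return out


rng = np.random.default_rng(0)
n_text, n_image, d = 6, 10, 16               # hypothetical sizes
hx, hy = rng.integers(0, d, n_text), rng.integers(0, d, n_image)
sx, sy = rng.choice([-1.0, 1.0], n_text), rng.choice([-1.0, 1.0], n_image)
text_att, image_att = rng.normal(size=n_text), rng.normal(size=n_image)

fused = mcb_pool(text_att, image_att, hx, sx, hy, sy, d)
explicit = sketch_outer_product(text_att, image_att, hx, sx, hy, sy, d)
print(np.allclose(fused, explicit))
```

The benefit is that the fused vector has a fixed size d, whereas the explicit outer product grows with the product of the two input dimensions.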

MMT models have also been used in an adversarial setting. ? (?) found that even in the presence of visual features from unrelated images there is no significant performance degradation. Due to the recent success of unsupervised machine translation (?), there is also a growing interest in extending it for unsupervised MMT (?). Other studies (?) have reduced criticism of MMT by showing that under the limited textual context, MMT models are capable of leveraging the visual input to generate better translations. Regarding multisource models, ? (?) explored MMT using neural multi-source sequence-to-sequence learning.

7.1.2 Machine Translation with Image - Datasets

The main dataset used with the models above (Section 7.1.1) is the Multi30k-MMT (https://www.statmt.org/wmt18/multimodal-task.html) dataset (?), which extends the Flickr30k dataset. Along with English, it contains human-translated German, French, and Czech sentences. The splits of this dataset can be found in Table 120.

7.1.3 Machine Translation with Image - Evaluation Measures, Models, and Results

In this section, we review the evaluation measures used to benchmark different models of Machine Translation with Image and the results obtained by these models.

Evaluation Measures.

To evaluate Machine Translation with Image models, the “Retrieval metrics” presented in Section 3.1.3 are used.

Models.

Several models have been created for the task of Machine Translation with Image. The aim of these models is to tackle translation using either a single or multiple language textual sources along with an image. In Table 121, we present some exemplar architectures (refer to Combined column) which integrate both image and language to address the task. We also include an “Optimizer” column that indicates the optimization techniques used to train those models.

Results.

In Table 122 and Table 123 we present the results obtained with a subset of models built using the Multi30k-MMT dataset presented earlier in Section 7.1.2.

7.1.4 Machine Translation with Image - Discussion

This task is evaluated using only one dataset, namely Multi30k-MMT, containing descriptions in three source languages and one target language. Results presented in Table 122 and Table 123 refer to the shared task proposed in different years. We can observe that, depending on the year in which the test set was released, different sets of approaches outperform the baseline methods.

7.2 Machine Translation with Video

In the following, we present more details about Multimodal Machine Translation by using the video as the visual input.

7.2.1 Machine Translation with Video - Introduction

The goal in video-guided machine translation (?) is to translate a source language description into the target language equivalent using the video information as additional spatio-temporal context.

Figure 18 illustrates this task where the English language description accompanied by a video needs to be translated into the equivalent description in German.

7.2.2 Machine Translation with Video - Datasets

The VATEX (http://vatex.org/main/index.html) (?) dataset was created for English and Chinese languages to perform machine translation with video and also for the task of generating multilingual video descriptions. Table 124 presents more details about the dataset.

7.2.3 Machine Translation with Video - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Machine Translation with Video and the results obtained by them.

Evaluation Measures.

To evaluate the Machine Translation with Video models, the Language metrics discussed in Section 3.1.3 are used.

Models.

Very few models have been created to investigate the task of Machine Translation with Video. The temporal aspect of a video is crucial for providing effective translations. In contrast to Machine Translation with Image, the task of Machine Translation with Video only has models which are built using a single textual source. In Table 125, we present some exemplar architectures (refer to Combined column) which integrate both video and language inputs for addressing the task. We also include a column that showcases the optimization techniques used to train those models.

Results.

The models that have been created to address the task of Machine Translation with Video are built using a single dataset, namely VATEX. In Table 126, we present results obtained with a subset of models built using the VATEX dataset presented earlier in Section 7.2.2.

7.2.4 Machine Translation with Video - Discussion

In Table 126, we observe that only one method utilizing LSTM with video features from the pretrained I3D model (i.e., NMT+LSTM VI) is evaluated using the language metrics on the challenging VATEX dataset for both English and Chinese.

8 Language-to-Vision Generation

In this section, we explore the task of Language-to-Vision Generation. The goal of this task is to generate visual content given its natural language description. However, different variations of the task exist and will be discussed in the following.

8.1 Language-to-Image Generation

In the following, we describe the setting of Language-to-Image Generation where an image is desired from a piece of natural language text (e.g., a sentence) describing the scene.

8.1.1 Language-to-Image Generation - Introduction

Many different variations of Language-to-Image Generation exist. For example, the generation of an image can also be thought of as the manipulation of an existing image, where a new image is produced according to a desired natural language description. We present some variations in the following.

Sentence-level Language-to-Image Generation.

The goal is to generate images conditioned on natural language descriptions. It is considered a fundamental problem in many applications. The success of Generative Adversarial Networks (GANs) (?) has made possible the generation of interesting images of specific categories, such as room interiors, album covers, and faces (?). This has led to an interest in bridging the gap between natural language text and image modeling. Figure 19 shows the usage of a natural language description for generating an image with a Text-to-Image Generation Model.

Initially, alignDRAW (?) was introduced to iteratively draw patches on a canvas, while attending to the relevant words in the description. Further, it was shown that visual concepts could be translated from characters to pixels (?) with a conditional GAN. This was further improved by taking instructions about what content should be drawn in which location in order to achieve high-quality image generation (?). Models which were developed to condition on classes for image generation (?) have also been used to generate images. However, the quality of images generated is much lower than when not conditioning on classes. Very close to this approach is the Text-conditioned Auxiliary Classifier GAN (TAC-GAN) (?), which conditions images on both the sentence and class information and has been shown to improve their structural coherence. To generate images with high resolution, several GANs were stacked together, yielding StackGAN (?, ?), which used a global sentence representation. This helped generate images of different sizes. To overcome the bottleneck of the global sentence-level representation, an attention-based GAN, AttnGAN (?), was used to capture fine-grained details at different sub-regions of the image by paying attention to the relevant words in the natural language description.

In other research efforts, a hierarchical approach (?) was taken by inferring the semantic layout of the image. Instead of learning a direct description to an image mapping, the generation process is decomposed into multiple steps. First a semantic layout from the text is constructed by the layout generator. Then, the layout is converted to an image by the image generator. Other kinds of approaches such as HDGAN (?) aim to accommodate hierarchical adversarial objectives inside the network to regularize mid-level representations and assist generator training in order to capture complex image information. This has been shown to generate images with high resolutions.

Later, instead of dealing with natural-language descriptions, ? (?) used image-specific scene graphs, enabling explicit reasoning about objects and their relationships. Further, to obtain better high-resolution images, coarse-resolution features were taken as input and the Perceptual Pyramid Adversarial Network (PPAN) was introduced to directly synthesize multi-scale images conditioned on texts in an adversarial way (?). Another approach named MirrorGAN (?) targets both visual realism and semantic consistency when generating images from text. It proposes a global-local attention and semantics-preserving framework in which the image generated from the text is further used to generate the text back. The re-generated description has been shown to align semantically with the given text.

In the following, we explore some of the related ideas which expand the scope of language-to-image generation.

Image Manipulation.

Image manipulation takes a different path from the earlier benchmark approaches about image generation, and so the TAGAN (?) was introduced to generate semantically manipulated images while preserving text-irrelevant contents. Here, the generator learns to generate images where only regions that correspond to the given text are modified. Another interesting approach is to have an interactive system that generates an image in an iterative manner. Recent approaches (?) used attention in both the generator and the discriminator, while others (?) have designed error correction modules to rectify mismatched attributes and complete the missing contents in the generated image. There are also other variations where the source image is manipulated via natural language dialogue (?).

Fine-grained Image Generation.

Fine-grained image generation uses a recurrent image generation model (?) to take into account both the generated output up to the current step as well as all past instructions for generation. This has been shown to add new objects, apply simple transformations to existing objects, and correct previous mistakes. Earlier research never concentrated on fine-grained generation of images, i.e., localizing objects. Recently, control of the location of individual objects within an image was made possible (?) by adding a pathway in an iterative manner and applying them at different locations specified by the bounding boxes to both the generator and the discriminator.

Sequential Image Generation.

The sequential image generation approach StoryGAN (?), based on the sequential conditional GAN, concentrates on story by generating a sequence of images, when given a multi-sentence paragraph. Termed as story visualization, it behaves exactly opposite to image storytelling and has been shown to generate images with high quality, while also achieving contextual consistency.

8.1.2 Language-to-Image Generation - Datasets

For image generation, existing image datasets have been modified to accommodate image descriptions. Initially, the Oxford-102 (http://www.robots.ox.ac.uk/~vgg/data/flowers/102) and Caltech-UCSD Birds (CUB) (http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) datasets, consisting of flower and bird images belonging to 102 and 200 classes respectively, were expanded with image descriptions (?). Table 127 and Table 128 present the splits of the datasets.

Similarly, the MSCOCO dataset (see Section 3.1.2) is also used for the reversed task of description generation, i.e., given a description, generate the image matching the description. We represent this dataset as MSCOCO-Gen. Table 129 presents the splits of the dataset.

8.1.3 Language-to-Image Generation - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Language-to-Image Generation and the results obtained by them.

Evaluation Measures.

There are different evaluation measures which are explicitly used for evaluating Language-to-Image generation models and are discussed below in detail.

• Inception Score (IS) (?) was initially proposed to compare the quality of images generated by GAN models. A pretrained Inception-v3 model (?) is applied to the generated image to obtain the conditional label distribution, which should have low entropy for a high-quality image. A similar idea is applied to the images generated from the given text descriptions for automatic evaluation. Higher scores are better for IS.

• Fréchet Inception Distance (FID) (?) is intended to improve on IS by comparing the statistics of generated samples to those of original samples, instead of evaluating generated samples in isolation. It also depends on the Inception-v3 model. In particular, activations from the pool3 layer of Inception-v3 are used as the features on which generated and original samples are compared. Lower FID is better, as it corresponds to generated samples that are more similar to the original ones.

• R-precision is inspired by the evaluation of ranked retrieval results. It is used as a complementary evaluation metric for language-to-image generation. Specifically, generated images are used to query their corresponding natural language descriptions to find how many relevant descriptions are retrieved.
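The three measures above can be sketched in a few lines, assuming that the class posteriors and feature vectors have already been extracted with a pretrained network (that extraction step is not shown). The function names are hypothetical, the formulas are the commonly used ones rather than definitions quoted from the cited works, and the R-precision sketch assumes exactly one relevant description per generated image.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """probs: (N, C) class posteriors p(y|x) for N generated images."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)   # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))                # exp of the mean KL divergence


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of square roots of the
    # eigenvalues of the product, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


def r_precision(similarity, relevant, r=1):
    """similarity: (N, M) image-to-description scores; relevant: (N,) indices."""
    top_r = np.argsort(-np.asarray(similarity), axis=1)[:, :r]
    hits = (top_r == np.asarray(relevant)[:, None]).any(axis=1)
    return float(hits.mean())


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))      # stand-in for pool3 features
print(frechet_distance(real, real + 3.0))
print(inception_score(np.eye(4)))
```

In this toy setting, shifting every feature by a constant changes only the mean term of the Fréchet distance, and perfectly confident predictions spread evenly over four classes give the maximum Inception Score for four classes.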

Models.

Many models have been created to approach the task of Language-to-Image Generation. In Table 130, we present some exemplar architectures (refer to Combined column) that integrate both image and language for addressing the task. We also include a column that showcases the optimization techniques used to train those models.

Results.

In Table 131, Table 132, and Table 133 we present results obtained with a subset of models built using the CUB, Oxford-102, and COCO datasets presented earlier in Section 8.1.2.

8.1.4 Language-to-Image Generation - Discussion

The Language-to-Image Generation task has been evaluated using three different datasets. The CUB and Oxford-102 datasets contain only one visual object per image, while COCO has multiple objects. Several methods based on modified GAN objectives have been proposed for the generation of an image for a given textual description. From Table 131, Table 132, and Table 133, we observe that the recent MirrorGAN (?) achieves the best results for different image resolution types using task-specific measures on CUB and COCO. It is built on the idea of back-translation of the image to text. However, for Oxford-102, StackGAN++ (?) achieves the best result.

8.2 Language-to-Video Generation

In the following, we discuss the setting of Language-to-Video Generation where a video is desired as the visual output from a natural language text description of the scene in video.

8.2.1 Language-to-Video Generation - Introduction

The goal of Language-to-Video generation is to mimic language-to-image generation while also considering the temporal aspect. However, because of the temporal nature of videos, language-to-video generation requires a stronger conditional generator than what is generally required for language-to-image generation. To address this challenge, a conditional generative model that combines variational autoencoders (VAE) (?) with GANs is trained (?) to extract both static and dynamic information from text. Figure 20 shows the usage of a natural language description to generate a video with a text-to-video generation model.

Another novel approach is to generate video from script. The composition, retrieval, and fusion network (Craft) model (?) is capable of learning knowledge from the video-description data and applying it in generating videos from novel captions. It has been shown that the Craft model performs better than the direct pixel generation approaches and generalizes well to unseen captions and to video databases with no text annotations.

8.2.2 Language-to-Video Generation - Datasets

For video generation there are no publicly available datasets. However, ? (?) have collected the Text2Video dataset belonging to ten different categories of YouTube videos, each ranging between 10-400 seconds for language-to-video generation. The categories of videos are biking in snow, playing hockey, jogging, playing soccer, playing football, kite surfing, playing golf, swimming, sailing and water skiing. For the purposes of model evaluation, the dataset is split into training, validation, and test sets in the ratio of 7:1:2 respectively, the details of which can be found in Table 134.
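A 7:1:2 partition of this kind can be produced as in the following minimal sketch. Whether the original split was random, stratified by category, or made at the level of whole videos is not stated here, so the seeded shuffle and the function name are assumptions for illustration only.

```python
import random


def split_7_1_2(items, seed=0):
    """Shuffle items and split them into train/validation/test at 7:1:2."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for reproducibility
    n_train = len(items) * 7 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # the remainder goes to the test set
    return train, val, test


# Hypothetical video identifiers.
train, val, test = split_7_1_2(range(100))
print(len(train), len(val), len(test))
```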

8.2.3 Language-to-Video Generation - Evaluation Measures, Models, and Results

In this section, we review the measures used to evaluate different models of Language-to-Video Generation and the results obtained by them.

Evaluation Measures.

The Language-to-Video Generation models are evaluated based on the Accuracy measure.

Models.

Only a limited set of models have been created so far to handle the task of Language-to-Video Generation. In Table 135, we present an exemplar architecture (refer to Combined column) which integrates video and language to address the task. We also include a column that showcases the optimization technique used to train the model.

Results.

In Table 136, we present results obtained with a subset of models built using the Text2Video dataset presented earlier in Section 8.2.2.

8.2.4 Language-to-Video Generation - Discussion

The task of Language-to-Video Generation is not as well-explored as the Language-to-Image generation task due to its complexity. Results presented in Table 136 show that the approach proposed by ? (?) achieves the best accuracy. The accuracy is calculated using a simple video classifier, namely a five-layer neural network with 3D full convolutions and ReLU activation functions.

9 Vision-and-Language Navigation

In this section, we explore the task of Vision-and-Language Navigation. The goal of this task is to carry out navigation in an environment by interpreting natural language instructions.

9.1 Image-and-Language Navigation

In the following, we provide a detailed description of the Image-and-Language Navigation task in which photorealistic images forming 3D environments are used as visual inputs.

9.1.1 Image-and-Language Navigation - Introduction

Most of the attempts at Vision-and-Language Navigation (VLN) use photorealistic images forming 3D environments. The goal of the Image-and-Language Navigation (ILN) task is to enable an autonomous agent (e.g., robot) to carry out navigation in an environment defined by the photo-realistic image views by means of interpreting natural language instructions (?). This requires the agent/robot to simultaneously process both vision and language inputs and navigate from a source to a target location. Figure 21 shows a visual depiction of the ILN task.

Initially, sequence-to-sequence models were proposed to address the task; among these, the student-forcing approach achieved promising results in previously explored environments. One approach (?) integrated a module to combine model-based and model-free reinforcement learning techniques to better generalize to unseen environments. There is also the reinforced cross-modal matching approach (?), which enforces both local and global cross-modal grounding via reinforcement learning.

ILN can also be viewed as a search on a navigation graph (?) with a progress monitor as a learnable heuristic for search. This is improved by leveraging a visual-textual co-grounding attention mechanism to better align the instructions and visual scenes, and by incorporating a progress monitor to estimate the agent’s current progress towards the goal (?). Another substantial improvement came from training an action space with an embedded speaker model (?). New instructions are synthesized for data augmentation, and pragmatic reasoning is implemented to evaluate how well candidate action sequences explain an instruction. Improving over earlier approaches that make local action decisions or score entire trajectories using beam search, the FAST framework (?) balances local and global signals when exploring the environment, allowing it to act greedily but use global signals to backtrack when necessary. Also, ? (?) explore a generalizable navigational agent by training it in two stages. In the first stage, imitation and reinforcement learning are combined, while in the second stage, fine-tuning is performed via newly-introduced “unseen” triplets.

ILN can also be perceived as a form of visual question answering (see Section 5.1) that requires navigation to answer questions. Embodied Question Answering (?, ?) is explored with an agent that is spawned at a random location in a 3D environment and asked a question. To answer the question, the agent navigates through the 3D environment to find the information that the question asks about. Other attempts used interactive question answering (?) and grounded dialog (?). Another set of approaches (?) aims to map instructions to actions in 3D environments with visual goal prediction. Recently, ? (?) also proposed an interactive learning framework to endow the agent with the ability to ask for users’ help in ambiguous situations.

9.1.2 Image-and-Language Navigation - Datasets

For the image-and-language navigation task, several different datasets have been designed so far. In the following, we present the details of these datasets in separate paragraphs.

Room-2-Room (R2R).

The R2R (https://bringmeaspoon.org) (?) dataset consists of real images of previously unseen building-scale 3D environments from Matterport3D (?). The navigation instructions have been collected with the help of humans using AMT. Table 137 presents splits of the dataset.

ASKNAV.

Similar to R2R, the ASKNAV (https://github.com/debadeepta/vnla) (?) dataset is built on top of Matterport3D (https://niessner.github.io/Matterport). However, the objective differs in that the agent queries the advisor when it is confused and makes progress accordingly. It contains 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. A data point in the dataset consists of a single starting viewpoint, but it has multiple goal viewpoints. Table 138 presents the splits of the dataset.

TOUCHDOWN.

Extending beyond building environments, the TOUCHDOWN (https://github.com/lil-lab/touchdown) (?) dataset is designed for addressing tasks such as executing navigation instructions (Navigation Only) and resolving spatial descriptions (SDR) in real-world environments. SDR is similar to the task of image referring expression (Section 4.1).

The environment includes 29,641 panoramas (360° Google Street View RGB images) and 61,319 edges from New York City. Table 139 has more details about the dataset, while Table 140 presents its splits.

Cooperative Vision-and-Dialog Navigation (CVDN).

CVDN (https://cvdn.dev) (?) is a dataset (https://github.com/mmurray/cvdn/tree/master/tasks/CVDN/data) of embodied, human-human dialogs situated in a simulated, photorealistic home environment. Table 141 presents some statistics about the dataset.

Action Learning From Realistic Environments and Directives (ALFRED).

ALFRED (?) (https://askforalfred.com) is a benchmark and interactive visual dataset for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.

9.1.3 Image-and-Language Navigation - Evaluation Measures, Models, and Results

In this section, we present the evaluation measures, models, and results achieved with various architectures of Image-and-Language Navigation.

Evaluation Measures.

The measures that are designed explicitly for the Image-and-Language Navigation system (e.g., R2R) are:

• Path Length (PL): PL is the total length of the executed trajectory.

• Navigation Error (NE): NE is the shortest-path distance in the navigation graph between the end-location predicted by the follower agent and the true route’s end-location, averaged over all episodes.

• Success Rate (SR): SR is the percentage of predicted end-locations within 3 meters of the true location.

• Oracle Success Rate (OSR): OSR measures the success rate at the closest point to the goal that the agent has visited along the trajectory.

• Success weighted by Path Length (SPL): SPL trades off SR against PL by weighting each success by the ratio of the shortest-path length to the length of the path actually taken.
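The measures above can be made concrete with a small sketch. The toy navigation graph, the episode format, the helper names, and the strict `<` comparison against the 3-meter threshold are all assumptions made for illustration; they are not taken from the R2R evaluation code. SR and OSR are returned as fractions rather than percentages.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbour: edge length in meters}


def make_undirected(edges: List[Tuple[str, str, float]]) -> Graph:
    """Build a symmetric adjacency map from (node, node, length) triples."""
    graph: Graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def shortest_distances(graph: Graph, source: str) -> Dict[str, float]:
    """Dijkstra: shortest-path distance from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


def path_length(graph: Graph, path: List[str]) -> float:
    """Sum of edge lengths along the executed path."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


def evaluate(graph: Graph, episodes, threshold: float = 3.0) -> Dict[str, float]:
    """Average PL, NE, SR, OSR, and SPL over (executed path, goal node) pairs."""
    totals = {"PL": 0.0, "NE": 0.0, "SR": 0.0, "OSR": 0.0, "SPL": 0.0}
    for path, goal in episodes:
        # The graph is undirected, so distances from the goal equal distances to it.
        to_goal = shortest_distances(graph, goal)
        taken = path_length(graph, path)
        shortest = to_goal[path[0]]
        error = to_goal[path[-1]]
        success = 1.0 if error < threshold else 0.0
        oracle = 1.0 if min(to_goal[v] for v in path) < threshold else 0.0
        longest = max(taken, shortest)
        ratio = shortest / longest if longest > 0 else 1.0
        totals["PL"] += taken
        totals["NE"] += error
        totals["SR"] += success
        totals["OSR"] += oracle
        totals["SPL"] += success * ratio
    return {name: value / len(episodes) for name, value in totals.items()}


# Hypothetical toy graph: a corridor A - B - C - D.
graph = make_undirected([("A", "B", 2.0), ("B", "C", 4.0), ("C", "D", 2.0)])
episodes = [
    (["A", "B", "C", "D"], "D"),       # direct route, stops at the goal
    (["A", "B", "C", "B"], "C"),       # passes the goal, then walks away
    (["A", "B", "A", "B", "C"], "C"),  # reaches the goal after a detour
]
metrics = evaluate(graph, episodes)
```

In this toy setting the second episode counts as a failure for SR but as a success for OSR, and the third episode succeeds with an SPL contribution below 1 because of its detour.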

Models.

Many models have been created to approach the task of Image-and-Language Navigation. In Table 143, we present some exemplar architectures (refer to Combined column) which integrate both image and language to address the task. We also include a column that showcases the optimization techniques used to train those models.

Results.

As discussed earlier, several models have been created to approach the task of Image-and-Language Navigation. Furthermore, many datasets have been created to provide variety in the content so that they improve the generalization ability of the models. In this section, we cover the results obtained by the models from a representative dataset for this task. Tables 144–146 present results obtained with a subset of models built and evaluated using the R2R dataset, which was introduced in Section 9.1.2.

9.1.4 Image-and-Language Navigation - Discussion

Image-and-Language Navigation is evaluated with different splits of the R2R validation and test datasets. From Table 144, Table 145, and Table 146, we can observe that Frontier Aware Search with backTracking (FAST)-beam (?) achieves the best result on the task-specific metrics. This approach balances local and global signals while exploring an unobserved environment. It lets the agent act greedily while using global signals to backtrack whenever necessary.
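The idea of acting greedily while keeping the option to backtrack can be illustrated with a generic best-first search over a frontier of partial paths. This is a minimal sketch and not the FAST algorithm itself: the graph, the `score` function (a hypothetical stand-in for a learned global score), and all names are assumptions made for illustration.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple


def frontier_search(
    graph: Dict[str, List[str]],
    start: str,
    is_goal: Callable[[str], bool],
    score: Callable[[List[str]], float],
    max_expansions: int = 50,
) -> Tuple[Optional[List[str]], List[str]]:
    """Expand the highest-scoring partial path on the frontier until a goal is found.

    Returns the path to the goal (or None) and the order in which nodes were expanded.
    """
    # heapq is a min-heap, so scores are negated to pop the best path first.
    frontier = [(-score([start]), [start])]
    visited = set()
    order: List[str] = []
    while frontier and len(order) < max_expansions:
        _, path = heapq.heappop(frontier)
        node = path[-1]
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        if is_goal(node):
            return path, order
        for nxt in graph[node]:
            if nxt not in visited:
                new_path = path + [nxt]
                heapq.heappush(frontier, (-score(new_path), new_path))
    return None, order


# Hypothetical environment: the locally attractive branch (A) is a dead end.
graph = {
    "S": ["A", "B"],
    "A": ["S", "A1"],
    "A1": ["A"],
    "B": ["S", "G"],
    "G": ["B"],
}
node_score = {"S": 0.5, "A": 0.9, "A1": 0.2, "B": 0.6, "G": 0.8}

path, order = frontier_search(
    graph,
    start="S",
    is_goal=lambda node: node == "G",
    score=lambda p: node_score[p[-1]],
)
```

Here the search first follows the high-scoring node A, then returns to the frontier entry for B once the only continuation from A scores worse. Backtracking is implicit in always expanding the best frontier entry rather than only the children of the current node.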

10 Vision-and-Language Pretraining

Inspired by work on pretraining only on vision (?) or solely on language data (?, ?, ?), vision-and-language pretraining seeks to jointly learn representations from both visual and textual content for improving the efficiency of the previously discussed vision and language integration tasks. We discuss several methods for vision-and-language pretraining, whose architectures can be broadly divided into Single-stream and Two-stream. In the following, we provide more details on both types of architectures.

Single-stream Architectures.

These neural architectures are based on BERT-like (?) models that incorporate an Image Embedder, a Text Embedder, and a multi-layer Transformer (?). The proposed models are pretrained on data which in general have parallel multimodal components, i.e., videos or images along with captions. Further, the models are optimized with a combination of different objectives such as vision-based and text-based Masked Language Models (MLM), masked visual-feature modeling, and visual-linguistic matching. Learned representations are then used for different downstream tasks such as multimodal understanding or generation. For example, the VideoBERT (?) architecture has been designed to learn vision-language representations for a generative downstream task like video description generation (see Section 3.1.4). Several other approaches, such as Bounding Boxes in Text Transformer (B2T2) (?), Unicoder-VL (?), VL-BERT (?), and UNITER (?), are designed for multimodal understanding and facilitate downstream tasks. Works such as VLP (?), OSCAR (?), and its extension VinVL (?) have built unified models that can jointly understand and generate from cross-modal data. There is also emerging interest in probing vision-and-language pretrained models (?) to understand the contribution of each modality and to help in designing better model architectures and objectives.
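As a rough illustration of how a single-stream model sees its input, the sketch below concatenates text tokens and image-region placeholders into one sequence and applies a simplified text-side masking step. The special tokens, the region placeholder names, and the function names are hypothetical. BERT-style masking additionally keeps or randomly replaces some selected tokens, which is omitted here.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def build_single_stream_input(
    tokens: List[str], regions: List[str]
) -> Tuple[List[str], List[int]]:
    """Join text tokens and image-region placeholders into one sequence.

    Segment id 0 marks the text part, 1 marks the visual part.
    """
    sequence = ["[CLS]"] + tokens + ["[SEP]"] + regions
    segments = [0] * (len(tokens) + 2) + [1] * len(regions)
    return sequence, segments


def mask_text_tokens(
    sequence: List[str], segments: List[int], prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace each text token with [MASK] with probability `prob`.

    Returns the corrupted sequence and per-position labels
    (the original token where masked, None elsewhere).
    """
    rng = random.Random(seed)
    masked = list(sequence)
    labels: List[Optional[str]] = [None] * len(sequence)
    for i, (token, segment) in enumerate(zip(sequence, segments)):
        if segment == 0 and token not in SPECIAL and rng.random() < prob:
            labels[i] = token
            masked[i] = MASK
    return masked, labels


# Hypothetical caption and detected regions.
tokens = ["a", "dog", "chases", "a", "ball"]
regions = ["<region_0>", "<region_1>", "<region_2>"]
sequence, segments = build_single_stream_input(tokens, regions)
masked, labels = mask_text_tokens(sequence, segments, prob=0.5, seed=1)
```

A model trained with this objective would predict the entries of `labels` from `masked`, attending over both the remaining words and the visual positions in the same Transformer.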

Two-stream Architectures.

In contrast to the single-stream architectures, two-stream architectures adopt two independent encoders for learning visual and text representations. ViLBERT (?) and LXMERT (?) are examples of two-stream architectures which use self-attention principles to jointly learn representations from visual and textual data. ViLBERT builds a co-attentional transformer layer, while LXMERT uses a cross-modality encoder. Similar to single-stream, the two-stream architectures also optimize their models with pretraining tasks, such as MLM and vision-text matching. Sometimes they use additional text-only corpora for achieving better generalization on long and complex sentences.

In Table 147, we summarize both Single-stream and Two-stream architectures by enumerating the vision and language integration tasks they support. It has to be noted that these architectures only use subsets of the datasets from each task. Also, the types of tasks they select are limited and are mostly discriminative. We use (✓) or (✗) to indicate whether they support the task in question or not.
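The cross-modal exchange in a two-stream model can be sketched with plain scaled dot-product attention in which each stream takes its queries from its own modality and its keys and values from the other. This is a simplified illustration under stated assumptions: there are no learned projections, attention heads, or residual connections, and the function names and toy feature vectors are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: Matrix, keys: Matrix, values: Matrix
) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention; returns outputs and attention weights."""
    dim = len(keys[0])
    outputs: Matrix = []
    weights: Matrix = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


def co_attention(text: Matrix, image: Matrix) -> Tuple[Matrix, Matrix]:
    """Each stream queries the other: text attends to image, image attends to text."""
    text_out, _ = cross_attention(text, image, image)
    image_out, _ = cross_attention(image, text, text)
    return text_out, image_out


# Toy features: three text tokens and two image regions, each of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[1.0, 0.0], [0.0, 2.0]]
text_out, image_out = co_attention(text, image)
_, text_to_image_weights = cross_attention(text, image, image)
```

Each stream keeps its own sequence length, which is the main structural difference from the single-stream setting, where both modalities share one sequence and one set of self-attention layers.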

11 Future Directions

The integration of vision and language research has come a long way since the pioneering works, particularly after the adoption of deep learning techniques. Although the performance of current state-of-the-art models still needs to catch up with human abilities, the gap is diminishing at a steady rate. However, there is still ample room for theoretical and algorithmic improvements. Here, we enumerate several possible future directions that have the potential to advance the research overall.

Learning Common Sense and World Knowledge.

There is a vast amount of out-of-domain data available which is unpaired with vision and language task-specific corpora. Leveraging such information as factual, hierarchical, or commonsense knowledge can significantly improve the intelligence of vision and language systems. Prior work has shown that such knowledge, captured by pretrained language models, can assist standalone NLP tasks such as commonsense reasoning (?) and fact prediction (?). It has also shown promise for image caption generation (?, ?) and question answering (?, ?). Extending such ideas to other tasks would be an interesting research direction to pursue. Another possibility could be to utilize images, videos, and text in a synchronous and synergistic manner, as they implicitly encode different aspects of the world. Here, an open question would be how to extract world and common sense knowledge from these sources.

Addressing Large-scale Data Limitations.

Most approaches designed for tasks that integrate vision and language use large datasets for training. With this trend, it will soon become harder to design new tasks without having a large dataset. To avoid such problems, future work will need to be adaptable to datasets of different sizes. Therefore, trade-off approaches are required where we know what amount of data is enough to master a certain task. This requires designing methods which might draw inspiration from neuro-symbolic reasoning systems (?, ?).

Combining Multiple Tasks.

Some tasks can share ideas or representations with one another. For example, visual referring expression comprehension can be viewed as a visual dialog task (?) where a sequence of questions is used to refer to an object in the image. Similarly, image caption generation can be viewed as the visual referring expression generation task (?).

Novel Neural Architectures for Representation.

Up until late 2017, the de facto standards for learning language and visual feature representations were RNNs and CNNs, respectively. However, over the last few years, with the introduction of novel ideas that address the limitations of the aforementioned neural network types, either theoretically or computationally, there is a growing interest in adopting these new techniques. For instance, the Transformer (?) architecture that is used extensively for pure NLP tasks may see adoption for the integration of vision and language tasks. It has already shown its applicability for image caption generation (?). In a similar manner, graph neural networks (?, ?, ?), which were introduced to tackle graph-structured data, have already shown their promise in visual reasoning (?). Exploiting the compositionality of visual objects to describe an entire visual scene with neural modular networks is also an interesting direction to explore for many vision and language tasks.

Image vs Video.

Most of the research into integrating vision and language concentrates on static images. This trend is clearly evident from the array of datasets and methods available for image and language integration tasks. Nevertheless, although video is more complex to handle, it deserves similar attention, and datasets for it are scarce. For instance, there is only one dataset available for each of the tasks Video Dialog (Section 6.2), Video Referring Expression (Section 4.2), Language-to-Video Generation (Section 8.2), and Machine Translation with Videos (Section 7.2), while tasks such as Vision-and-Language Navigation (Section 9) completely lack video-based datasets.

3D-Vision and Language.

The world that we inhabit is inherently 3D. Thinking from this perspective, restricting vision and language research to just 2D, viz. images and videos, might be a hindrance for real world agents, e.g., humanoid robots, to fully understand the complexities of the 3D world and navigate with ease. To avoid such pitfalls, algorithms and techniques need to be developed for processing 3D inputs such as RGB-D, meshes, and point clouds in conjunction with language. Some pioneering works have already begun in this direction (?, ?, ?, ?) and we anticipate the trend (https://language3dscenes.github.io) to shift more towards developing algorithms for understanding as well as the generation of 3D scenes (?), while utilizing language as a main or auxiliary modality.

Automatic Evaluation Measures.

Automatic evaluation measures exist for several vision and language tasks. However, most of them are adaptations from standalone NLP tasks such as machine translation. For example, BLEU and METEOR metrics used for evaluating visual caption generation and storytelling models have been found not to correlate well with human judgements (?). The SPICE metric designed specifically for visual caption generation is dependent on parsing and is, therefore, not adaptable for other tasks such as storytelling. This kind of shortcoming points to a promising research direction: developing evaluation measures applicable to several tasks. Recent attempts such as the BLEURT (?) and BERTScore (?) metrics are promising steps towards this goal. Analogously, language-to-vision generation, although having quantitative measures, is typically dependent on human evaluation. It needs to adopt novel techniques for effective quantitative evaluation. Other tasks such as vision-and-language navigation and visual reasoning have specific measures for evaluation which can be improved further.

12 Conclusion

In undertaking this survey, we provided an overview as well as elaborate details on the recent trends in integration of vision and language research. In the beginning, we started with a background on various tasks in computer vision and NLP. Then, we identified ten distinct prominent tasks that aim to integrate visual and language modalities. To draw connections from traditional research tasks to V&L integration tasks, we presented information about how each integration task is expanded from the standalone computer vision or NLP tasks on which they are originally based. Following that, we reviewed and analyzed each task separately by presenting a comprehensive introduction on how the tasks are designed in a bottom-up manner. Additionally, we presented different state-of-the-art methods used to address the tasks, along with exemplar architectures that are designed to integrate vision and language representations. We also provided a review on relevant datasets, evaluation measures, and the relative performance obtained by several state-of-the-art methods. Finally, in a separate section, we explored the various ways to pretrain generic models with large-scale multimodal data for supporting downstream vision and language integration tasks with minimal fine-tuning efforts. Moreover, we outlined how much the existing pretraining approaches support the ten prominent integration tasks that we described in earlier sections.

Compared with standalone research in computer vision and NLP, the synergy of both, fuelled by advanced machine learning techniques, is expected to yield more intelligent and sustainable systems. Making them easily accessible can, therefore, have direct commercial and societal impact. However, despite the significant progress achieved so far in many integration tasks, large-scale evaluation of those systems shows that they still fall behind human performance by a large margin. This fact confirms that there is still a good deal of room for improvement. In particular, designing novel evaluation measures and architectures that can adequately deal with the complexity of vision and language integration problems has the potential to address some of the challenges. Towards this goal, we outlined a few possible future research directions in Section 11.

We believe that our efforts in publishing this survey will help to systematize future research and to investigate the unsolved problems that are hindering the progress of effective integration of vision and language modalities.

Acknowledgments

This work was supported by the German Research Foundation (DFG) as part of Project-ID 232722074 - SFB1102. We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript. We also acknowledge the insightful comments of Marius Mosbach on the first version of the manuscript.


Bibliography


1. Aafaq, N., Mian, A., Liu, W., Gilani, S. Z., and Shah, M. (2020). Video description: A survey of methods, datasets, and evaluation metrics. ACM Comput. Surv., 52(6), 115:1–115:37.
2. Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., and Guibas, L. (2020). ReferIt3D: Neural listeners for fine-grained 3D object identification in real-world scenes. In 16th European Conference on Computer Vision (ECCV), August 23-28, 2020. Springer.
3. Aditya, S., Saha, R., Yang, Y., and Baral, C. (2019). Spatial knowledge distillation to aid visual reasoning. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 227–235. IEEE.
4. Agrawal, A., Batra, D., Parikh, D., and Kembhavi, A. (2018). Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4971–4980. IEEE Computer Society.
5. Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C. L., Parikh, D., and Batra, D. (2017). VQA: Visual question answering. International Journal of Computer Vision, 123(1), 4–31.
6. Agrawal, H., Anderson, P., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., and Lee, S. (2019). nocaps: novel object captioning at scale. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 8947–8956. IEEE.
7. Agrawal, H., Chandrasekaran, A., Batra, D., Parikh, D., and Bansal, M. (2016). Sort story: Sorting jumbled images and captions into stories. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 925–931.
8. Alamri, H., Cartillier, V., Das, A., Wang, J., Cherian, A., Essa, I., Batra, D., Marks, T. K., Hori, C., Anderson, P., Lee, S., and Parikh, D. (2019a). Audio visual scene-aware dialog. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 7558–7567. Computer Vision Foundation / IEEE.