TL;DR
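BERTGEN is a decoder-only generative model that fuses the multimodal VL-BERT and the multilingual M-BERT and is trained in a multi-task setting for image captioning, machine translation, and multimodal machine translation. It outperforms many strong baselines and is competitive with supervised models in zero-shot language generation.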
Contribution
The paper introduces BERTGEN, a model that fuses VL-BERT and M-BERT for multi-task language generation, and shows that it transfers inductive biases from the pretrained models and generates competitively in zero-shot settings.
Findings
Outperforms strong baselines across tasks
Shows competitive zero-shot language generation
Benefits significantly from multi-task training
Abstract
We present BERTGEN, a novel generative, decoder-only model which extends BERT by fusing multimodal and multilingual pretrained models VL-BERT and M-BERT, respectively. BERTGEN is auto-regressively trained for language generation tasks, namely image captioning, machine translation and multimodal machine translation, under a multitask setting. With a comprehensive set of evaluations, we show that BERTGEN outperforms many strong baselines across the tasks explored. We also show BERTGEN's ability for zero-shot language generation, where it exhibits competitive performance to supervised counterparts. Finally, we conduct ablation studies which demonstrate that BERTGEN substantially benefits from multi-tasking and effectively transfers relevant inductive biases from the pre-trained models.
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Visual-Linguistic BERT · Adam · Linear Warmup With Linear Decay · Residual Connection · WordPiece · Attention Dropout
