On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

Shruti Palaskar; Akshita Bhagia; Yonatan Bisk; Florian Metze; Alan W Black; Ana Marasović

arXiv:2205.11686·cs.CL·October 25, 2022

Open Access

TL;DR

This paper investigates how multimodal models perform on complex text generation tasks that condition on both images and text, finding that no current model is effective across all tasks and that new approaches are needed.

Contribution

The study critically evaluates existing multimodal models for self-rationalization across various tasks, showing their limitations and emphasizing the necessity for novel backbone architectures.

Findings

01

Recent unimodal advances (CLIP image representations and scaling of language models) do not consistently improve multimodal self-rationalization.

02

No single model type outperforms others across all tasks and datasets.

03

Current models are insufficient for general text generation from images beyond captioning.

Abstract

Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation, however, remains elusive. We take a step back and ask: How do these models work for more complex generative tasks, i.e., conditioning on both text and images? Are multimodal models simply visually adapted language models, or do they reason jointly over modalities? We investigate these questions in the context of self-rationalization (jointly generating task labels/answers and free-text explanations) of three tasks: (i) visual question answering in VQA-X, (ii) visual commonsense reasoning in VCR, and (iii) visual-textual entailment in e-SNLI-VE. We show that recent unimodal advances, CLIP image representations and scaling of language models, do not consistently improve self-rationalization in…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training