Alt-Text with Context: Improving Accessibility for Images on Twitter
Nikita Srivatsan, Sofia Samaniego, Omar Florez, Taylor Berg-Kirkpatrick

TL;DR
This paper introduces a multimodal model that generates context-aware alt-text for Twitter images by leveraging both image content and associated tweet text, significantly improving descriptive accuracy.
Contribution
It presents a new dataset and a multimodal approach that outperforms previous methods by effectively combining visual and textual social media context.
Findings
Our model doubles BLEU@4 scores compared to prior work.
Leveraging tweet text improves alt-text relevance and accuracy.
The dataset enables robust evaluation of social media image captioning.
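For readers unfamiliar with the BLEU@4 metric cited above, the following is a minimal sentence-level sketch, assuming whitespace tokenization, a single reference, and no smoothing. The function names `ngrams` and `bleu4` are illustrative. The paper's own evaluation script is not shown here and may differ (for instance, corpus-level aggregation or a different tokenizer).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU with uniform weights over 1- to 4-grams.

    Assumption: one reference and whitespace tokenization (illustrative only).
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precision = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precision += 0.25 * math.log(matched / total)

    # Brevity penalty discourages candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


reference = "a dog sitting on a red couch"
score_exact = bleu4("a dog sitting on a red couch", reference)
score_short = bleu4("a dog sitting on a", reference)
score_miss = bleu4("two people walking outside today", reference)
```

Because the score is a geometric mean of n-gram precisions, a caption must match the reference on four-word sequences to score well, which is why "doubling" BLEU@4 reflects longer exact overlaps rather than isolated matching words.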
Abstract
In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. More than just a special case of image captioning, alt-text is both more literally descriptive and context-specific. Also critically, images posted to Twitter are often accompanied by user-written text that despite not necessarily describing the image may provide useful context that if properly leveraged can be informative. We address this task with a multimodal model that conditions on both textual information from the associated social media post as well as visual signal from the image, and demonstrate that the utility of these two information sources stacks. We put forward a new dataset of 371k images paired with alt-text and tweets scraped from Twitter and evaluate on it across a variety of automated metrics as well as human…
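The idea of conditioning a decoder on both the image and the tweet text can be sketched as building one input sequence from two sources. The example below is a toy illustration in plain Python, not the authors' implementation: it assumes an image feature vector (such as one produced by an image encoder) is linearly projected into a few "prefix" vectors, which are then concatenated with embeddings of the tweet tokens. All names, dimensions, and the random weights are hypothetical.

```python
import random


def linear(vec, weights):
    """Multiply a vector of length n by an n x m weight matrix."""
    n_out = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(n_out)]


def image_to_prefix(image_feat, weights, prefix_len, d_model):
    """Project one image feature vector into prefix_len vectors of size d_model."""
    flat = linear(image_feat, weights)  # length prefix_len * d_model
    return [flat[i * d_model:(i + 1) * d_model] for i in range(prefix_len)]


def build_decoder_input(image_feat, tweet_ids, weights, embed_table,
                        prefix_len, d_model):
    """Concatenate the image prefix with tweet token embeddings.

    A language-model decoder would consume this sequence and then generate
    the alt-text tokens that follow it.
    """
    prefix = image_to_prefix(image_feat, weights, prefix_len, d_model)
    tweet = [embed_table[t] for t in tweet_ids]
    return prefix + tweet


# Hypothetical toy sizes; real models use far larger dimensions.
D_IMG, D_MODEL, PREFIX_LEN, VOCAB = 8, 4, 2, 10
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(PREFIX_LEN * D_MODEL)]
           for _ in range(D_IMG)]
embed_table = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
image_feat = [rng.uniform(-1, 1) for _ in range(D_IMG)]
tweet_ids = [3, 1, 4]  # stand-in token ids for the tweet text

seq = build_decoder_input(image_feat, tweet_ids, weights, embed_table,
                          PREFIX_LEN, D_MODEL)
print(len(seq), len(seq[0]))
```

Placing both sources in a single sequence lets the decoder attend to image and tweet information at every generation step, which is one way the two signals could "stack" as the abstract describes.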
Peer Reviews
Decision: ICLR 2024 poster
1. The collected dataset is important and useful. The data preprocessing ensures its usability. 2. The research problem raised in this paper is important.
1. The novelty of the proposed method is really limited except for the tweet-text-based reranking. 2. The experiments are not very extensive. For example, Table 1 suggests that the tweet-based reranking is the most important component, but the authors did not try to incorporate the reranking into the baselines, which is not a fair comparison.
The author points out that in social media, images are often posted in addition to textual information, but there is little information describing the images, and in such situations, information about the images is not conveyed by text-to-speech software for the visually impaired, for example. I agree with this point and understand its importance as a study. As for the proposed method, its basic structure consists of encoding images using CLIP and generating text using GPT-2. This structure its
As mentioned in the "Strengths" section, the focus on "alt-text" is highly evaluated, but there is room for improvement in that "alt-text" is not clearly defined in the paper. In the evaluation dataset, the "alt-text" entered by twitter users is used as the correct answer, but it is written as "alt-text captions on Twitter are written by untrained users, they can be noisy, inconsistent in form and specificity, and occasionally do not even describe the image contents" in the paper, and it seems
First of all, I would like to applaud the authors for working on this important and timely problem. I believe that this research is very important and can have the potential to improve the lives and online experience of many people with visual impairments. Overall, I believe that this research focuses on an important problem, and there is potential for a big impact. Second, the paper collects a large-scale dataset of images and user-generated accessibility captions from Twitter; this dataset is
I have several concerns with the paper, mainly related to the lack of gold standards for accessibility captions, the lack of important and adequate methodological details, the paper’s evaluation, the paper’s approach to releasing data, and the paper’s ethical considerations. First, there is a disconnection between the paper’s motivation and how the paper evaluates the performance of the proposed method. I agree with the paper’s motivation that the user-generated accessibility captions are of qu
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Contrastive Language-Image Pre-training
