Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset
Gabriel Bromonschenkel, Alessandro L. Koerich, Thiago M. Paixão, Hilário Tomaz Alves de Oliveira

TL;DR
This study evaluates Transformer-based vision and language models for Brazilian Portuguese image captioning using native and translated datasets, highlighting model performance, biases, and the impact of translation on model generalization.
Contribution
It introduces a cross-native-translated dataset evaluation for Brazilian Portuguese image captioning and compares multiple models, including a new pre-trained VLM, revealing insights into translation effects and biases.
Findings
Swin-DistilBERTimbau outperforms other models in cross-dataset generalization.
ViTucano surpasses larger multilingual models in traditional metrics.
GPT-4 achieves the highest CLIP-Score, indicating strong image-text alignment.
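
For readers unfamiliar with the metric in the last finding, here is a minimal sketch of how a CLIP-Score is commonly computed: a rescaled cosine similarity between an image embedding and a caption embedding, clipped at zero. The weight w = 2.5 follows the usual CLIPScore definition. The embeddings below are hypothetical toy vectors, not outputs of a real CLIP encoder, and the paper's exact encoder and settings are not stated in this summary.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no alignment
    return dot / (norm_a * norm_b)

def clip_score(image_emb, caption_emb, w=2.5):
    """CLIP-Score sketch: w * max(cos(image, caption), 0)."""
    return w * max(cosine_similarity(image_emb, caption_emb), 0.0)

# Hypothetical toy embeddings (assumption: real ones come from a CLIP encoder).
image_emb = [0.2, 0.1, 0.7]
good_caption_emb = [0.25, 0.05, 0.65]   # points in nearly the same direction
bad_caption_emb = [-0.6, 0.3, -0.1]     # points away from the image embedding

print(clip_score(image_emb, good_caption_emb))
print(clip_score(image_emb, bad_caption_emb))
```

Because the score depends only on image-text similarity and not on reference captions, a model can rank highest on it while scoring lower on reference-based metrics such as BLEU or CIDEr, which is consistent with the split between the last two findings.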
Abstract
Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprised of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include…
Taxonomy
Topics: Multimodal Machine Learning Applications · Text Readability and Simplification · Subtitles and Audiovisual Media
