Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning

Siddharth Betala; Ishan Chokshi

arXiv:2409.15052 · cs.CL · November 11, 2025

TL;DR

This paper introduces a cross-lingual image captioning method that prompts multi-modal large language models to generate contextual conversations about images, then uses those conversations to produce target-language captions without traditional training or fine-tuning.

Contribution

The authors propose leveraging instruction-tuned prompting of LLMs to create synthetic conversations for cross-lingual captioning, avoiding traditional training or fine-tuning.

Findings

01

Achieved 37.90 BLEU on the English-Hindi challenge set.

02

Ranked first and second for English-Hausa on the leaderboards.

03

Explored trade-offs between BLEU scores and semantic similarity.

Abstract

In this paper, we describe our system under the team name Brotherhood for the English-to-Lowres Multi-Modal Translation Task. We participate in the multi-modal translation tasks for English-Hindi, English-Hausa, English-Bengali, and English-Malayalam language pairs. We present a method leveraging multi-modal Large Language Models (LLMs), specifically GPT-4o and Claude 3.5 Sonnet, to enhance cross-lingual image captioning without traditional training or fine-tuning. Our approach utilizes instruction-tuned prompting to generate rich, contextual conversations about cropped images, using their English captions as additional context. These synthetic conversations are then translated into the target languages. Finally, we employ a weighted prompting strategy, balancing the original English caption with the translated conversation to generate captions in the target language. This method…
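The weighted prompting step described in the abstract can be illustrated with a minimal sketch. It only shows one plausible way to assemble a prompt that balances the English caption against the translated conversation; the prompt wording, the `CaptionRequest` type, the `build_weighted_prompt` function, and the `caption_weight` parameter are all assumptions for illustration and are not taken from the paper. The resulting string would still need to be sent, together with the cropped image, to a multi-modal LLM.

```python
from dataclasses import dataclass


@dataclass
class CaptionRequest:
    """Inputs for one caption; all field names are hypothetical."""
    english_caption: str
    translated_conversation: str
    target_language: str
    caption_weight: float  # assumed knob in [0, 1]; not from the paper


def build_weighted_prompt(req: CaptionRequest) -> str:
    """Assemble a prompt that states how much to rely on each context source."""
    if not 0.0 <= req.caption_weight <= 1.0:
        raise ValueError("caption_weight must be between 0 and 1")
    # The conversation receives whatever weight the caption does not.
    conversation_weight = 1.0 - req.caption_weight
    return (
        f"Write a caption in {req.target_language} for the cropped image region.\n"
        f"Give {req.caption_weight:.0%} weight to the original English caption "
        f"and {conversation_weight:.0%} weight to the translated conversation.\n\n"
        f"English caption: {req.english_caption}\n"
        f"Translated conversation ({req.target_language}):\n"
        f"{req.translated_conversation}\n"
    )
```

Expressing the balance as a single weight keeps the two context sources complementary, which makes it straightforward to sweep several weighting schemes and compare the resulting captions.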

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training