A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation

Siddharth Betala; Kushan Raj; Vipul Betala; Rohan Saswade

arXiv:2511.07010 · cs.CL · November 11, 2025

TL;DR

This paper introduces a multimodal judge-corrector system that uses vision and language models to automatically detect and fix errors in training data for English-to-Indic machine translation, improving translation quality.

Contribution

The paper presents a novel vision-guided judge-corrector pipeline for data correction and demonstrates that fine-tuning with corrected data enhances translation performance across multiple Indic languages.

Findings

01. Automated correction improved caption quality by 17.1% on average.

02. Fine-tuning on corrected data increased BLEU scores in multiple language pairs.

03. The system effectively identifies and rectifies translation errors using multimodal models.

Abstract

In this paper, we describe our system under the team name BLEU Monday for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies translations into three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates…
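The routing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Verdict`, `Example`, and `clean_dataset` are hypothetical, and the judge and correctors are passed in as plain callables because the abstract shown here is truncated before it says which model handles which category.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class Verdict(Enum):
    """The three judge categories named in the abstract."""
    CORRECT = "correct"
    VISUALLY_AMBIGUOUS = "visually_ambiguous"  # needs image context
    MISTRANSLATED = "mistranslated"            # poor translation quality


@dataclass
class Example:
    """One training pair (hypothetical schema, assumed for illustration)."""
    source: str
    translation: str
    image_path: Optional[str] = None


# Both roles are assumed to be model-backed in practice; here they are
# ordinary callables so the routing logic can be read in isolation.
Judge = Callable[[Example], Verdict]
Corrector = Callable[[Example], str]


def clean_dataset(
    examples: List[Example],
    judge: Judge,
    correctors: Dict[Verdict, Corrector],
) -> List[Example]:
    """Judge every example and route flagged ones to the matching corrector."""
    cleaned = []
    for ex in examples:
        verdict = judge(ex)
        corrector = correctors.get(verdict)
        if corrector is None:
            # No corrector registered for this verdict (e.g. CORRECT): keep as-is.
            cleaned.append(ex)
        else:
            # Build a new record so the original data is left unmodified.
            cleaned.append(Example(ex.source, corrector(ex), ex.image_path))
    return cleaned
```

Under these assumptions, the corrected list returned by `clean_dataset` would then serve as the training set for the fine-tuning stage, with one corrector registered per error category.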

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis