MJ1: Multimodal Judgment via Grounded Verification

Bhavesh Kumar; Dylan Feng; Leonard Tang

arXiv:2603.07990·cs.LG·March 25, 2026

TL;DR

This paper introduces MJ1, a reinforcement learning-based multimodal judge that improves visual grounding and decision accuracy in multimodal tasks without increasing model size, outperforming larger models.

Contribution

MJ1 is the first to integrate a structured grounded verification chain and counterfactual consistency reward into multimodal judgment, enhancing accuracy and grounding.

Findings

01

Even without training, the grounded verification mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 points on Multimodal Reasoning.

02

After training, MJ1, with only 3B active parameters, reaches 77.0% accuracy on MMRB2 and surpasses far larger models such as Gemini-3-Pro.

03

Grounded verification and consistency-based training substantially improve multimodal judgment without increasing model scale.

Abstract

Multimodal judges struggle to ground decisions in visual evidence. We present MJ1, a multimodal judge trained with reinforcement learning that enforces visual grounding through a structured grounded verification chain (observations $\to$ claims $\to$ verification $\to$ evaluation $\to$ scoring) and a counterfactual consistency reward that penalizes position bias. Even without training, our mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 on Multimodal Reasoning. After training, MJ1, with only 3B active parameters, achieves 77.0% accuracy on MMRB2 and surpasses orders-of-magnitude larger models like Gemini-3-Pro. These results show that grounded verification and consistency-based training substantially improve multimodal judgment without increasing model scale.
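
The abstract does not spell out the exact form of the counterfactual consistency reward, so the following is only a minimal sketch of one way such a reward could penalize position bias: the judge is queried twice, once with the two candidate responses swapped, and the reward depends on whether the verdict follows the content rather than the slot. The names `Judge` and `consistency_reward`, and the `bonus` and `penalty` weights, are hypothetical and not taken from the paper.

```python
from typing import Callable

# Hypothetical judge interface: given two candidate responses, return "A" if
# the response in the first slot is preferred, otherwise "B".
Judge = Callable[[str, str], str]


def consistency_reward(
    judge: Judge,
    first: str,
    second: str,
    gold: str,
    bonus: float = 1.0,
    penalty: float = 1.0,
) -> float:
    """Illustrative reward: correctness plus a position-consistency term.

    `gold` is "A" if `first` is the better response, otherwise "B".
    The weights are assumptions chosen for illustration only.
    """
    original = judge(first, second)
    swapped = judge(second, first)

    # Translate the swapped-order verdict back into the original labelling:
    # picking slot "B" after the swap means picking `first`, i.e. "A".
    swapped_in_original_labels = "A" if swapped == "B" else "B"

    correct = 1.0 if original == gold else 0.0
    consistent = original == swapped_in_original_labels

    # A judge that always favours one slot flips its effective choice when
    # the order is swapped, so it is penalized here.
    return correct + (bonus if consistent else -penalty)


if __name__ == "__main__":
    # A position-biased judge that always prefers the first slot.
    biased: Judge = lambda a, b: "A"
    # A content-based toy judge that prefers the longer response.
    by_length: Judge = lambda a, b: "A" if len(a) >= len(b) else "B"

    print(consistency_reward(biased, "a detailed answer", "ok", gold="A"))
    print(consistency_reward(by_length, "a detailed answer", "ok", gold="A"))
```

In this sketch a position-biased judge loses the consistency term even when its first verdict happens to be correct, which is the behaviour a consistency-based training signal is meant to discourage.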

Code & Models

Models

🤗 haizelabs/mj1 (model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning