MJ1: Multimodal Judgment via Grounded Verification
Bhavesh Kumar, Dylan Feng, Leonard Tang

TL;DR
This paper introduces MJ1, a reinforcement learning-based multimodal judge that improves visual grounding and decision accuracy in multimodal tasks without increasing model size, outperforming larger models.
Contribution
MJ1 integrates a structured grounded verification chain and a counterfactual consistency reward into multimodal judgment, improving both accuracy and visual grounding.
Findings
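Even without training, the grounded verification mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 points on Multimodal Reasoning.
After training, MJ1 reaches 77.0% accuracy on MMRB2 with only 3B active parameters, surpassing much larger models such as Gemini-3-Pro.
Grounded verification and consistency-based training improve multimodal judgment without increasing model scale.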
Abstract
Multimodal judges struggle to ground decisions in visual evidence. We present MJ1, a multimodal judge trained with reinforcement learning that enforces visual grounding through a structured grounded verification chain (observations → claims → verification → evaluation → scoring) and a counterfactual consistency reward that penalizes position bias. Even without training, our mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 on Multimodal Reasoning. After training, MJ1, with only 3B active parameters, achieves 77.0% accuracy on MMRB2 and surpasses orders-of-magnitude larger models like Gemini-3-Pro. These results show that grounded verification and consistency-based training substantially improve multimodal judgment without increasing model scale.
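To make the idea of a counterfactual consistency reward concrete, here is a minimal Python sketch. It is not the paper's formulation, which the abstract does not spell out. It assumes a pairwise judge that sees two candidate responses and returns "first" or "second". The function name `consistency_reward`, the reward values, and the two toy judges are all hypothetical, chosen only for illustration. The judge is queried on both orderings of the pair, and a verdict that flips when the order is swapped is treated as position bias and penalized.

```python
from typing import Callable

# Hypothetical judge interface (an assumption, not from the paper):
# (prompt, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def consistency_reward(
    judge: Judge,
    prompt: str,
    resp_a: str,
    resp_b: str,
    preferred: str,        # ground-truth label: "a" or "b"
    bonus: float = 1.0,    # illustrative value for a consistent, correct verdict
    penalty: float = -1.0, # illustrative value for an order-dependent verdict
) -> float:
    """Score a judge on one pair by querying both presentation orders."""
    original = judge(prompt, resp_a, resp_b)  # A shown first
    swapped = judge(prompt, resp_b, resp_a)   # B shown first (counterfactual)

    # Map positional verdicts back to the underlying responses.
    pick_original = "a" if original == "first" else "b"
    pick_swapped = "b" if swapped == "first" else "a"

    if pick_original != pick_swapped:
        # The verdict followed the position rather than the content.
        return penalty
    return bonus if pick_original == preferred else 0.0


# Two toy judges for demonstration only.
def always_first(prompt: str, first: str, second: str) -> str:
    """A fully position-biased judge."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """A content-based (if naive) judge: picks the longer response."""
    return "first" if len(first) >= len(second) else "second"


biased_score = consistency_reward(
    always_first, "q", "short", "a longer answer", preferred="b"
)
content_score = consistency_reward(
    prefers_longer, "q", "short", "a longer answer", preferred="b"
)
```

In this sketch the position-biased judge receives the penalty regardless of the label, because its choice changes when the order changes, while the content-based judge receives the bonus only when its order-invariant choice matches the label.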
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
