TL;DR
Flex-Judge introduces a reasoning-guided approach that uses minimal textual data to create a versatile, cost-effective multimodal evaluator capable of generalizing across diverse tasks and modalities.
Contribution
The paper proposes a reasoning-based framework in which a single judge model, trained on a small amount of textual reasoning data, generalizes across multiple modalities and evaluation formats.
Findings
Achieves competitive performance with fewer training resources.
Outperforms some commercial multimodal evaluators.
Proves effective in resource-scarce domains such as molecular evaluation.
Abstract
Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite…
