Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking
Qinwu Xu, Zhuoheng Li, Jessie Salas

TL;DR
This paper introduces a robust, multi-stage framework for checkpoint selection in multimodal large language models, emphasizing uncertainty estimation and data quality to improve evaluation reliability.
Contribution
The paper proposes a decision-oriented evaluation system that combines curated real-world data, structured LLM-based judgment, and multi-stage ranking (pointwise filtering, listwise ranking, pairwise comparison) with subsampling-based confidence estimation.
Findings
Enhanced checkpoint selection accuracy in OCR-heavy scenarios.
Improved robustness against evaluation noise and marginal performance differences.
Demonstrated importance of data quality, especially OCR readability, for evaluation validity.
Abstract
Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures…
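The subsampling-based confidence estimation mentioned in the abstract can be illustrated with a minimal sketch. The abstract is truncated and does not give the authors' formulation, so everything here is an assumption made for illustration: the function name subsample_win_rate, its parameters, the use of paired per-example scores, and the rule of comparing mean scores on each subsample.

```python
import random
from statistics import mean


def subsample_win_rate(scores_a, scores_b, n_rounds=1000, frac=0.5, seed=0):
    """Fraction of random subsamples in which checkpoint A's mean beats B's.

    Hypothetical helper, not the paper's method: scores_a[i] and scores_b[i]
    are assumed to be judge scores for the same evaluation example i.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per example")
    rng = random.Random(seed)          # fixed seed for reproducible estimates
    n = len(scores_a)
    k = max(1, int(n * frac))          # subsample size
    wins = 0
    for _ in range(n_rounds):
        idx = rng.sample(range(n), k)  # draw k examples without replacement
        if mean(scores_a[i] for i in idx) > mean(scores_b[i] for i in idx):
            wins += 1
    return wins / n_rounds


# Illustrative, made-up scores for two checkpoints on four examples.
scores_a = [0.9, 0.8, 0.7, 0.95]
scores_b = [0.5, 0.4, 0.3, 0.6]
confidence_a_better = subsample_win_rate(scores_a, scores_b)
```

Under these assumptions, a win rate near 1.0 or 0.0 suggests the ordering of the two checkpoints is stable across subsamples, while a value near 0.5 suggests the difference is within evaluation noise and the pair may need further comparison.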
