TL;DR
MetaFine is a diagnostic framework that dissects fine-grained manipulation skills into understanding, perception, and controlled behavior, revealing hidden model weaknesses and guiding targeted improvements.
Contribution
The paper introduces MetaFine, a novel diagnostic evaluation framework that reconstructs heterogeneous benchmarks into diagnostic scenarios, exposing model failures and guiding targeted enhancements.
Findings
Visual encoder quality is a key bottleneck for fine-grained manipulation.
MetaFine reveals severe dimension-specific failures in state-of-the-art models.
Hybrid real-sim validation improves physical benchmarking stability.
Abstract
Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
