Image Quality Assessment for Embodied AI
Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai

TL;DR
This paper introduces a new Image Quality Assessment (IQA) framework tailored for Embodied AI, addressing the gap in evaluating image usability for robotic tasks under real-world distortions.
Contribution
It constructs a comprehensive perception-cognition-decision-execution pipeline, creates a large Embodied-IQA database, and evaluates existing IQA methods for embodied applications.
Findings
Mainstream IQA methods perform poorly on Embodied-IQA.
The Embodied-IQA database contains over 36k image pairs with 5 million annotations.
Specialized IQA methods need to be developed for Embodied AI.
Abstract
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the real world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) constructed a perception-cognition-decision-execution pipeline based on the Mertonian system and meta-cognitive theory, and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs, with more than 5M fine-grained annotations…
Peer Reviews
Decision: ICLR 2026 Poster
- The idea of IQA for machines vs humans is established in prior works. This paper reframes the problem for robotics, the stage of how degradation of images affects robot task execution, not just visual recognition.
- The paper was interesting to read, with extensive experiments and detailed analysis.
- The Embodied-IQA database is very large, containing 36,900 image pairs and over 5 million fine-grained annotations, having good scale. It is also annotated along three unique axes, reflecting dif
- The pipeline assumes vision is the dominant modality, neglecting that in true Embodied AI, perception often must fuse audio, tactile, and temperature cues. Is it possible that temperature might play a role in how bright the image might be?
- The writing is often verbose and repeats technical claims across sections, particularly regarding pipeline design and dataset composition.
1. The paper presents a clear motivation and significant innovation. It defines the image quality assessment (IQA) problem in embodied intelligence as “image usability for robots,” transcending traditional frameworks based on human or machine vision systems (HVS/MVS). It innovatively models the robot's “decision-making” and “execution” phases explicitly.
2. The research exhibits high quality. First, its constructed dataset is exemplary in scale (36k+ images, 5m+ labels), breadth (30 distortion t
This study exhibits several critical weaknesses.
1. Methods that evaluate differences based on metrics may inherit and amplify inherent biases and errors within the model and its assessment indicators. Specifically, using metrics like BLEU/ROUGE to measure cognitive comprehension is highly sensitive to phrasing and redundancy, potentially failing to accurately reflect task equivalence. Adopting structured patterns (e.g., action-parameter tuples) or task success classifiers may be more robust alt
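The reviewer's concern about phrasing sensitivity can be illustrated with a small sketch. It is not the paper's evaluation code: `unigram_f1` is a simplified token-overlap score standing in for BLEU/ROUGE, and the action-parameter tuples are hypothetical, written by hand here as if produced by an upstream parser that is not shown.

```python
from collections import Counter
from typing import Tuple

def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified token-overlap score (a stand-in for BLEU/ROUGE, not either metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical structured form: (action, target object, attribute).
ActionTuple = Tuple[str, str, str]

def same_action(a: ActionTuple, b: ActionTuple) -> bool:
    """Two plans are task-equivalent here if their structured tuples match exactly."""
    return a == b

# Two descriptions of the same task with different wording.
reference_text = "pick up the red cup"
paraphrase_text = "grasp the crimson mug"

# Assumed output of a (hypothetical) parser that normalizes synonyms.
reference_action: ActionTuple = ("pick", "cup", "red")
paraphrase_action: ActionTuple = ("pick", "cup", "red")

text_score = unigram_f1(paraphrase_text, reference_text)
tuples_match = same_action(reference_action, paraphrase_action)

print(f"token-overlap score: {text_score:.2f}")
print(f"structured match:    {tuples_match}")
```

In this toy case the two sentences share only the word "the", so the overlap score is low even though the intended action is the same, while the structured comparison treats them as equivalent. How such tuples would be extracted reliably is left open.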
1. The paper identifies and clearly defines a completely new, critical, and timely research problem: assessing image quality for Embodied AI. Its theoretical framework based on the "Mertonian system" to differentiate RVS, MVS, and HVS is highly novel and persuasive, laying a solid theoretical foundation for this new field.
2. The paper's main contribution, the Embodied-IQA dataset, is an extremely valuable resource. Its scale and granularity are unprecedented in the IQA field. This dataset will l
1. The paper defines the VLA "Decision" score as a simple average of errors in three dimensions: Position, Rotation, and State. This metric seems overly simplistic. In real robotics tasks, the importance of these three dimensions can be highly imbalanced (e.g., a minor rotation error could cause catastrophic failure, while a larger position error might still be acceptable).
2. As a benchmark paper, its primary duty is to define the problem and provide data, which it does exceptionally well. Howe
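The first point above can be made concrete with a minimal sketch. The paper's exact formula is not given on this page, so this follows only the reviewer's description (a simple average of Position, Rotation, and State errors). It assumes each error is already normalized to [0, 1]; the `ActionError` type, the function names, and the weights are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ActionError:
    """Per-dimension errors, assumed normalized to [0, 1] (an assumption, not from the paper)."""
    position: float
    rotation: float
    state: float

def mean_decision_error(err: ActionError) -> float:
    """Unweighted average of the three errors, as the review describes the metric."""
    return (err.position + err.rotation + err.state) / 3.0

def weighted_decision_error(err: ActionError, w_pos: float, w_rot: float, w_state: float) -> float:
    """Weighted average; the weights are hypothetical task-dependent importance values."""
    total = w_pos + w_rot + w_state
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_pos * err.position + w_rot * err.rotation + w_state * err.state) / total

# A case with only a rotation error.
err = ActionError(position=0.0, rotation=0.3, state=0.0)

uniform = mean_decision_error(err)                     # about 0.10
rot_heavy = weighted_decision_error(err, 1.0, 4.0, 1.0)  # about 0.20

print(f"uniform average:  {uniform:.2f}")
print(f"rotation-weighted: {rot_heavy:.2f}")
```

With uniform weights the rotation error is diluted to a third of its value; weighting rotation four times as heavily doubles the resulting score. The specific weights here are illustrative only, and choosing them per task is exactly the open question the reviewer raises.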
Taxonomy
Topics: Cell Image Analysis Techniques · Explainable Artificial Intelligence (XAI) · AI in cancer detection
