Data Assessment for Embodied Intelligence
Jiahao Xiao, Bowen Yan, Jianbo Zhang, Jia Wang, Chunyi Li, Zhengxue Cheng, Guangtao Zhai

TL;DR
This paper introduces new data-driven tools to evaluate and improve the diversity and learnability of embodied intelligence datasets, addressing key challenges in dataset assessment.
Contribution
The paper presents a unified multimodal representation and two novel algorithms for quantifying dataset diversity and learnability without training models.
Findings
The diversity entropy effectively measures dataset information content.
The learnability algorithm provides immediate, interpretable assessments.
Validated on real-world datasets, the approach offers actionable insights.
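To make the idea of a kernel-based diversity entropy more concrete, here is a minimal sketch of a generic resubstitution entropy estimate built on a Gaussian Parzen window. The reviews on this page note that the paper's diversity entropy relies on Parzen window estimation, but this is not the authors' algorithm: the function name `parzen_entropy`, the Gaussian kernel, the default bandwidth, and the random vectors standing in for dataset embeddings are all assumptions made purely for illustration.

```python
import numpy as np


def parzen_entropy(embeddings, bandwidth=1.0):
    """Resubstitution entropy estimate using a Gaussian Parzen window.

    Hypothetical helper for illustration only; not the paper's metric.
    """
    x = np.asarray(embeddings, dtype=float)
    n, d = x.shape
    # Pairwise squared Euclidean distances, shape (n, n).
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    # Log of the isotropic Gaussian kernel, including its normalising constant.
    log_kernel = (-sq_dists / (2.0 * bandwidth**2)
                  - 0.5 * d * np.log(2.0 * np.pi * bandwidth**2))
    # log p_hat(x_i) = log( (1/n) * sum_j K(x_i - x_j) ), via a stable log-sum-exp.
    row_max = log_kernel.max(axis=1, keepdims=True)
    log_density = (row_max[:, 0]
                   + np.log(np.exp(log_kernel - row_max).sum(axis=1))
                   - np.log(n))
    # Entropy estimate: negative mean log-density over the samples.
    return float(-log_density.mean())


# Assumption: random vectors stand in for per-sample dataset embeddings.
rng = np.random.default_rng(0)
tight = rng.normal(scale=0.1, size=(200, 8))   # low-diversity stand-in
spread = rng.normal(scale=2.0, size=(200, 8))  # high-diversity stand-in

print("tight  :", parzen_entropy(tight))
print("spread :", parzen_entropy(spread))
print("tight, bandwidth=2.0 :", parzen_entropy(tight, bandwidth=2.0))
```

In this sketch the spread-out set scores a higher entropy than the tightly clustered one, and changing `bandwidth` alone shifts the estimate noticeably, which is the kind of sensitivity one reviewer asks the authors to analyze.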
Abstract
In embodied intelligence, datasets play a pivotal role, serving as both a knowledge repository and a conduit for information transfer. The two most critical attributes of a dataset are the amount of information it provides and how easily this information can be learned by models. However, the multimodal nature of embodied data makes evaluating these properties particularly challenging. Prior work has largely focused on diversity, typically counting tasks and scenes or evaluating isolated modalities, which fails to provide a comprehensive picture of dataset diversity. On the other hand, the learnability of datasets has received little attention and is usually assessed post-hoc through model training, an expensive, time-consuming process that also lacks interpretability, offering little guidance on how to improve a dataset. In this work, we address both challenges by introducing two…
Peer Reviews
Decision: Submitted to ICLR 2026
- The authors provide the first quantitative and interpretable metrics for assessing embodied dataset diversity and learnability without model retraining.
- They demonstrate strong empirical validation across diverse datasets, suggesting robustness.
- The reliance on CLIP as a universal multimodal encoder limits generality; alternative embeddings (e.g., OpenVLA latent space) could yield different results.
- The proposed metrics are heuristic approximations, not theoretically guaranteed proxies for model learnability.
- Real-world validations are limited (two UR5 datasets); broader experimental diversity would strengthen claims.
- Diversity entropy depends heavily on bandwidth and kernel choice, yet sensitivity analysis is missing.
- The lea
- While data valuation is an established area, I really like the authors' systematic approach of dividing valuation into diversity and learnability and analyzing the two holistically to give a better picture of data quality. The topic of data valuation is also relevant to building VLA foundation models, as VLA training data is often noisy and focuses on a narrow range of tasks.
- The diversity entropy based on Parzen window estimation and the learnability factors are rigorously de
- From my understanding, the contribution is primarily the development of a new data valuation method. If this is the case, the authors should compare the method to other data valuation methods in the related work section. Currently, the related work section only covers embodied datasets and VLA models, which does little to clarify the contribution of the work.
- While the abstract claims the model uses a "unified multimodal representation", this representation is in fact just video frames,
This work introduces a means of evaluating the quality of the datasets that are central to learning-based methods for embodied agents. This is crucial, since agent performance depends on these datasets. I believe this work will provide insights for future work.
Overall, the paper is well written and easy to understand. I didn't find any significant weaknesses. However, I have a few clarifying questions listed below.
Taxonomy
Topics: Action Observation and Synchronization · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
