Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning
Carl Qi, Xiaojie Wang, Silong Yong, Stephen Sheng, Huitan Mao, Sriram Srinivasan, Manikantan Nambi, Amy Zhang, Yesh Dattatreya

TL;DR
This paper introduces ARMOR, a multi-task self-refinement model that improves robotic failure detection and reasoning by leveraging heterogeneous supervision and iterative prediction, achieving state-of-the-art results in diverse environments.
Contribution
The paper presents ARMOR, a novel adaptive multi-task model that enhances robotic failure detection and reasoning through iterative self-refinement and heterogeneous supervision learning.
Findings
Up to 30% improvement in failure detection rate
Up to 100% improvement in reasoning accuracy
Robust performance across diverse environments
Abstract
Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process, where the model iteratively predicts detection outcomes and natural language reasoning conditioned on past outputs. During training, ARMOR learns from heterogeneous supervision - large-scale sparse binary labels and small-scale rich reasoning annotations - optimized via a combination of offline and online imitation…
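The round-based refinement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Prediction` type, the `Predictor` interface, and `toy_predictor` are hypothetical names introduced here, and a plain Python callable stands in for the vision language model. The sketch shows only the control flow, in which each round's detection outcome and reasoning are predicted conditioned on the outputs of earlier rounds.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Prediction:
    """One round's output: a detection outcome plus natural-language reasoning."""
    failure: bool
    reasoning: str


# Hypothetical model interface (an assumption, not the paper's API):
# (observation, past predictions) -> new prediction.
Predictor = Callable[[str, Sequence[Prediction]], Prediction]


def refine(predict: Predictor, observation: str, rounds: int) -> List[Prediction]:
    """Run `rounds` refinement steps, feeding all past outputs back to the model."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    history: List[Prediction] = []
    for _ in range(rounds):
        # Each call is conditioned on every prediction made so far.
        history.append(predict(observation, tuple(history)))
    return history


def toy_predictor(observation: str, history: Sequence[Prediction]) -> Prediction:
    """Stand-in for a VLM: guesses 'no failure' first, then re-checks the input."""
    if not history:
        return Prediction(failure=False, reasoning="initial guess: no failure seen")
    failed = "dropped" in observation
    return Prediction(
        failure=failed,
        reasoning=f"round {len(history) + 1}: revised after reviewing earlier output",
    )


if __name__ == "__main__":
    for step in refine(toy_predictor, "gripper dropped the cube", rounds=3):
        print(step)
```

In this toy setup the first round's prediction is revised once the model sees its own earlier output. The number of rounds is a free parameter here; how ARMOR chooses it adaptively is not covered by this sketch.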
Peer Reviews
Decision: ICLR 2026 Poster
1. ARMOR reframes robotic failure detection and reasoning as a multi-round, multi-task self-refinement problem. Instead of predicting both tasks jointly in one forward pass, it trains the model to condition on its own past predictions (for both detection and reasoning) and iteratively improve them, which I believe is an interesting approach. 2. The proposed refinement framework is not limited to failure detection. Its iterative conditioning mechanism and multi-task training setup could naturally
1. All reported numbers appear to be single-run results, with no mention of multiple seeds, error bars, or statistical significance. Given that inference involves stochastic sampling of trajectories (M=3), performance could vary substantially between runs. Without variance or confidence intervals, large reported gains (up to +100% in reasoning) cannot be verified as statistically meaningful. 2. The paper only compares to generative VLMs (Qwen2.5-VL, Cosmos-Reasoning, LLaVA-NeXT, Claude-3.7). It omits
- The motivation is clear and the addressed issue is very relevant for robotics. - The multi-task training and conditional supervision on type of available ground truth is well justified for the robotics domain.
The clarity and methodology of the proposed method present several concerns: - The second stage of offline imitation seems to just reproduce the textual input. If I understand correctly, the model is provided with the ground truth labels in textual format and then learns to reproduce these outputs. This could possibly lead to the model ignoring the visual modality, making the initial training stage obsolete. - The ablation “Multitask Prediction” also indicates this. The model almost perform
- The paper addresses an under-explored but crucial problem, robotic failure detection and reasoning. - The proposed multi-round adaptive refinement allows the model to iteratively improve its predictions, similar to human introspection, and provides more coherent reasoning explanations. - The authors performed extensive experiments on four diverse robotic datasets (RLBench, ManiSkill, Sparrow, ARMBench), showing clear and consistent gains over strong baselines in both detection accuracy and
- Does the multi-round refinement introduce additional inference cost proportional to the number of rounds? It would be helpful to include latency or runtime statistics. In real-time robotic applications, providing timely feedback can be challenging without interrupting task execution. Similarly, the dependency on large pre-trained language models raises deployment challenges for real-time execution. - For the failure detection of these safety-critical tasks, using these large mode
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
