NavTrust: Benchmarking Trustworthiness for Embodied Navigation
Huaide Jiang, Yash Chaudhary, Yuping Wang, Zehao Wang, Raghav Sharma, Manan Mehta, Yang Zhou, Lichao Sun, Zhiwen Fan, Zhengzhong Tu, Jiachen Li

TL;DR
NavTrust introduces a comprehensive benchmark for evaluating embodied navigation systems under realistic input corruptions, revealing robustness gaps and testing mitigation strategies to improve real-world reliability.
Contribution
This work presents NavTrust, the first unified benchmark that systematically tests embodied navigation models against diverse corruptions of their RGB, depth, and instruction inputs.
Findings
Significant performance drops under input corruptions highlight robustness issues.
Mitigation strategies can improve model robustness in corrupted scenarios.
Real-world deployment shows enhanced robustness with proposed methods.
Abstract
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions, and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates the impact of these corruptions on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance…
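The idea of corrupting each input modality at a controlled severity can be sketched as follows. This is a minimal illustration and not the NavTrust implementation: the function names, the observation dictionary layout, the 0 to 5 severity scale, and the three specific corruptions (Gaussian pixel noise, random depth dropout, random word deletion) are all assumptions chosen for brevity.

```python
import random


def corrupt_rgb(image, sigma, rng):
    """Add zero-mean Gaussian noise to every channel value, clipped to [0, 255].

    `image` is a nested list [row][pixel][channel] of ints (hypothetical layout).
    """
    return [
        [[min(255, max(0, round(c + rng.gauss(0, sigma)))) for c in px] for px in row]
        for row in image
    ]


def corrupt_depth(depth, drop_prob, rng):
    """Zero out depth readings at random to mimic missing sensor returns."""
    return [[0.0 if rng.random() < drop_prob else d for d in row] for row in depth]


def corrupt_instruction(text, drop_prob, rng):
    """Delete words at random; keep the first word if everything was dropped."""
    words = text.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    return " ".join(kept if kept else words[:1])


def corrupt_observation(obs, severity, seed=0):
    """Apply all three corruptions at one severity level (assumed scale: 0 to 5).

    Severity 0 leaves the observation unchanged; the scaling constants below
    are illustrative, not values taken from the paper.
    """
    rng = random.Random(seed)  # seeded so that a corrupted episode is reproducible
    return {
        "rgb": corrupt_rgb(obs["rgb"], sigma=10.0 * severity, rng=rng),
        "depth": corrupt_depth(obs["depth"], drop_prob=0.1 * severity, rng=rng),
        "instruction": corrupt_instruction(
            obs["instruction"], drop_prob=0.1 * severity, rng=rng
        ),
    }


if __name__ == "__main__":
    obs = {
        "rgb": [[[0, 128, 255], [10, 20, 30]]],
        "depth": [[1.5, 2.0]],
        "instruction": "walk past the sofa and stop at the door",
    }
    print(corrupt_observation(obs, severity=3, seed=42))
```

A benchmark in this style would run the same agent on clean and corrupted observations and compare navigation metrics across severity levels; fixing the seed keeps the corrupted inputs identical between runs.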
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper establishes a unified and open-source benchmark that evaluates both VLN and OGN tasks. It jointly considers RGB, depth, and language modalities, offering a comprehensive view of multimodal robustness in embodied navigation.
2. Each modality includes a wide range of realistic corruptions (visual noise, depth sensor degradation, and linguistic perturbations) that better reflect the challenges encountered in real-world navigation scenarios.
3. Provides several mainstream robustness enh
1. The evaluation of depth corruptions focuses mainly on mapping-based methods. Including a broader range of approaches would provide a more comprehensive understanding of depth robustness.
2. In the mitigation stage, the analysis is conducted only on ETPNav, while comparisons across more models would strengthen the conclusions.
Novel Benchmark and Scope: The paper fills a clear gap in embodied AI evaluation by jointly assessing perceptual and linguistic robustness under a unified framework. Prior works such as RobustNav and EmbodiedBench handle only subsets of these aspects.
Comprehensive Corruption Suite: The authors design realistic and diverse corruptions across RGB, depth, and instruction modalities, covering noise, occlusions, adversarial prompts, and stylistic rephrasings.
Empirical Breadth: Evaluation spans si
Limited Theoretical Depth: While empirically comprehensive, the paper lacks a theoretical analysis of why certain models fail under specific corruption types (e.g., the deeper mechanisms linking architecture and robustness).
Benchmark Generality: The benchmark is limited to the Matterport3D-based environments and English instructions (R2R dataset). This may restrict its generalizability to other datasets or real-world robotic settings.
Evaluation of Mitigation Strategies: The four robustness s
The strengths of the paper can be summarized as:
1. The paper is easy to follow, and the presentation of the research results is clear and logically structured.
2. The work focuses on trustworthiness and robustness in embodied navigation, which is an underexplored yet critical area for real-world deployment of embodied systems.
3. The work systematically integrates RGB, depth, and language corruptions into a unified evaluation platform with publicly available evaluation protocols, which encourages
The weaknesses of the paper can be summarized as:
1. Insufficient novelty: The paper mainly builds upon existing datasets (or rather, simulators and environments) and evaluation settings (e.g., Matterport3D), and the technical innovations are relatively limited. Some parts that are overclaimed as contributions are in fact implementation details or engineering tricks rather than advances.
2. Limited analysis of corruptions: Although the paper implements a wide range of corruptions (e.g., motion blur,
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Reinforcement Learning in Robotics
