MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation
Sonia Raychaudhuri, Enrico Cancelli, Tommaso Campari, Lamberto Ballan, Manolis Savva, Angel X. Chang

TL;DR
This paper introduces LangNav, a new dataset for evaluating language understanding in semantic navigation, and proposes MLFM, a multi-layered feature map method that improves zero-shot navigation performance by reasoning over detailed attributes and relations.
Contribution
The paper presents LangNav, a comprehensive dataset with fine-grained annotations, and introduces MLFM, a novel multi-layered feature map approach for enhanced language grounding in navigation tasks.
Findings
MLFM outperforms existing zero-shot navigation baselines.
LangNav enables systematic evaluation of language grounding.
MLFM effectively reasons over attributes and spatial relations.
Abstract
Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g., 'go to the red short candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These labels enable systematic evaluation of language understanding. To evaluate agents in this setting, we extend the multi-object navigation task to Language-guided Multi-Object Navigation (LaMoN), where the agent must find a sequence of goals specified in natural language.…
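To make the annotation scheme concrete, the sketch below shows one way a goal description and its linguistic tags could be stored, and how success could be broken down per tag. It is an illustration only: the `GoalAnnotation` class, its field names, the `success_by_tag` helper, the second goal, and the success/failure outcomes are all hypothetical and are not taken from the LangNav release. Only the candle example and its tags come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class GoalAnnotation:
    """Hypothetical record for one language goal (field names are assumptions)."""
    description: str
    attributes: dict[str, str] = field(default_factory=dict)
    relations: dict[str, str] = field(default_factory=dict)

    def tags(self) -> list[str]:
        # Every attribute or relation type mentioned in the description.
        return sorted(list(self.attributes) + list(self.relations))


def success_by_tag(results):
    """Group (goal, success) pairs by linguistic tag and average the successes."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [successes, attempts]
    for goal, success in results:
        for tag in goal.tags():
            totals[tag][0] += int(success)
            totals[tag][1] += 1
    return {tag: wins / count for tag, (wins, count) in totals.items()}


# The example from the abstract.
candle = GoalAnnotation(
    description="go to the red short candle on the table",
    attributes={"color": "red", "size": "short"},
    relations={"support": "on"},
)
# A made-up second goal, included only so the breakdown is not trivial.
chair = GoalAnnotation(
    description="go to the blue chair",
    attributes={"color": "blue"},
)

# Made-up outcomes: the agent reached the candle but not the chair.
rates = success_by_tag([(candle, True), (chair, False)])
print(rates)
```

Grouping outcomes by tag in this way is what lets an evaluation report separate scores for, say, color understanding versus support relations, rather than a single aggregate success rate.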
Peer Reviews
Decision · Submitted to ICLR 2026
1. The proposed LaMoN benchmark is well-motivated with clearly presented details. Since most zero-shot object navigation (ZSON) approaches are still evaluated and compared on limited object categories and coarse-grained descriptions, this benchmark offers a superior alternative. 2. The paper presents a novel modular approach, MLFM, which is generalizable to different object navigation benchmarks including LaMoN and GOAT, and achieves competitive results on both. 3. The paper conducts detailed fa
1. The paper lacks essential comparisons with state-of-the-art navigation foundation models (e.g., NaVid [1], NaVILA [2], StreamVLN [3]) that can handle general navigation tasks. 2. The contributions of the proposed LaMoN benchmark, and its differences from recent object navigation benchmarks (e.g., DOZE [4]), are not clearly justified. 3. The performance of MLFM in real-world scenarios remains unclear.
1. Manual validation eliminates VLM hallucinations (a major flaw of GOAT-Bench), and fine-grained tags enable evaluation of language understanding along separate dimensions. 2. By requiring sequential language-specified goals without path instructions, LaMoN’s task better simulates real human-robot interaction than step-by-step VLN tasks. 3. The multi-layer map avoids both the high memory cost of full 3D maps and the loss of height information in 2D maps.
1. Dataset limitations undermine generalizability: (1) Scenes are restricted to synthetic HSSD and semi-real GOAT-Bench. The paper does not test on physical environments (e.g., real apartments), where lighting and occlusions differ sharply from synthetic data. (2) The language is overly simple: it has no complex structures (coreference, negation) or action directives (e.g., "pick up"), making it irrelevant to real-world tasks like home service. 2. MLFM’s attribute understanding is incomplete: (1) Texture attr
Overall, the problem setting is both realistic and practical, as it aligns closely with how users naturally specify sequential goals through language instructions.
In this setting, the task involves finding three sequential goals. I am curious how this number was determined: have you experimented with different numbers of goals or tested the model’s performance under varying sequence lengths? What is the intuition behind choosing three? Additionally, it would be beneficial to provide a training dataset to support the research community and enable more comprehensive exploration of this task.
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation
