GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
Kun Wang, Yiming Li, Mingcheng Qu, Aqiang Zhang, Guang Yang, Tonghua Su

TL;DR
GaLa is a hypergraph-based multimodal framework for procedural planning in embodied AI. It explicitly models implicit spatial and semantic relations, and it outperforms existing methods on the ActPlan1K and ALFRED benchmarks.
Contribution
The paper presents GaLa, a novel hypergraph-guided visual language model that explicitly captures semantic relations for better procedural planning in complex scenes.
Findings
GaLa outperforms existing methods on the ActPlan1K and ALFRED benchmarks.
GaLa achieves a higher execution success rate, LCS score, and planning correctness than the compared methods.
The hypergraph encoder effectively injects semantic information into VLM reasoning.
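The LCS metric in the findings is not defined on this page. Assuming it stands for a longest-common-subsequence score between a predicted plan and a reference plan, a minimal sketch could look like the following. The function names and the normalization by the longer sequence are illustrative assumptions, not the benchmarks' official scoring code.

```python
from typing import Sequence


def lcs_length(pred: Sequence[str], ref: Sequence[str]) -> int:
    """Length of the longest common subsequence of two action sequences."""
    # dp[i][j] holds the LCS length of pred[:i] and ref[:j].
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, start=1):
        for j, r in enumerate(ref, start=1):
            if p == r:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(ref)]


def normalized_lcs(pred: Sequence[str], ref: Sequence[str]) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    if not pred and not ref:
        return 1.0
    return lcs_length(pred, ref) / max(len(pred), len(ref))


# Hypothetical plans, made up for illustration only.
predicted = ["open fridge", "take milk", "close fridge", "pour milk"]
reference = ["open fridge", "take milk", "pour milk"]
score = normalized_lcs(predicted, reference)
```

Under this normalization, an extra or missing step lowers the score even when the remaining steps appear in the right order.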
Abstract
Implicit spatial relations and deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. However, existing approaches often over-rely on the reasoning capabilities of vision-language models (VLMs) themselves, while overlooking the rich structured semantic information that can be mined from multimodal inputs. As a result, models struggle to effectively understand functional spatial relationships in complex scenes. To fully exploit implicit spatial relations and deep semantic structures in multimodal data, we propose GaLa, a vision-language framework for multimodal procedural planning. GaLa introduces a hypergraph-based representation, where object instances in the image are modeled as nodes, and region-level hyperedges are constructed by aggregating objects according to their attributes and functional semantics. This design…
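The hypergraph representation described in the abstract (objects as nodes, hyperedges that group objects sharing an attribute or function) can be illustrated with a small sketch. This is not the paper's implementation: the attribute names `region` and `function`, the helper names, and the example scene are all hypothetical, and the excerpt does not say how GaLa actually derives its hyperedges.

```python
from collections import defaultdict


def build_hyperedges(objects, keys=("region", "function")):
    """Group object ids into hyperedges keyed by (attribute name, value)."""
    edges = defaultdict(set)
    for obj in objects:
        for key in keys:
            value = obj.get(key)
            if value is not None:
                edges[(key, value)].add(obj["id"])
    return dict(edges)


def incidence_matrix(objects, edges):
    """Return node ids, sorted hyperedge keys, and a 0/1 node-by-edge matrix."""
    node_ids = [obj["id"] for obj in objects]
    edge_keys = sorted(edges)
    matrix = [
        [1 if node in edges[edge] else 0 for edge in edge_keys]
        for node in node_ids
    ]
    return node_ids, edge_keys, matrix


# Hypothetical detected objects with made-up attributes.
scene = [
    {"id": "knife", "region": "counter", "function": "cutting"},
    {"id": "cutting_board", "region": "counter", "function": "cutting"},
    {"id": "mug", "region": "counter", "function": "drinking"},
    {"id": "kettle", "region": "stove", "function": "heating"},
]
hyperedges = build_hyperedges(scene)
nodes, edge_keys, H = incidence_matrix(scene, hyperedges)
```

Unlike an ordinary graph edge, a hyperedge here can connect any number of objects at once, so one column of `H` covers every object on the counter.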
