TL;DR
SOMA enhances vision-language-action models' robustness by integrating memory, causal attribution, and in-context adaptation, significantly improving success rates in out-of-distribution robotic tasks without fine-tuning.
Contribution
SOMA combines memory-augmented retrieval, LLM orchestration, and intervention protocols to make frozen VLA models more robust, with no parameter fine-tuning.
Findings
Achieves an average success rate gain of 56.6% across tested models.
Improves long-horizon task chaining success by 89.1%.
Demonstrates effectiveness on LIBERO-PRO and LIBERO-SOMA benchmarks.
Abstract
Despite the promise of Vision-Language-Action (VLA) models as generalist robotic controllers, their robustness against perceptual noise and environmental variations in out-of-distribution (OOD) tasks remains fundamentally limited by the absence of long-term memory, causal failure attribution, and dynamic intervention capability. To address this, we propose SOMA, a Strategic Orchestration and Memory-Augmented System that upgrades frozen VLA policies for robust in-context adaptation without parameter fine-tuning. Specifically, SOMA operates through an online pipeline of contrastive Dual-Memory Retrieval-Augmented Generation (RAG), an Attribution-Driven Large-Language-Model (LLM) Orchestrator, and extensible Model Context Protocol (MCP) interventions, while an offline Memory Consolidation module continuously distills the execution traces into reliable priors. Experimental evaluations…
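The contrastive dual-memory retrieval step named in the abstract can be illustrated with a small sketch. This is not SOMA's implementation: the abstract does not say how memories are encoded, scored, or formatted, so the `Trace` record, the cosine-similarity ranking, the two-dimensional embeddings, and the prompt layout below are all assumptions chosen only to show the general idea of retrieving similar successes and similar failures side by side for an LLM orchestrator.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Trace:
    """Hypothetical memory record: an episode embedding plus a short note."""
    embedding: list[float]
    note: str


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(memory: list[Trace], query: list[float], k: int = 1) -> list[Trace]:
    """Return the k traces whose embeddings are most similar to the query."""
    ranked = sorted(memory, key=lambda t: cosine(t.embedding, query), reverse=True)
    return ranked[:k]


def build_context(successes: list[Trace], failures: list[Trace],
                  query: list[float], k: int = 1) -> str:
    """Pair the nearest successes with the nearest failures (assumed prompt layout)."""
    lines = ["Similar successes:"]
    lines += [f"- {t.note}" for t in top_k(successes, query, k)]
    lines += ["Similar failures:"]
    lines += [f"- {t.note}" for t in top_k(failures, query, k)]
    return "\n".join(lines)


# Toy memories with made-up 2-D embeddings and notes (illustrative data only).
successes = [
    Trace([1.0, 0.0], "grasped mug after re-centering camera view"),
    Trace([0.0, 1.0], "opened drawer with slower approach"),
]
failures = [
    Trace([0.9, 0.1], "missed mug under strong glare"),
    Trace([0.1, 0.9], "drawer handle occluded by arm"),
]

context = build_context(successes, failures, query=[1.0, 0.05])
print(context)
```

In a setup like this, the returned text would be handed to the orchestrator as in-context evidence while the VLA policy itself stays frozen; how SOMA actually attributes failures and selects MCP interventions is not described in the text available here.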
