Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
Xiucheng Zhang, Yang Jiang, Hongwei Qing, Jiashuo Bai

TL;DR
This paper introduces a framework combining language-conditioned visual representations and mixture-of-experts policies to improve robustness and efficiency in multi-task robotic manipulation, addressing perceptual ambiguity and task conflict.
Contribution
It presents a novel integration of language grounding with a sparse expert policy architecture for multi-task robotic manipulation, outperforming existing baselines.
Findings
LCVR boosts the success rate of Action Chunking with Transformers (ACT) by 33.75%
LCVR increases Diffusion Policy (DP) success by 25%
The full framework achieves a 79% average success rate, surpassing the advanced baseline by 21%
Abstract
Perceptual ambiguity and task conflict limit multi-task robotic manipulation via imitation learning. We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP). LCVR resolves perceptual ambiguities by grounding visual features with language instructions, enabling differentiation between visually similar tasks. To mitigate task conflict, LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal action distributions, stabilized by gradient modulation. On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. The full framework achieves a 79% average success, outperforming the advanced baseline by 21%. Our work shows that combining semantic grounding and expert specialization…
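To make the "sparse expert architecture" idea concrete, here is a minimal sketch of language-conditioned top-k expert routing in plain Python. It is an illustration only and not the authors' LMoE-DP: the abstract does not specify the gating function, the expert form, or the gradient modulation scheme. The names `top_k_route`, `moe_forward`, `W_gate`, and `experts`, the linear experts, and all dimensions are hypothetical, and the sketch outputs a single deterministic action rather than a multimodal action distribution.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_route(lang_emb, W_gate, k):
    """Score every expert from the language embedding and keep the top k.

    Returns (expert_indices, weights); the weights are a softmax over the
    k selected logits only, so they sum to 1.
    """
    logits = matvec(W_gate, lang_emb)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return top, weights


def moe_forward(obs_feat, lang_emb, W_gate, experts, k=2):
    """Run only the k selected experts and mix their outputs.

    Each expert is a hypothetical linear map (act_dim x feat_dim matrix);
    unselected experts are never evaluated, which is what makes it sparse.
    """
    idx, weights = top_k_route(lang_emb, W_gate, k)
    out = [0.0] * len(experts[0])
    for i, w in zip(idx, weights):
        y = matvec(experts[i], obs_feat)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, idx, weights


# Toy setup with made-up sizes and random parameters (assumptions).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


NUM_EXPERTS, LANG_DIM, FEAT_DIM, ACT_DIM = 4, 8, 6, 3
W_gate = rand_matrix(NUM_EXPERTS, LANG_DIM)
experts = [rand_matrix(ACT_DIM, FEAT_DIM) for _ in range(NUM_EXPERTS)]
lang_emb = [rng.uniform(-1.0, 1.0) for _ in range(LANG_DIM)]
obs_feat = [rng.uniform(-1.0, 1.0) for _ in range(FEAT_DIM)]

action, chosen, weights = moe_forward(obs_feat, lang_emb, W_gate, experts, k=2)
print("selected experts:", chosen)
print("mixing weights:", weights)
```

Because routing in this sketch depends only on the instruction embedding, two tasks with different instructions can be sent to different experts even when their observations look alike, which is the intuition behind using expert specialization to reduce task conflict.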
