Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations
Koffivi Fidèle Gbagbe, Miguel Altamirano Cabrera, Ali Alabbas, Oussama Alyunes, Artem Lykov, and Dzmitry Tsetserukou

TL;DR
This paper presents Bi-VLA, a comprehensive vision-language-action system enabling bimanual robots to understand complex instructions, perceive visual scenes, and perform dexterous household tasks with high accuracy and adaptability.
Contribution
The novel Bi-VLA system integrates vision, language, and action modules for robotic manipulation, demonstrating effective real-world task execution and high success rates.
Findings
100% success in generating correct executable code
96.06% accuracy in ingredient detection
83.4% overall task success rate
Abstract
This research introduces the Bi-VLA (Vision-Language-Action) model, a novel system designed for bimanual robotic dexterous manipulation that seamlessly integrates vision for scene understanding, language comprehension for translating human instructions into executable code, and physical action generation. We evaluated the system's functionality through a series of household tasks, including the preparation of a desired salad upon human request. Bi-VLA demonstrates the ability to interpret complex human instructions, perceive and understand the visual context of ingredients, and execute precise bimanual actions to prepare the requested salad. We assessed the system's performance in terms of accuracy, efficiency, and adaptability to different salad recipes and human preferences through a series of experiments. Our results show a 100% success rate in generating the correct executable code…
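The three stages named in the abstract (scene perception, instruction-to-code translation, bimanual action) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the recipe table, the `Step` type, the keyword-matching stand-in for the language model, and the rule that alternates arms between ingredients are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical recipe table; the paper's actual recipes are not reproduced here.
RECIPES = {
    "greek salad": ["tomato", "cucumber", "olive"],
    "fruit salad": ["apple", "banana"],
}


@dataclass(frozen=True)
class Step:
    arm: str     # "left" or "right"
    action: str  # "pick" or "place"
    target: str  # ingredient name or "bowl"


def find_recipe(instruction):
    """Language stage stand-in: a keyword match instead of a language model."""
    text = instruction.lower()
    for name, ingredients in RECIPES.items():
        if name in text:
            return ingredients
    raise ValueError("no known recipe in instruction")


def plan(instruction, visible):
    """Combine the instruction with detected ingredients into an action list.

    `visible` stands in for the output of the vision stage: the set of
    ingredient names detected in the scene.
    """
    needed = find_recipe(instruction)
    missing = [item for item in needed if item not in visible]
    if missing:
        raise LookupError("ingredients not detected: " + ", ".join(missing))
    steps = []
    for index, ingredient in enumerate(needed):
        # Assumed policy: alternate arms so both manipulators are used.
        arm = "left" if index % 2 == 0 else "right"
        steps.append(Step(arm, "pick", ingredient))
        steps.append(Step(arm, "place", "bowl"))
    return steps
```

Calling `plan("Please make a Greek salad", {"tomato", "cucumber", "olive"})` yields a pick-and-place pair per ingredient, while a scene that lacks a required ingredient raises `LookupError` before any motion would be attempted.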
Taxonomy
Topics: Robot Manipulation and Learning
