Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing
Muhammad Haris Khan, Selamawit Asfaw, Dmitrii Iarchuk, Miguel Altamirano Cabrera, Luis Moreno, Issatay Tokmurziyev, Dzmitry Tsetserukou

TL;DR
Shake-VLA is a comprehensive vision-language-action system that enables bimanual robots to automate cocktail making, integrating perception, language understanding, and precise manipulation for reliable liquid mixing.
Contribution
The paper presents Shake-VLA, a novel integrated system combining vision, language, and force sensing for robotic cocktail preparation, with high success rates in real-world experiments.
Findings
93% speech-to-text accuracy in noisy environments
91% object and label detection success rate
95% anomaly detection accuracy
Abstract
This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model to generate task-specific robotic instructions. Force Torque (FT) sensors are employed to precisely measure the quantity of liquid poured, ensuring accuracy in ingredient proportions during the mixing process. The system architecture includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism to address ingredient availability issues, and bimanual robotic arms for dexterous manipulation. Experimental evaluations demonstrated a high success rate across system components, with the speech-to-text…
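The abstract mentions two checks that are easy to illustrate: estimating the poured quantity from force readings, and flagging ingredients that a recipe needs but the vision module did not find. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the FT sensor reports the vertical force (in newtons) on the arm holding the bottle, that the liquid density is known, and that ingredient names can be matched by case-insensitive string comparison. The function names `poured_volume_ml` and `missing_ingredients` are hypothetical.

```python
G = 9.81  # standard gravity in m/s^2


def poured_volume_ml(force_before_n, force_after_n, density_kg_per_l=1.0):
    """Estimate poured volume from the drop in vertical force on the bottle arm.

    Assumption: the force difference is due only to liquid leaving the bottle.
    """
    delta_mass_kg = (force_before_n - force_after_n) / G
    # Clamp at zero so sensor noise cannot yield a negative volume.
    return max(delta_mass_kg, 0.0) / density_kg_per_l * 1000.0


def missing_ingredients(recipe, detected_labels):
    """Return recipe ingredients with no matching detected label.

    Assumption: exact match after trimming whitespace and lowercasing.
    """
    detected = {label.strip().lower() for label in detected_labels}
    return [name for name in recipe if name.strip().lower() not in detected]


if __name__ == "__main__":
    # A 0.4905 N drop corresponds to 0.05 kg, about 50 ml of a water-like liquid.
    print(round(poured_volume_ml(9.81, 9.3195), 1))
    print(missing_ingredients(["Vodka", "Lime juice"], ["vodka ", "Cola"]))
```

In a real system the force signal would need filtering and tare compensation, and label matching would likely be fuzzier than exact string comparison. Both are left out here to keep the sketch short.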
