Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing

Muhammad Haris Khan; Selamawit Asfaw; Dmitrii Iarchuk; Miguel Altamirano Cabrera; Luis Moreno; Issatay Tokmurziyev; Dzmitry Tsetserukou

arXiv:2501.06919·cs.RO·January 14, 2025

TL;DR

Shake-VLA is a vision-language-action system that enables bimanual robots to automate cocktail making by integrating perception, language understanding, and precise manipulation for reliable liquid mixing.

Contribution

The paper presents Shake-VLA, a novel integrated system combining vision, language, and force sensing for robotic cocktail preparation, with high success rates in real-world experiments.

Findings

01. 93% speech-to-text accuracy in noisy environments

02. 91% object and label detection success rate

03. 95% anomaly detection accuracy

Abstract

This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model for generating task-specific robotic instructions. Force-torque (FT) sensors measure the quantity of liquid poured, keeping ingredient proportions accurate during mixing. The system architecture includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism to address ingredient availability issues, and bimanual robotic arms for dexterous manipulation. Experimental evaluations demonstrated a high success rate across system components, with the speech-to-text…
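The abstract states that FT sensors measure how much liquid is poured, but it does not say how force is converted to volume. The following is a minimal sketch of one plausible approach, not the authors' implementation. It assumes the sensor reports the vertical force (in newtons) supporting the receiving vessel, and that the liquid density is known. The function names, the default density of 1.0 g/mL, and the stop rule are all hypothetical.

```python
# Illustrative sketch only: names, units, and the stop rule are assumptions,
# not taken from the Shake-VLA paper.

G = 9.80665  # standard gravity in m/s^2


def poured_volume_ml(force_before_n, force_now_n, density_g_per_ml=1.0):
    """Estimate poured volume from the change in vertical force.

    The added weight is converted to mass (grams) and then divided by
    the assumed liquid density to obtain millilitres.
    """
    delta_force_n = force_now_n - force_before_n
    mass_g = delta_force_n / G * 1000.0
    return mass_g / density_g_per_ml


def should_stop_pouring(force_before_n, force_now_n, target_ml,
                        density_g_per_ml=1.0):
    """Return True once the estimated poured volume reaches the target."""
    poured = poured_volume_ml(force_before_n, force_now_n, density_g_per_ml)
    return poured >= target_ml
```

In a real system the force signal would likely need filtering and a correction for flow still in the air when the pour stops; this sketch ignores both.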
