Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation

Archit Sharma; Dharmendra Sharma; John Rebeiro; Peeyush Thakur; Narendra Dhar; Laxmidhar Behera

arXiv:2602.09940·cs.RO·February 11, 2026

TL;DR

This paper introduces Instruct2Act, a lightweight on-device system that converts natural-language instructions into robotic manipulation actions, running entirely on local hardware without relying on cloud services.

Contribution

It presents a two-stage pipeline that first parses an instruction into an ordered sequence of atomic sub-actions and then generates a control trajectory for each one, reporting 91.5% sub-action prediction accuracy and a 90% success rate on real-robot tasks on resource-constrained hardware.

Findings

01  91.5% sub-action prediction accuracy on the authors' custom dataset

02  90% success rate in real-robot tasks

03  Inference time under 3.8 seconds per sub-action

Abstract

Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction-to-actions module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire system runs on modest hardware with no cloud services. On our custom proprietary dataset, Instruct2Act attains 91.5% sub-action prediction accuracy…
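The two-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the keyword rule in `parse_instruction` is a hypothetical stand-in for the learned BiLSTM parser, the `scene` dictionary stands in for the YOLOv8 environment analyzer, and straight-line interpolation stands in for DATRN. All function names, the instruction pattern, and the coordinates are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class SubAction:
    name: str    # one of: reach, grasp, move, place (examples from the abstract)
    target: str  # object or location the sub-action refers to


def parse_instruction(instruction: str) -> List[SubAction]:
    """Stage 1 stand-in: a keyword rule, NOT the learned parser.

    Handles only instructions shaped like 'put the X on/in the Y'.
    """
    words = instruction.lower().replace(".", "").split()
    for prep in ("on", "in", "into", "onto"):
        if prep in words and words.index(prep) >= 1:
            split = words.index(prep)
            break
    else:
        raise ValueError("unsupported instruction pattern")
    obj, dest = words[split - 1], words[-1]
    return [
        SubAction("reach", obj),
        SubAction("grasp", obj),
        SubAction("move", dest),
        SubAction("place", dest),
    ]


def interpolate(start: Point, goal: Point, steps: int = 5) -> List[Point]:
    """Stage 2 stand-in: straight-line waypoints, NOT DATRN."""
    return [
        tuple(s + (g - s) * i / steps for s, g in zip(start, goal))
        for i in range(1, steps + 1)
    ]


def run_pipeline(
    instruction: str, scene: Dict[str, Point], home: Point
) -> List[Tuple[SubAction, List[Point]]]:
    """Parse the instruction, then attach waypoints to each sub-action."""
    position = home
    plan = []
    for sub in parse_instruction(instruction):
        if sub.name in ("reach", "move"):
            # Motion sub-actions travel to the target's position in the scene.
            waypoints = interpolate(position, scene[sub.target])
            position = waypoints[-1]
        else:
            # Grasp and place act at the current position without travel.
            waypoints = [position]
        plan.append((sub, waypoints))
    return plan


if __name__ == "__main__":
    # Hypothetical object positions, as a detector might report them.
    scene = {"block": (0.4, 0.0, 0.1), "tray": (0.0, 0.4, 0.1)}
    for sub, waypoints in run_pipeline(
        "Put the red block on the tray", scene, home=(0.0, 0.0, 0.3)
    ):
        print(sub.name, sub.target, len(waypoints))
```

The point of the sketch is the interface between the stages: stage one emits an ordered list of sub-actions, and stage two consumes them one at a time, which is why the reported inference time is stated per sub-action.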

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications